How Machine Learning Improves Data Cleansing
One of the many Holy Grails of machine learning within the spend analysis domain is the ability to disambiguate and classify customer purchases accurately, quickly, and automatically. It’s a fun problem to try to tackle since it’s approachable from many different angles.
At a very simple level, one could iterate through all item purchases and try to categorize each one by matching the words in the item name against the names of the available categories. As an example, a “spoon” can be mapped to the following UNSPSC categories, all of which have the word “spoon” in them:
- 41123402 – Dosing Spoon
- 42181512 – Typhoid Carrier Examination Spoons
- 42294000 – Surgical Spatulas and Spoons and Scoops and Related Products
- 42294003 – Surgical Spoons
- 42294519 – Ophthalmic Spoons or Curettes
- 52151617 – Domestic Wooden Spoon
- 52151651 – Domestic Measuring Spoon
- 52151704 – Domestic Spoons
If an automated system were to use this scheme, which category of “spoon” would it select? Hopefully there would be some context that could provide hints, such as the word “kitchen” in the item description or the supplier from which the spoon was purchased (say, “Staples”), but that’s an additional layer of complexity that one would have to account for (think lots of $$$).
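To make the ambiguity concrete, here is a minimal sketch of the naive word-matching scheme described above. It is only an illustration, not how Sourcing Force classifies items: the helper names `tokens` and `best_matches` are made up for this example, the plural handling is deliberately crude, and the only data is the eight UNSPSC titles listed above.

```python
# UNSPSC codes and titles copied from the list above.
UNSPSC = {
    "41123402": "Dosing Spoon",
    "42181512": "Typhoid Carrier Examination Spoons",
    "42294000": "Surgical Spatulas and Spoons and Scoops and Related Products",
    "42294003": "Surgical Spoons",
    "42294519": "Ophthalmic Spoons or Curettes",
    "52151617": "Domestic Wooden Spoon",
    "52151651": "Domestic Measuring Spoon",
    "52151704": "Domestic Spoons",
}


def tokens(text):
    """Lowercase, split on whitespace, and crudely strip a plural 's'."""
    return {
        w[:-1] if len(w) > 3 and w.endswith("s") else w
        for w in text.lower().split()
    }


def best_matches(item_description):
    """Return the codes whose titles share the most words with the item."""
    item = tokens(item_description)
    overlap = {code: len(item & tokens(title)) for code, title in UNSPSC.items()}
    top = max(overlap.values())
    if top == 0:
        return []  # no category title shares a word with the item
    return sorted(code for code, n in overlap.items() if n == top)


print(best_matches("spoon"))         # every "spoon" category ties
print(best_matches("wooden spoon"))  # one extra word of context breaks the tie
```

With the bare word “spoon” all eight categories tie, which is exactly the problem. A single extra word of context is enough to break the tie here, but real item descriptions are rarely that cooperative.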
Using Machine Learning to Classify
Sourcing Force is fortunate enough to have been in the business long enough to have developed a significant edge. Quite simply, we’ve classified a ton of items using custom created classification rules.
When researchers toy with machine learning algorithms such as Neural Networks (NN), Naive Bayes Classifiers (NBC), or Hidden Markov Models (HMM) for Word Sense Disambiguation, they frequently run into a huge roadblock: to apply these algorithms effectively, one needs training data to teach and tune them. Training data for some domains can be purchased, while for others it has to be painfully constructed by the researchers (or, more likely, their grad students). In other words, it’s not easy to come by.
Our hard-working Analysts have to date written hundreds of thousands of distinct classification rules that map item descriptions to category codes, and we’ve used those rules to classify items for a lot of companies.
These rules allow us to do a great job classifying items for our clients, but they are also an undeniable treasure trove of implicit semantic knowledge that can be used for algorithm training.
A great example that comes to mind of the “implicit” semantics that I refer to above can be seen in the problem of “how does one classify Tylenol?” There is no UNSPSC code for Tylenol but there is one for Tylenol’s chemical name: Acetaminophen.
The code is 51142001. Fortunately, I knew that important detail, and from it I can write a classification rule. Consider this: an algorithm trained on these Sourcing Force classification rules learns that mapping of Tylenol to 51142001 for free. Once upon a time, I wrote a classification rule for a company, and I later turned around and used it to train an algorithm.
Now that rule, to a degree, can help me classify forever.
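As a rough illustration of how rule-derived examples can become training data, here is a tiny Naive Bayes classifier written from scratch. Treat it as a sketch under loud assumptions: the item descriptions in `TRAINING` are invented for this example (only the UNSPSC codes come from this article), the `train` and `predict` helpers are hypothetical, and a production system would need far more data and care.

```python
import math
from collections import Counter, defaultdict

# Hypothetical item descriptions paired with UNSPSC codes, standing in for
# items that analyst-written rules have already classified.
# 51142001 = Acetaminophen, 52151704 = Domestic Spoons, 42294003 = Surgical Spoons
TRAINING = [
    ("tylenol extra strength caplets", "51142001"),
    ("tylenol 500 mg tablets", "51142001"),
    ("acetaminophen 325 mg tablets", "51142001"),
    ("stainless kitchen spoon set", "52151704"),
    ("kitchen soup spoons", "52151704"),
    ("surgical spoon sterile", "42294003"),
]


def train(examples):
    """Count how often each word appears under each category code."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in examples:
        class_counts[code] += 1
        for word in text.lower().split():
            word_counts[code][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the most probable code (multinomial Naive Bayes, add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_code, best_score = None, -math.inf
    for code, n in class_counts.items():
        score = math.log(n / total)  # log prior for this category
        denom = sum(word_counts[code].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[code][word] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code


model = train(TRAINING)
print(predict(model, "tylenol 500 mg caplets 100 ct"))
print(predict(model, "kitchen spoon"))
```

The point of the sketch is that nobody tells the classifier that Tylenol is acetaminophen. It picks up the association from examples that a human already classified, which is the “for free” part.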
Figuring out how to classify some items can be quite a nasty puzzle for a human, especially when it comes to chemicals, so having an Analyst work out the mapping for an obscure item becomes, in a sense, “a gift that keeps on giving.”
As an added benefit, the more obscure the item is, the more accurate algorithmic predictions tend to be. The reason is that unusual items show up in a rather limited range of contexts, so there isn’t much “noise” in the data to confuse an automated system.
One last point is needed to come full circle within the machine learning domain of spend analysis: human beings are still the masters here. Algorithmic approaches to spend analysis, albeit cool, cannot match the pattern recognition capabilities wired into the human brain, especially a Sourcing Force Analyst’s brain.
So far, machine learning approaches can only mirror what they’ve learned and repeat back the answers that have the highest probability of being correct within the limited context they know. The large number of rules that Sourcing Force has to play with broadens a machine’s perception of reality and gives it a rich context to learn from.
Even though I personally have been the one trying to mature software to do automatic classification, I must give credit where credit is due.
The “parents” of our little electronic child are the Sourcing Force Analysts, none of whom were harmed during the training of any algorithms.