A market basket analysis or recommendation engine [1] is what is behind all these recommendations we get when we go shopping online or whenever we receive targeted advertising. The underlying engine collects information about people’s habits and knows that if people buy pasta and wine, they are usually also interested in pasta sauces. So, the next time you go to the supermarket and buy pasta and wine, be ready to get a recommendation for some pasta sauce!
A typical analysis goal when applying market basket analysis it to produce a set of association rules in the following form: IF {pasta, wine, garlic} THEN pasta-sauce
The first part of the rule is called “antecedent”, the second part is called “consequent”. A few measures, such as support, confidence, and lift, define how reliable each rule is. The most famous algorithm generating these rules is the Apriori algorithm [2].
We have split this use case into two parts. First we will build the required association rules on a set of example transactions; second, we deploy the rule engine in a productive environment to generate recommendations for new basket data and/or new transactions.
The two workflows can be downloaded from 050_Applications/050016_MarketBasketAnalysis on the EXAMPLES server.
The data set used here was artificially generated with the workflow available in the EXAMPLES server under 007_ModularDataGeneration/007008ShoppingBasket. The main idea behind this workflow was to generate a set of Gaussian random products, a set of Gaussian random basket IDs, and to randomly match them together. After performing a few adjustments on the a priori probabilities, our artificial shopping basket data set is ready to use [3].
This data set consists of two KNIME tables: one containing the transaction data - i.e. the sequences of product IDs in imaginary baskets - and one containing product infos – i.e. name and price.
The central part in building a recommendation engine is the Association Rule Learner node, which implements the Apriori algorithm, in either the traditional [2] or the Borgelt [4] implementation. The Borgelt implementation offers a few performance improvements over the traditional algorithm. The output association rule set, however, remains the same.
Both Association Rule Learner nodes work on a collection of product IDs. A collection is a particular data cell type, assembling together data cells. There are many ways of producing a collection data cell from other data cells [5]: we decided to use the Cell Splitter node. The Cell Splitter node splits strings into smaller substrings according to a delimiter character. The output can assume multiple forms. One of these is a collection column (option “as set (remove duplicates)”). In our case, we split the original transaction string into many product ID substrings, using space as the delimiter character, and collect all product IDs into a collection column.
After running on a data set with past shopping basket examples, the Association Rule Learner node produces a number of rules. Each rule includes a collection of product IDs as antecedent, one product ID as consequent, and a few quality measures, such as support, confidence, and lift.
In Borgelt’s implementation of the Apriori algorithm, three support measures are available for each rule. If A is the antecedent and C is the consequent, then:
Item Set Support tells us how often antecedent A and consequent C are found together in an item set in the whole data set. However, the same antecedent can produce a number of different consequents. So, another measure of the rule quality is how often that antecedent A produces that consequent C among all possible consequents. This is the rule confidence.
Rule Confidence = support(A ∪ C) /support(A)
One more quality measure ‒ the rule lift ‒ tells us how much precise this rule is, compared to just the a priori probability of consequent C.
Rule Lift = Rule Confidence(A → C) / RuleConfidence(∅ → C)
∅ is the whole data set and support(∅) is the number of items/transactions in the data set.
You can make your association rule engine larger or smaller, restrictive or tolerant, by changing a few threshold values in the Association Rule Learner configuration settings, like the “minimum set size”, the “minimum rule confidence”, and the “minimum support” referring to the minimum Item Set Support value.
We also associated a potential revenue to each rule as:
revenue = consequent price x rule item set support
Based on this set of association rules, we can say that if a customer buys wine, pasta, and garlic, (antecedent) usually ‒ or as usually as support says ‒ they also buys pasta-sauce (consequent); we can trust this statement with the confidence percentage that comes with the rule.
After a bit of pre-processing, to add the real product names and prices to the plain IDs, the association rule engine with its antecedents and consequents is saved in the form of a KNIME table formatted file.
Let’s move away now from the dataset with past examples of shopping baskets and into real life. Customer X enters the shop and buys pasta and wine. Are there any other products we can recommend? If our recommendation engine performs its work decently, the customer will thank us for the advice rather than be annoyed by it. We can transform obnoxious advertisement into a win-win situation.
The second workflow prepared for this use case takes the real-life customer basket into consideration and looks for the closest antecedent among all antecedents in the association rule set. The central node of this workflow is the Subset Matcher node.
The Subset Matcher node takes two collection columns as input: the antecedents in the rule set (top input port) and the content of the current shopping basket (lower input port). It then matches the current basket item set with all possible subsets in the rule antecedent item sets. The output table contains pairs of matching cells: the current shopping basket and the totally or partially matching antecedents from the rule set.
By joining the matching antecedents with the rest of the corresponding association rule ‒ that is with consequent, support, confidence, and lift – we obtain the products that could be recommended to customer X, each one with its rule’s confidence, support, revenue, and lift. Only the top 2 consequents, in terms of highest item set support (renamed as rule support), highest confidence, and highest revenue, are retained.
Finally, a short report displays the total price for the current basket and two recommendations, from the two top consequents.
In our case, the recommended product is always lobster, associated once with cookies and once with shrimps. While shrimps and lobster are typically common sense advice, cookies and lobster seem to belong to a more hidden niche of food experts!