30 April 2021

Market Basket Analysis

When we say that someone who buys a box of cereal for breakfast is likely to also buy milk in the same purchase, it sounds obvious, since the two products are (apparently) strongly related. But what if there are hidden, non-obvious relationships between other products? Walmart, a North-American multinational, discovered some interesting and unexpected relationships between products in its stores:

  • Young adult men frequently bought beer and baby diapers in the same purchase [1];
  • Before a hurricane, it was common to see many purchases containing Strawberry Pop-Tarts in addition to batteries and other essential items [2];
  • Customers who bought Barbie dolls had a good chance of also buying one of three specific chocolate bars [3].

Discovering these relationships makes it possible to change a store's layout, placing the most frequently associated products closer to each other. These relationships also lead to better stock management. For example, a Walmart store could have in stock some chocolate bars (of the kind purchased together with Barbie dolls) approaching their expiration date. Since Walmart knows there is a relationship between Barbie dolls and these chocolate bars, it can run a promotional campaign in which buying a Barbie doll yields a discount on the chocolate bars.

It is also possible to restructure the way products are sold. For example, fast-food chains discovered early on that people who buy fast food tend to feel thirsty because of the salt in the food. As a result, fast-food chains started to create menus containing both food and drinks [2].

These relationships are also a good tool to assist online shopping. When buying a product from an online store like Amazon, we can see a “Frequently Bought Together” section, which presents the sets of products most frequently acquired in the same purchase [4].

Talking about product relationships naturally leads to a discussion of Association Rules, which are what Market Basket Analysis is built on.

 

Algorithms and Association Rules

Before explaining the algorithms and association rules, it is necessary to define some key concepts: transaction, basket and itemset. A transaction is a single item of a purchase. A basket is a complete purchase, composed of one or more transactions. An itemset is a set of products that is part of a basket (an itemset need not contain all the products of a basket).

·       Apriori Algorithm

To compute the association rules in Market Basket Analysis, the Apriori algorithm can be used. This algorithm finds frequent itemsets through a process called “candidate itemset generation”. It is an iterative process in which k-itemsets are used to explore (k+1)-itemsets, where k corresponds to the number of items in the itemset [8].

To sum up, the algorithm first finds the set of frequent 1-itemsets. Then it finds the set of frequent 2-itemsets, and so on, until no more frequent k-itemsets exist.

The Apriori algorithm has some disadvantages, such as being computationally expensive and taking a long time to run. This happens because the algorithm needs to iterate many times over the whole dataset to generate the itemsets.
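The iteration pattern described above can be sketched in a few lines of Python (the baskets and minimum support count below are hypothetical, for illustration only):

```python
from itertools import combinations

# Hypothetical baskets; a real run would scan a transactional dataset.
baskets = [{"milk", "bread"}, {"milk", "cereal"},
           {"milk", "bread", "cereal"}, {"bread", "cereal"}]
min_support_count = 2  # minimum number of baskets an itemset must appear in


def apriori(baskets, min_support_count):
    items = {item for basket in baskets for item in basket}
    # Frequent 1-itemsets
    frequent = {frozenset([i]) for i in items
                if sum(i in b for b in baskets) >= min_support_count}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidate k-itemsets built from the frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Each candidate check requires another pass over the whole dataset
        frequent = {c for c in candidates
                    if sum(c <= b for b in baskets) >= min_support_count}
        all_frequent |= frequent
        k += 1
    return all_frequent


print(sorted(tuple(sorted(s)) for s in apriori(baskets, min_support_count)))
```

Note how every iteration of the `while` loop implies a full scan of `baskets`, which is exactly what makes Apriori expensive on large datasets.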

·       FP-Growth Algorithm

To overcome the disadvantages of the Apriori algorithm mentioned above, an improved algorithm called FP-Growth was created, where “FP” stands for “Frequent Pattern” [5].

Instead of iterating many times over the whole dataset like the Apriori algorithm, FP-Growth iterates only twice and uses a tree structure (the FP-Tree) to store the information. Each tree node represents an item and the number of baskets whose path passes through that node. Each connection between nodes represents an itemset.

For a better understanding of how the algorithm works, each phase is illustrated below. The following table presents a dataset in which each line represents a basket with its items.

 

Table 1 – Example of a dataset

 

First, the algorithm iterates over the dataset and counts the absolute frequency of each item. The items are then sorted by frequency (support count). When items have the same frequency and are identified by a textual description, they are ordered alphabetically. The table below presents the output of this phase.

 

Table 2 – Support count of the items contained in the dataset

 

Then a tree is built, starting from the root. The root does not contain any element (it is null). Items are added in descending order of support count, while also considering the content of each basket. The resulting tree is presented below.

 

Figure 1 – Tree that relates the different products of baskets
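The two passes described above (counting, then ordered insertion) can be sketched as follows; the baskets here are hypothetical and unrelated to the article's example table:

```python
from collections import Counter

# Hypothetical baskets used only to illustrate the FP-Tree construction.
baskets = [["milk", "bread", "butter"], ["milk", "bread"],
           ["milk", "jam"], ["bread", "butter"]]

# Pass 1: support count of each item, sorted descending (ties alphabetical).
counts = Counter(item for basket in baskets for item in basket)
order = sorted(counts, key=lambda i: (-counts[i], i))


class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}


# Pass 2: insert each basket, with its items in descending support-count order.
root = Node(None)  # the root is null, holding no element
for basket in baskets:
    node = root
    for item in sorted(set(basket), key=order.index):
        node = node.children.setdefault(item, Node(item))
        node.count += 1


def show(node, depth=0):
    """Print the tree, one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


show(root)
```

Each shared prefix of the ordered baskets is stored only once, with the node counters recording how many baskets pass through it.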

 

The last step is to build a conditional pattern table and perform frequent pattern generation for each item. This table is built considering the support count in ascending order, i.e., the first item will be item G, the second item will be item D, and so on.

The first column, the Conditional Pattern Base of a given item, is obtained from the tree above by discovering the paths from the root to the considered item, following the connections between nodes. In the case of item D, the possible paths are C-A-D and C-A-F-D, represented by {A, C:1} and {A, C, F:1} respectively. Item D is counted once in each of the paths.

The second column is the Conditional FP-Tree, obtained by summing the counts of each item appearing in the discovered paths. In the case of item D, the Conditional FP-Tree is obtained by joining <A:2>, <C:2> and <F:1>. However, to obtain the strongest Association Rules, it is possible to define a minimum support count value, chosen by the analyst. In our example, this value is 2. Thus, the Conditional FP-Tree for item D is <A:2, C:2>.

For the third column, Frequent Pattern Generation, all possible combinations between item D and the other items present in its Conditional FP-Tree are generated. In this case, item D is joined with each of the other items (items A and C), and finally the set joining all three items is obtained.

 

Table 3 – Conditional Patterns Table

 

The table above does not contain Frequent Patterns for item G, because none of the items in its conditional tree fulfilled the requirement of having a count greater than or equal to the minimum support count. We can also note that the items directly connected to the tree root are not considered in the Frequent Pattern Generation process, because there is no intermediary node between the root and the nodes connected to it.

Thus, the Frequent Patterns found for sets of 2 items are {A, D:2}, {C, D:2}, {A, F:2}, {C, F:2}, {B, E:2} and {A, C:3}. For sets of 3 items, the Frequent Patterns are {A, C, D:2} and {A, C, F:2}.

It is important to note that the Frequent Patterns do not need to be generated from a single item only. Although we did so in our example, the Conditional Patterns table could also be built considering, for example, the combinations between items G and F, or D and F, by looking at the tree structure.

The rules are obtained from the Generated Frequent Patterns. From the previous example, for item D, the obtained rules are A → D, D → A, C → D, D → C, (A,C) → D, (A,D) → C and (C,D) → A.
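As a generic sketch (not PySpark's internal implementation), the candidate rules can be enumerated by splitting each frequent itemset into every possible antecedent/consequent pair; weak candidates are then filtered out by the evaluation metrics:

```python
from itertools import combinations


def candidate_rules(itemset):
    """Enumerate all antecedent -> consequent splits of a frequent itemset."""
    items = set(itemset)
    rules = []
    for r in range(1, len(items)):  # antecedent sizes 1 .. len-1
        for antecedent in combinations(sorted(items), r):
            consequent = items - set(antecedent)
            rules.append((frozenset(antecedent), frozenset(consequent)))
    return rules


# Rules derived from the frequent patterns found for item D in the walkthrough
for itemset in [{"A", "D"}, {"C", "D"}, {"A", "C", "D"}]:
    for antecedent, consequent in candidate_rules(itemset):
        print(sorted(antecedent), "->", sorted(consequent))
```

A frequent k-itemset yields 2^k − 2 candidate splits; only those passing the minimum support and confidence thresholds are kept as rules.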

The strength of these rules is evaluated by some metrics that are explained in the next section.

 

·       Metrics to evaluate the Association Rules

 

The three most used metrics to evaluate the strength of Association Rules are support (not to be confused with support count), confidence and lift [6].

An Association Rule has the form:

Antecedent → Consequent

Therefore, let X and Y be two items and let X → Y be the established rule between them. The considered metrics, for this rule, are computed as follows:

Support(X → Y) = frq(X, Y) / N

Confidence(X → Y) = frq(X, Y) / frq(X)

Lift(X → Y) = Confidence(X → Y) / (frq(Y) / N)

where frq(X, Y) represents the number of baskets that contain items X and Y at the same time, and N represents the total number of baskets. In turn, frq(X) and frq(Y) represent the number of baskets that contain item X and item Y individually. These frq(X, Y), frq(X) and frq(Y) counts are a representation of the Frequent Patterns.

Not all rules are really important or significant. Thus, in order to save time and computational resources, it is necessary to establish minimum values for the evaluation metrics, usually for support and confidence. However, there is no well-established way to choose these minimum values.

If a rule has a low support value, there is not enough information about the relationship between the items that compose it. Confidence, in turn, represents a conditional probability, and high values are typically preferred. However, high confidence does not necessarily mean a strong Association Rule. That is where the lift metric comes in: it indicates how significant and strong a rule is. Lift is the increase in the probability of having Y given X, over the support of Y. For a rule to be considered strong, its lift value should be much higher than 1. A lift value lower than 1 means that having X in a basket does not increase the likelihood of Y occurring, even if the rule has a high confidence.
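As a minimal sketch, the three metrics can be computed directly from basket counts (the toy baskets below are hypothetical; PySpark computes the same quantities internally):

```python
# Hypothetical baskets used only to illustrate the metric definitions.
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"milk"}, {"bread", "butter"}, {"milk", "butter"}]
N = len(baskets)


def frq(*items):
    """Number of baskets that contain all the given items."""
    return sum(set(items) <= b for b in baskets)


def support(x, y):
    return frq(x, y) / N


def confidence(x, y):
    # Conditional probability of Y given X
    return frq(x, y) / frq(x)


def lift(x, y):
    # Confidence of X -> Y divided by the support of Y
    return confidence(x, y) / (frq(y) / N)


print(support("milk", "bread"), confidence("milk", "bread"), lift("milk", "bread"))
```

Here a lift below 1 would indicate that "milk" in a basket does not make "bread" more likely than it already is overall.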

 

Use Case

To demonstrate the use of this technique, a small example applying Market Basket Analysis with the FP-Growth algorithm was developed in a Databricks Workspace, using the dataset available at: https://www.kaggle.com/psparks/instacart-market-basket-analysis?select=order_products__train.csv.

Some Python packages contain the FP-Growth algorithm, and there are also approaches describing how to implement the Apriori algorithm from scratch. The FP-Growth algorithm can be found in the PySpark package, at the following link: https://spark.apache.org/docs/2.3.0/ml-frequent-pattern-mining.html.

To start, a Spark DataFrame was built in which each line represents a transaction of one item of a basket: “order_id” is the unique identifier of each basket and “product_name” is the name of the product in the basket. The code below shows how the two files were joined [7]. Figure 2 presents the resulting table.

from pyspark.sql import functions as F
from pyspark.ml.fpm import FPGrowth

# header=True is needed so the CSV column names (e.g. "product_id") are read
productsDF = spark.read.csv("products.csv", header=True)

ordersDF = spark.read.csv("order_products__train.csv", header=True)

basketdata = ordersDF.join(productsDF, on=["product_id"], how="inner").select("order_id", "product_name")

 

Figure 2 – Part of resultant table

 

The goal is to keep only a list of unique products in each basket. For that reason, all duplicate rows were removed from the table, which was then transformed so that one column holds the unique identifier of a basket and another column holds the list of all unique products that make up that basket. Now, each row represents a basket. The code below shows how this process was performed [7]. Figure 3 shows the resulting table.

basketdata = basketdata.dropDuplicates(["order_id", "product_name"]).sort("order_id")

basketdata = basketdata.groupBy("order_id").agg(F.collect_list("product_name")).sort("order_id")

 

Figure 3 – Part of the resulting table, in which each line is a basket with the list of unique products that make up that basket

 

Then, the minimum values for support and confidence were defined, to make a first distinction between rules that are potentially relevant and rules that are of no interest. The minimum support value depends heavily on the available data, and there is no exact criterion for these choices [7]. In this case, the chosen minimum values were 0.005 for support and 0.1 for confidence.

# choose the minimum support and confidence values and fit the model
fpGrowth = FPGrowth(itemsCol="collect_list(product_name)", minSupport=0.005, minConfidence=0.1)
model = fpGrowth.fit(basketdata)

# display the most frequent itemsets
model.freqItemsets.display()
items = model.freqItemsets

# display the generated association rules
model.associationRules.display()
rules = model.associationRules

The 7 strongest rules, considering the lift metric, were the following:

Figure 5 – The 7 strongest rules with higher lift value

 

They show the already expected associations between fruits, or between garlic and onion. And in lines 3 and 4 there is a curious association between Organic Cilantro and Limes (Organic Cilantro → Limes and Limes → Organic Cilantro).

The confidence metric does not show high values here. However, the lift values are sufficiently greater than 1 for these rules to be considered strong. One factor that could lead to a low confidence value (and also a low support value) may be the wide variety of products in the dataset. In general, products are organized in a hierarchical structure with many levels: the greater the specificity, the more products exist at that level. To get around the wide variety of products, it is common to apply Market Basket Analysis at less specific levels, thereby decreasing the variety.

 

To sum up

Market Basket Analysis is an Association Rule Mining technique widely used by companies interested in finding relationships between the products in their stores. Using transactional data, this type of analysis seeks to find the most frequent patterns, i.e., which products are most often purchased together. Applications range from stock and store management to promotional campaigns and customer retention strategies.

Initially, the technique was performed with an algorithm called Apriori. Later, to mitigate Apriori's limitations regarding time and computational resources, a new algorithm was developed: FP-Growth, which can be used through PySpark in a Big Data architecture.

 

 

 
      André Pedrinho                             
      Data Scientist