This article describes an implementation of the K-Means and Association Rules algorithms, highlighting technologies such as R Studio, Tableau Desktop and Cloudera Hadoop.
This implementation took place in the context of a real project for a multinational company whose operations include distribution, customer service, logistics and transportation, and it cuts across the various sectors of the business.
This research aims to address the backorder issues that affect many companies, mostly in the retail sector, and that increasingly deserve the attention of stakeholders. An order is classified as a backorder when one or more of its products are out of stock at the supplier. The primary goal of this implementation is to understand how the company can efficiently respond to the needs of each customer and avoid this problem.
To obtain the proposed conclusions, the tools used were R Studio and Tableau.
In the sections below, the implementation is described in detail, as well as the solutions proposed.
K-means is a clustering algorithm that distributes the data into k segments. Each segment has a centroid corresponding to its mean, and each element is assigned to the segment whose centroid is closest to it.
Tableau was used to apply this algorithm, since the tool includes built-in K-Means clustering. To do this, an auxiliary table was created in Cloudera with information about backorders. Cloudera was used because of the project context mentioned above: it simplified the connection to the other tools used in this article (Tableau and R Studio), and the Data Lake information already suited the needs of the algorithms to be applied. However, Tableau also integrates many other sources of structured information, such as files, MySQL, Oracle, Teradata, etc.
Before the integration with Tableau, it is necessary to create a table with the essential columns for the application of the K-Means algorithm. In this case, we use the following columns, which are relevant for backorders analysis:
After creating the table, we connected Tableau Desktop to our cluster in order to apply the K-Means algorithm.
Once the table is imported into Tableau, its fields and data can be inspected.
For the K-Means application, it is necessary to create a new sheet and choose the scatter plot chart type. The sums of the quantities and of the values were added to "Rows" and "Columns" respectively, since these were the fields chosen for dividing the clusters. In "Marks", the "SK Customer Shipto" field (ID) was placed to better visualize the distribution of clients into each cluster.
However, when creating the clusters, it is always possible to change the fields used, as well as the defined ID, according to the purpose of the study.
To create the clusters, it is only necessary to open the "Analytics" tab and drag the "Cluster" option onto the "Marks" card, and then add the variables "sum of quantities" and "sum of values" to the cluster. Tableau automatically suggests the optimum number of clusters, but this value can also be changed manually.
There are several methods to determine the correct number of clusters, but the "elbow graph" is the most used. This method plots the sum of squared errors against the number of clusters; the ideal number of clusters is the point from which the slope becomes less pronounced. For our sample, given the amount of data, we chose 3 clusters.
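Although the clustering in this article was performed in Tableau, the elbow method can be sketched in R as well. The snippet below is purely illustrative: the data is synthetic and all names are assumptions, not values from the project.

```r
set.seed(42)

# Synthetic two-column sample standing in for the sums of quantities and values
# (three artificial groups, so an "elbow" appears around k = 3)
sample_data <- data.frame(
  qty = rnorm(300, mean = rep(c(5, 50, 200), each = 100), sd = 5),
  val = rnorm(300, mean = rep(c(10, 100, 400), each = 100), sd = 10)
)

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  kmeans(sample_data, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens
plot(1:8, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
```

Plotting `wss` against k reproduces the elbow graph described above; the error drops sharply until the true number of groups is reached and flattens afterwards.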
To obtain a more homogeneous sample, the outliers were removed, since they are values that are not representative of the sample. The result is presented below, where the generation of 3 clusters can be verified: cluster 1, the largest, represents clients whose numbers and amounts of backorders are the lowest; cluster 2 contains the average quantities and amounts; and cluster 3, the smallest and most heterogeneous, groups the clients with high quantities and amounts.
The next step was to export the result of each cluster. The data was then reimported into Tableau to support the association rules, detailed in the following steps.
Association rules show information about things that tend to happen together. Through the association rules, it is possible to understand the behaviour patterns of each cluster previously generated by the K-Means algorithm.
In the first stage of this process, to obtain a focused analysis at the product level, it is essential for the association rules to exclude products that appeared only once in the total transactional sample. Excluding these products prevents the algorithms that create the association rules from distorting the results by drawing conclusions from a single example. Therefore, the number of times each product is repeated throughout the sample is counted, and that information is then aggregated into a new column.
At this point, we have already added information about backorders that have more than one associated product, allowing us to proceed to the next step, which filters out of our sample, cluster by cluster, all the backorders whose product appeared only once.
When the backorders with more than one associated product were identified for each cluster, the resulting samples turned out to have too few rows to run the association rules. As the number of records in each cluster was insufficient for the desired conclusions, the initial approach had to be changed, and the analysis was performed on the total sample instead. In this way, the number of records is sufficient to make the algorithm more efficient and also flexible when changing parameterizations.
The next step, after running the R Studio program and extracting the data from Tableau to a .csv file, is to import the entire sample into R Studio through the following command:
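A command along the following lines performs the import (the file name and path are assumptions; the variable name `Cluster_all` follows the text below):

```r
# Import the full sample exported from Tableau into a data frame.
# "cluster_all.csv" is an assumed file name; adjust the path and separator as needed.
Cluster_all <- read.csv("cluster_all.csv", header = TRUE, sep = ",")
```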
"Cluster_all" is the defined variable, which contains all sample data imported from the .csv file.
At this stage, with the data already imported into R Studio, a demonstration of the first records in the program can be performed through the following command:
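A command of this form shows the first records (by default, `head()` prints the first six rows):

```r
# Display the first rows of the imported sample
head(Cluster_all)
```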
For data and structure validation imported through the file, and to facilitate the accessibility of the objects in R Studio, it is necessary to use the following function:
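The function in question is `attach()`, applied to the imported data frame:

```r
# Make each column of the data frame directly accessible by name
attach(Cluster_all)
```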
The attach() function makes each column of the data frame a global variable within R Studio, thus allowing the columns to be used individually when performing operations.
At this moment, with the file loaded as a global variable in R Studio, the next step is to install the package containing the functions needed to create and visualize the association rules. To install packages in R Studio, just run the following command:
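For the association rules workflow described here, the package in question is `arules`:

```r
# Install the arules package (its dependencies, such as Matrix, are installed as well)
install.packages("arules")
```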
The installation, including the "Matrix" dependency, may take some time, but as soon as it completes successfully, the "arules" library becomes available, giving access to all the functions it provides. The most important of these is apriori(). Before parameterizing and executing the algorithm that creates the association rules, the library must be loaded with the following command:
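Loading the library is a single call:

```r
# Load the arules library, which provides apriori(), inspect(), sort(), etc.
library(arules)
```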
The apriori () function, which executes an algorithm for creating association rules on the sample, already has its set of predefined parameterizations. However, it is also possible to manipulate the various parameterizations available by the function, which allow the users, according to their interests, to adapt the level of detail that they want to analyse.
In this example, the parameterizations focused on the confidence the algorithm requires when analysing each association. Although the sample is larger than the initial one, it is not very large in number of records, so the parameterizations had to be adapted. In addition, with the default parameters the function only returns associations with confidence values close to 1, and the intention was to obtain greater diversity in the results.
Three variables were changed in the execution of the following function:
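Based on the parameters described below (support of 1.5%, confidence of 0.5 and a minimum of 2 items), the call would look roughly like this; the result variable name `rules` is an assumption:

```r
# Create association rules over the full sample with the adapted parameters:
# minimum support of 1.5%, minimum confidence of 0.5, at least 2 items per rule
rules <- apriori(Cluster_all,
                 parameter = list(supp = 0.015, conf = 0.5, minlen = 2))
```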
The supp (support) parameter sets the minimum frequency that an itemset must reach in the sample to be included in the algorithm's results. In the defined function, only scenarios that represent at least 1.5% of the total sample are considered.
The conf (confidence) parameter expresses the probability, based on the frequency of events, that a scenario Y occurs when X also occurs. As a rule, it is preferable to analyse cases with confidence values close to 1. In the function, a minimum confidence of 0.5 was defined.
The minlen (minimum length) parameter defines the minimum number of items required in an association rule. In the function, a minimum of 2 items was set.
Below is a summary of the results of the apriori() execution, which indicates that 246 association rules were created. Since it is not feasible to validate every rule generated by the algorithm, the information can be sorted according to the available measures (support or confidence, for example). In this example, the results of the apriori algorithm are put in ascending order by the confidence measure, using the following command:
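The sorting can be done with the arules `sort()` method; note that it orders in descending order by default, so `decreasing = FALSE` gives the ascending order described (the variable names are assumptions):

```r
# Order the rules by confidence, ascending
rules_sorted <- sort(rules, by = "confidence", decreasing = FALSE)
```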
Finally, to display the association rules created by the apriori algorithm, ordered by confidence, the following command was run:
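Displaying the rules is done with `inspect()`:

```r
# Print the rules together with their support, confidence, lift and count
inspect(rules_sorted)
```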
Due to the high number of rules (246), a filter is applied to reduce the number of rules displayed and thereby decrease their redundancy. The example demonstrates the first 30 rules, obtained with the following command:
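Limiting the output to the first 30 rules can be done by combining `head()` with `inspect()`:

```r
# Show only the first 30 rules of the sorted result
inspect(head(rules_sorted, n = 30))
```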
As for the result of the association rules created, the analysis metrics are highlighted, at the top, in the result header:
The columns lhs and rhs are the associations generated and analysed by the apriori () algorithm. With a brief example, it is possible to visualize that the value X, represented in the column lhs, is associated with the value Y, in the column rhs.
The support column shows the frequency of association events over the total sample records. A greater support value translates into a higher clarity of results due to the repetitions of the same scenario in our data.
This percentage is obtained by dividing the value in the count column by the total number of records. For example, line 1 indicates that in 3% of the entire sample, "Cod_Customer_Shipto_Country" equals "FI" and the record also belongs to Cluster 2.
The confidence column shows, for each rule created and based on the frequency of events, the probability of an event Y occurring when X exists. For example, line 1 of the result shows, with 52% confidence, that records where "Cod_Customer_Shipto_Country" equals "FI" also belong to Cluster 2 (Cluster 2 is detailed at the end of the K-Means process); this association covers 3% of the sample.
Lift measures how much more often lhs and rhs occur together than would be expected if they were statistically independent: lift(X => Y) = supp(X ∪ Y) / (supp(X) × supp(Y)).
When this value is greater than 1, lhs and rhs are positively correlated, that is, the occurrence of X makes the occurrence of Y more likely (without necessarily implying it).
The count column represents the number of records in the sample that match the rule.
As demonstrated, the apriori function created 246 association rules, and many of them are redundant. Consider, for example, the following scenario: X happens when Y also occurs in 20% of the sample with 70% confidence, and the model also indicates the inverse, where Y occurs in 20% of the sample when X occurs, again with 70% confidence. R Studio allows these redundant cases to be removed based on the confidence measure: a rule is considered redundant if a more general rule exists with the same or a higher confidence. The command to remove the unnecessary rules is as follows:
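The arules package provides `is.redundant()` for this; a filtering expression along these lines removes the redundant rules (the variable name `rules_pruned` is an assumption):

```r
# Keep only the non-redundant rules, judged by the confidence measure
rules_pruned <- rules[!is.redundant(rules, measure = "confidence")]
```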
The next step is the ascending order of results by the measure of confidence:
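As before, the sorting uses `sort()` with `decreasing = FALSE` for the ascending order described:

```r
# Order the pruned rules by confidence, ascending
rules_pruned_sorted <- sort(rules_pruned, by = "confidence", decreasing = FALSE)
```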
Next, all the remaining rules are displayed, the redundant ones having already been excluded, with the following command:
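The display again uses `inspect()`:

```r
# Print all rules that survived the redundancy filter
inspect(rules_pruned_sorted)
```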
The number of rules was thus reduced from 246 to 89.
In this article, it was possible to verify the application process of K-Means and Association Rules through a practical example. Initially, the sample was prepared and imported into Tableau, where a scatter plot was created and the sample was divided into clusters. The result was 3 clusters: cluster 1 with low values and quantities, cluster 2 with average values and quantities, and cluster 3 with high values and quantities. To apply the association rules, the data resulting from the clusters were extracted to .csv and imported into R Studio. Using a support level of 1.5%, 50% confidence and a minimum of 2 items, 246 association rules were extracted; the redundant rules were subsequently removed, reducing the rules analysed to 89.
It is important to mention that both K-Means and Association Rules are flexible algorithms that allow the context to change simply by swapping the columns of the tested sample. In this article, the focus of K-Means was client segmentation by the quantity and amount associated with backorders, but another type of segmentation could quickly be produced using other logical attributes of the business. Association Rules offer the same flexibility, making it possible to analyse many other scenarios of attribute correlation quite different from those presented here.
As for the tools used for the respective algorithms: Tableau, being a visualization tool, allows a straightforward application of the K-Means algorithm; as shown, in a few steps we were able to define and change the attributes and granularity of the clusters. R Studio also allows K-Means to be applied, but it is not as visual and is therefore less appealing for this purpose. However, R Studio has a vast array of libraries with predefined functions, as was used for the Association Rules, and, after a short learning time, its use becomes fluid and easy to understand.