23 February 2018

Introduction to Data Mining and the K-MEANS and K-MEDOIDS algorithms

Nowadays, due to fierce competition, companies struggle to assert their position in the market. Competition is a given, but how a company responds to it can determine its success. For many years, companies have relied on competitive advantages to gain market share. Today, however, the goal is less about acquiring new customers and more about retaining loyal customers and keeping them satisfied.

As such, the customer plays a central role in the company’s strategic decisions and future. Customers have the power to choose, and companies must retain or increase the customer’s interest. Over the years, we have seen the introduction of customer loyalty cards, contests and other campaigns designed to capture customers’ interest. Yet campaigns that were too broad and too impersonal often failed to engage customers or respond to their needs, and proved inefficient.

Thanks to technological advances, it is possible to store huge numbers of records in databases and to build solutions whose results support a company’s decision-making. It is therefore possible to profile customers and define areas of action, determine customers’ preferences, and establish consumption associations (when a customer buys product x, he/she is more likely to buy product y); in other words, to build a very accurate profile. These days, the ability to draw up a customer profile, and to know how to preserve that customer’s loyalty, is one of the biggest competitive advantages a company can have.

Thus, by using a Data Mining descriptive model, a company can segment its customers based on their behaviour and purchase patterns. This gives the company a deeper understanding of its customers and lets it launch targeted campaigns and promotions and, as a result, maximize profits.

Descriptive modelling in Data Mining has five stages:

1) Data Preparation
2) Data Pre-processing
3) Cluster Analysis
4) Profiling
5) Strategy

During Data Preparation, the variables that exist in the database are analysed for the first time. Irrelevant variables are rejected; problematic values such as outliers on relevant variables are imputed in some cases, as long as the variable has no other inconsistencies; and any other preparation needed is carried out to ensure that the model is built on high-quality variables.

Data pre-processing consists of transforming variables in order to enrich the model. Grouping variables that belong to the same product category or creating variables that measure monetary value are examples of transformations that can improve the model.
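
As a minimal sketch of such transformations, the Python snippet below groups per-product spending into product categories and derives a monetary-value variable. The product names, the category mapping and the function name `preprocess` are assumptions made purely for illustration; they do not come from any real dataset.

```python
# Hypothetical mapping from product-level variables to product categories.
CATEGORY_OF = {
    "wine": "drinks",
    "beer": "drinks",
    "bread": "bakery",
    "cake": "bakery",
}


def preprocess(customer):
    """Return category totals plus a monetary-value variable for one customer.

    `customer` maps a product name to the amount spent on that product.
    """
    features = {}
    for product, amount in customer.items():
        category = CATEGORY_OF[product]
        features[category] = features.get(category, 0.0) + amount
    # Monetary value: total amount spent across all products.
    features["monetary"] = sum(customer.values())
    return features


example = {"wine": 30.0, "beer": 10.0, "bread": 5.0, "cake": 15.0}
print(preprocess(example))
# {'drinks': 40.0, 'bakery': 20.0, 'monetary': 60.0}
```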

In the Cluster Analysis stage, we reach one of the most crucial parts of the segmentation model. By segmenting the data, we create groups that are homogeneous internally and heterogeneous with respect to one another. In other words, individuals in the same group share similar characteristics, while individuals in different groups differ greatly. It’s important to emphasize that the ultimate goal of this segmentation is to minimize the distances within each cluster and maximize the distances between clusters.
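
One simple way to check this goal on a given segmentation, sketched here under the assumption that Euclidean distance is used, is to compare the average distance between individuals of the same cluster with the average distance between individuals of different clusters. The function name `mean_distances` and the sample points are illustrative only.

```python
from itertools import combinations
from math import dist


def mean_distances(points, labels):
    """Return (mean within-cluster distance, mean between-cluster distance).

    Assumes at least one same-cluster pair and one cross-cluster pair exist.
    """
    within, between = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else between).append(dist(p, q))
    return sum(within) / len(within), sum(between) / len(between)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
within, between = mean_distances(points, labels)
print(within, between)
```

For a good segmentation the first value should be small compared with the second, as it is for this sample.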

The K-MEANS and K-MEDOIDS algorithms group customers (clustering) using the variables that were prepared in the earlier stages of the model.
The K-MEANS algorithm can be divided into a series of steps. For example, let us imagine a data model with only 2 variables, plotted on a Cartesian coordinate plane with 2 axes (X,Y):


Each grey point represents an individual; there are N individuals in total.

First Step:

Define the number of seeds (K), where K ≤ N, and place the seeds randomly in the Cartesian plane.


Each star is a seed.

Second Step:

Start of the iteration: each individual is associated with the closest seed.
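
This association step could be sketched as follows, assuming Euclidean distance (the text does not name a specific metric). The function name `assign` and the coordinates are made up for illustration.

```python
from math import dist


def assign(points, seeds):
    """Return, for each point, the index of its closest seed."""
    return [min(range(len(seeds)), key=lambda k: dist(p, seeds[k]))
            for p in points]


points = [(1, 1), (2, 1), (8, 9), (9, 8)]
seeds = [(0, 0), (10, 10)]
print(assign(points, seeds))  # [0, 0, 1, 1]
```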


Third Step:

Calculate the centroid of each cluster that was formed, that is, the mean position (the average of the coordinates) of all the individuals in that cluster.


The blue stars represent the new centroids (the re-centred seeds).
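
In code, a centroid is simply the coordinate-wise mean of the individuals in a cluster. The short sketch below uses a hypothetical helper named `centroid`; it works for any number of variables, not only two.

```python
def centroid(cluster):
    """Return the coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


print(centroid([(1, 1), (2, 1), (3, 4)]))  # (2.0, 2.0)
```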


Fourth Step:

Return to the second step and associate each individual with the closest centroid.


The individuals marked with a square now belong to the cluster of the same colour.

Fifth Step:

The iterations continue and end only when the centroids no longer move, that is, when no individual changes cluster. At this stage, the 4 clusters are final.
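
Putting the five steps together, a minimal K-MEANS sketch in plain Python might look like the following. It is an illustration under stated assumptions rather than a reference implementation: distances are Euclidean, the initial seeds are drawn from the individuals themselves instead of from arbitrary positions in the plane, and the names `k_means`, `max_iter` and `seed` are chosen here for the example.

```python
import random
from math import dist


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` into `k` groups; return (labels, centroids)."""
    rng = random.Random(seed)
    # First step: pick K seeds. Assumption: seeds are drawn from the data
    # points themselves rather than from arbitrary positions in the plane.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Second step: associate each individual with its closest centroid.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Third step: move each centroid to the mean position of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep the centroid of an empty cluster where it was.
                new_centroids.append(centroids[j])
        # Fifth step: stop once no centroid moves any more.
        if new_centroids == centroids:
            break
        # Fourth step: otherwise repeat with the updated centroids.
        centroids = new_centroids
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

On this small sample the first three individuals end up in one cluster and the last three in the other. Which cluster receives which index depends on the random seeds, and on less clearly separated data different seeds can lead to different final clusters.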

The K-MEDOIDS algorithm behaves in a very similar way to K-MEANS, but rather than moving the centroid to the mean position of the individuals (which is usually not an actual data point), the cluster centre, called the medoid, takes the position of the individual that is closest to the centre.
Therefore, in the third step of the K-MEANS example, rather than moving to the position of the blue star, the seed now moves to the position of the individual that is nearest to the centre (represented by the purple square).
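
The update described above can be sketched as picking the cluster member nearest to the mean position. K-MEDOIDS is also commonly formulated as picking the member with the smallest total distance to all other members of the cluster, so both variants are shown for comparison. Both function names are hypothetical, and Euclidean distance is assumed.

```python
from math import dist


def medoid_nearest_centre(cluster):
    """Member closest to the cluster's mean position (as described above)."""
    n = len(cluster)
    centre = tuple(sum(c) / n for c in zip(*cluster))
    return min(cluster, key=lambda p: dist(p, centre))


def medoid_min_total_distance(cluster):
    """Member with the smallest total distance to all other members."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))


cluster = [(1, 1), (2, 2), (3, 1), (10, 2)]
print(medoid_nearest_centre(cluster))      # (3, 1)
print(medoid_min_total_distance(cluster))  # (3, 1)
```

The two rules agree on this sample, although they can differ in general. In both cases the result is always a real individual, which makes the cluster centre less sensitive to an outlier such as (10, 2) than the mean would be.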


Profiling consists of combining the information about the segments in the clusters that were created. The final clusters are selected through an in-depth analysis, using techniques such as the elbow plot and the distribution of individuals across clusters for the number of seeds that was selected.
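
The elbow technique can be illustrated by running K-MEANS for several values of K and recording the within-cluster sum of squared distances for each. The sketch below is self-contained and makes the same assumptions as any simple K-MEANS (Euclidean distance, seeds drawn from the data); the names `k_means` and `within_cluster_ss` are chosen for this example.

```python
import random
from math import dist


def k_means(points, k, seed=0, max_iter=100):
    """Very small K-MEANS; returns the final centroids."""
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Mean position of each cluster; an empty cluster keeps its centroid.
        updated = [tuple(sum(c) / len(m) for c in zip(*m)) if m else centroids[j]
                   for j, m in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def within_cluster_ss(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
for k in range(1, 5):
    print(k, round(within_cluster_ss(points, k_means(points, k)), 2))
```

For this sample the value falls sharply between K = 1 and K = 2. Plotting the values against K gives the elbow plot, and the "elbow" is the point after which adding more seeds brings little further reduction.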

Strategy is the final step: it allows the company to carry out marketing strategies and promotional campaigns directed at each cluster, which represents a group of individuals with similar characteristics and purchase patterns.


Data Mining enables companies to discover patterns in the information about their customers with the help of segmentation algorithms applied to descriptive models. This information allows for a better understanding of the customers, which is a major competitive advantage in the market and is essential to maintaining or increasing market share.