We can consider CRISP-DM (Cross-Industry Standard Process for Data Mining) the reference process for building Advanced Analytics.
However, in an organizational context, the Deployment phase should be interpreted as something that may also involve incorporating the knowledge produced by the process into traditional analyses (dashboards, reports).
As an example, consider the following scenario:
- The organization intends to create a model that predicts whether a customer will remain loyal in the future.
Planned activities in each phase of the methodology:
1. Business Understanding and Data Understanding
- Understand the organization’s business and define its requirements. For example, we can define that, given the data history, a customer who has not purchased for more than 3 months is no longer considered loyal. This definition is the driver for building a training dataset in which every customer is labelled as loyal or non-loyal;
- Analyse, explore and understand the universe of existing data;
- Analyse data quality;
- Use of statistical techniques of Exploratory Data Analysis and other techniques to describe the nature of the data available in the organization:
– Analysis of location and dispersion metrics;
– Analysis of correlations;
– Principal component analysis (dimensionality reduction);
– Cluster analysis, widely used when we want to obtain some form of representation of data we do not yet know;
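The Data Understanding techniques listed above can be sketched in a few lines. This is a minimal illustration on synthetic data (the column meanings are invented for the example), assuming NumPy and scikit-learn are available:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic customer data: monthly spend, purchase frequency, tenure (illustrative only)
X = rng.normal(loc=[100.0, 4.0, 24.0], scale=[30.0, 1.5, 10.0], size=(500, 3))

# Location and dispersion metrics
print("means:", X.mean(axis=0))
print("std devs:", X.std(axis=0))

# Correlation analysis
print("correlation matrix:\n", np.corrcoef(X, rowvar=False))

# Principal component analysis (dimensionality reduction)
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Cluster analysis to surface groupings we do not yet know
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(km.labels_))
```

On real data, the outputs of these steps (distributions, correlated attributes, principal components, candidate segments) feed directly into the next phase.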
2. Data Preparation
- Construction of a dataset that will serve as basis for the training of the prediction/classification model;
- This dataset should consider all the knowledge gathered in the previous phase and should include treatment for data quality (e.g. NA treatment). It should incorporate conclusions drawn on the nature of the data (e.g. data distribution, correlation analysis, dimensionality reduction, choice of independent variables, etc.);
- The constructed dataset should ideally be the best possible representation of the customer universe under analysis (sample). It should include a set of attributes (independent variables) that describe the universe of data.
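A minimal sketch of this phase, assuming pandas and a hypothetical purchase-history table (the column names and the 90-day cutoff implementing the 3-month rule are assumptions for the example):

```python
import pandas as pd

# Illustrative purchase history (hypothetical columns and values)
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 4],
    "purchase_date": pd.to_datetime([
        "2023-11-20", "2023-12-15", "2023-07-01",
        "2023-12-28", "2023-10-02", "2023-05-10",
    ]),
    "amount": [50.0, 75.0, 20.0, None, 30.0, 15.0],
})

reference_date = pd.Timestamp("2024-01-01")

# NA treatment: impute missing purchase amounts with the median
purchases["amount"] = purchases["amount"].fillna(purchases["amount"].median())

# Aggregate independent variables per customer
features = purchases.groupby("customer_id").agg(
    last_purchase=("purchase_date", "max"),
    n_purchases=("purchase_date", "count"),
    total_spent=("amount", "sum"),
)

# Business rule from the previous phase: no purchase in the last 3 months => non-loyal
features["loyal"] = (reference_date - features["last_purchase"]).dt.days <= 90
print(features)
```

The resulting table is the training dataset: one row per customer, the aggregated attributes as independent variables, and the loyalty label as the dependent variable.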
3. Modeling
- In this phase, several mining algorithms are evaluated that allow predicting the dependent variable (the probability of a customer ceasing to be loyal in the future);
- The use of Model Ensemble techniques and the use of different algorithms (SVM, RandomForests, GLM, Naive Bayes, etc.) are considered;
- The nature of the problem will determine the list of potential algorithms to consider;
- A training dataset (historical customer data) is used to determine the best parameters of each model and to support the choice of the best one.
4. Evaluation
- Test whether the model achieves the proposed objectives (predicting a customer’s likelihood of ceasing to be loyal in the future);
- At this stage, the performance of the model is estimated using techniques traditional in data mining (e.g. the bootstrap);
- Based on the results obtained, the model is evaluated and a decision is made on whether to put it into production.
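The bootstrap estimate mentioned above can be sketched as follows, assuming scikit-learn and synthetic data; test-set predictions are resampled to put a confidence interval around the accuracy estimate:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the customer dataset
X, y = make_classification(n_samples=500, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
correct = model.predict(X_te) == y_te

# Bootstrap: resample the test predictions to estimate accuracy and its uncertainty
rng = np.random.default_rng(0)
boot = np.array([
    correct[rng.integers(0, len(correct), len(correct))].mean()
    for _ in range(1000)
])
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"accuracy: {correct.mean():.3f}, 95% bootstrap CI: [{lo:.3f}, {hi:.3f}]")
```

A wide interval signals that the performance estimate is unreliable, which is itself useful input to the go/no-go deployment decision.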
5. Deployment
- Implementation of the model in the production environment (automating and systematizing its use);
- Deployment may also involve incorporating the new knowledge (the forecast of a customer’s likelihood of ceasing to be loyal in the future) into existing Dashboards, Reports, or even CRM tools;
- For example, with this model, customer support can know in advance, at the time of service, whether the customer has a high probability of not remaining loyal, and use that information to decide whether to offer new promotions or discounts that may preserve loyalty.
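A minimal sketch of how the deployed model could be exposed to CRM or dashboard code; the helper names (`churn_probability`, `suggest_action`), the threshold, and the class ordering are assumptions for the example, and the trained model here is a synthetic stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the chosen, validated model (in production, load the trained model)
X, y = make_classification(n_samples=300, n_features=4, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X, y)

def churn_probability(model, customer_features):
    """Hypothetical helper: probability that the customer ceases to be loyal
    (assumes class 1 encodes 'non-loyal')."""
    return model.predict_proba([customer_features])[0][1]

def suggest_action(model, customer_features, threshold=0.6):
    """Hypothetical business rule: offer a retention promotion to high-risk customers."""
    p = churn_probability(model, customer_features)
    return "offer promotion" if p >= threshold else "no action"

print(suggest_action(model, X[0]))
```

Wrapping the model behind a small, stable interface like this is what lets dashboards, reports, and CRM screens consume the prediction without knowing anything about the underlying algorithm.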
This classic example illustrates a methodology that has been in use for more than 20 years but remains a good reference for best practices in the construction of this type of analysis.
It should be noted that many of the advanced analytical techniques referred to are often used on their own in some phases of CRISP-DM. The most typical example is the diversity of techniques used in the Data Understanding phase. For example:
- Advanced Visualization – used as one of the key tools for analysing large volumes of data or data with high dimensionality. Examples:
– Parallel coordinates for multidimensional data analysis;
– Correlation matrix;
- Network Analysis (social or other networks) for the detection of previously unknown communities or relationships (used in the famous Panama Papers case).
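The community-detection idea can be sketched with NetworkX on a toy graph (the edges are invented for the example); modularity-based detection surfaces groups that were never labelled in advance:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy network: two tightly connected groups joined by a single bridging edge
G = nx.Graph()
G.add_edges_from([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])

# Greedy modularity maximization partitions the nodes into communities
communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])
```

On a real network (e.g. company–shareholder links, as in the Panama Papers analyses), the same call scales to graphs with many thousands of nodes.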
A final note on Predictive Analytics
In the context of Advanced Analytics, Predictive Analytics can be considered the discipline that best synthesizes all of these techniques. It encompasses the use of techniques from statistics, Data Mining/Data Science, and Machine Learning in the analysis of historical and master data to build predictive models of the future.
We can thus consider that Predictive Analytics is the basic discipline that supports Advanced Analytics.
Advanced Analytics represents the new era of Business Intelligence & Analytics in organizations. It is a fundamental complement to traditional analysis, and allows a more preventive or predictive management of the organization, as opposed to management by reaction (based on analysis of the past).
Creating advanced analyses in an organization requires human resources with specialized knowledge of the techniques mentioned, and a change of culture in organizations (from reactive management to more preventive/predictive management).
Organizations cannot predict the future, but if they can detect repeating patterns, they can prepare and position themselves in advance to improve future results.