8 June 2017

Business Intelligence & General Data Protection Regulation (GDPR)

After four years of discussion and preparation, the European General Data Protection Regulation (GDPR) was finally approved by the European Parliament on 14 April 2016. It entered into force twenty days after being published in the Official Journal of the European Union and becomes directly applicable in all member states two years later, on 25 May 2018, the moment from which any organization still not compliant will face heavy penalties.

The GDPR, which replaces the Data Protection Directive 95/46/EC, was designed to close a gap by harmonizing data protection laws at the European level, responding to growing demands for empowerment from citizens and, finally, holding accountable all entities that collect, process and store personal data.

The goal of the GDPR is to protect EU citizens from data and privacy breaches in a world with ever more information, completely different from the world in which the 1995 directive was created. Since the underlying data privacy principles remain the same, several changes have been made to the previous policy. Among them, the most relevant concern the following aspects:

Geographical scope:

> Broader territorial scope (applicability beyond EU borders);

> Applicable to all entities that process personal data of EU citizens, regardless of where the entity is located.


Sanctions:

> Entities found in violation of the GDPR are subject to fines of up to 4% of their annual global turnover or 20 million euros, whichever is higher.


Consent:

> The request for consent to process data must be presented to the citizen in a clear and easily understandable manner.
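As a minimal worked example of the sanctions ceiling mentioned above, and assuming the "whichever is higher" rule of Article 83(5) that governs these two caps, the upper bound of a fine can be computed as:

```python
# Illustrative only: the GDPR cap is the *greater* of 4% of annual
# global turnover or EUR 20 million (Art. 83(5)).
def max_gdpr_fine(annual_global_turnover_eur: float) -> float:
    """Return the upper bound of a GDPR administrative fine, in euros."""
    return max(0.04 * annual_global_turnover_eur, 20_000_000)

print(max_gdpr_fine(1_000_000_000))  # 4% of 1 bn EUR = 40 million EUR
print(max_gdpr_fine(100_000_000))    # 4% would be 4 million, so the 20 million floor applies
```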

To ensure compliance with the new regulation, all entities subject to it should adapt, setting up workgroups to address critical questions such as:

> Which business processes use data covered by this regulation?

> In which operational and analytical systems do these data reside?

> What is the information lifecycle?

> Which transformation processes are applied to these data?

> Who accesses the information in each system?

Within the scope of this discussion, we will focus on organizations' analytics systems. Typically, an organization's analytical data are used, among other purposes, to make operational and strategic decisions, to report to regulators, to find market trends, or even to predict future events based on past behavior. We can therefore assume that a large part of these data come from the organization itself, such as data on all actions taken by its customers (orders, payments, etc.), but are also enriched with information from external providers, for instance macro indicators for the market in which the company operates, allowing it to compare its performance with that of its peers.

These analytical data are usually held in data warehouses or data lakes and consumed through analytics tools, but there are still many organizations where this is not yet the case and where the data reside in Excel files, over which access control is far less effective. Controlling access to this information is one path an organization can take towards compliance with the new regulation, but it will always be a path that tries to remedy a situation which, in most cases, should not arise in the first place.

Unlike a Customer Relationship Management (CRM) system, which needs to uniquely identify customers and prospects, specifically their names, contacts and addresses, in order to run direct contact campaigns that grow the customer base or develop new business through up-sell and cross-sell campaigns, analytics systems do not require such information.

Organizations can follow several paths to reach compliance with the regulation, once again bearing in mind that this discussion focuses on analytics systems. From the many possibilities available, we will concentrate on two scenarios:

> Apply data masking techniques to sensitive data only, as they enter the analytics systems (scenario 1);

> Prevent that information from entering the analytics systems at all (scenario 2).

In truth, neither conventional analyses nor those leveraging advanced techniques (neural networks, cluster analysis, semantic analysis) need data that uniquely identify a person. We do not need names or tax IDs for this type of analysis; what really matters are characteristics such as gender, age, or where the customer lives. Even so, these questions are never simple, and we can face situations in which the analytics systems feed data back to CRM or other operational systems. For example, after a customer segmentation reveals that 5% of customers are at risk of giving up contracted services, the organization may want to run specific campaigns for those customers, and for that it is crucial to know who they are. The simplest way is to send that customer list to the above-mentioned systems, and fictitious data cannot be sent. There are several data masking techniques, including techniques that allow data to be unmasked with a key; this is the right option if the organization chooses to develop a data masking strategy.

Scenario 1 – Data masking
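A minimal sketch of such keyed, reversible masking (pseudonymization), assuming a secret key and an unmasking table held outside the analytics system; all names and values here are illustrative:

```python
import hashlib
import hmac

# Hypothetical secret; in practice this lives outside the analytics system.
SECRET_KEY = b"keep-this-outside-the-analytics-system"

# In practice a secured database on the operational side, not an in-memory dict.
_unmask_table: dict[str, str] = {}

def mask(value: str) -> str:
    """Produce a deterministic pseudonym and record how to reverse it."""
    token = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]
    _unmask_table[token] = value  # stored only on the secured side
    return token

def unmask(token: str) -> str:
    """Reverse the masking; only possible where the table and key live."""
    return _unmask_table[token]

t = mask("Maria Silva")
assert mask("Maria Silva") == t      # deterministic: joins and counts still work
assert unmask(t) == "Maria Silva"    # reversible, but only with the lookup table
```

Deterministic tokens keep segmentations and joins working inside the analytics system, while re-identification remains confined to the system that holds the key and the table.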

On the other hand, if this is the path chosen and the sensitive data are never present in the analytics systems to begin with, how can an organization channel the needed information in the scenario described above? Before analyzing the situation, it is worth highlighting that operational and analytics systems ensure the uniqueness of records in different ways:

> Operational systems: natural or business keys are used, created in the course of normal business processes (e.g. an invoice number);

> Analytics systems: a good practice in these systems is the creation of a surrogate key, usually an integer. These systems keep the correspondence between the surrogate key and the natural or business key.
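The surrogate-key practice above can be sketched as follows; the mapping class and the key format are hypothetical:

```python
# Sketch of surrogate-key assignment: natural keys arrive from the
# operational system, and the analytics side works only with the integer
# surrogate. The natural-to-surrogate map is the sensitive correspondence.
class SurrogateKeyMap:
    def __init__(self) -> None:
        self._by_natural: dict[str, int] = {}
        self._next = 1

    def surrogate_for(self, natural_key: str) -> int:
        """Return a stable integer surrogate for a natural/business key."""
        if natural_key not in self._by_natural:
            self._by_natural[natural_key] = self._next
            self._next += 1
        return self._by_natural[natural_key]

keys = SurrogateKeyMap()
print(keys.surrogate_for("INV-2017-0001"))  # 1
print(keys.surrogate_for("INV-2017-0002"))  # 2
print(keys.surrogate_for("INV-2017-0001"))  # 1 again: the mapping is stable
```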

Another best practice in the architecture of an analytics system is the creation of an Operational Data Store (ODS). This database can be used, among other purposes, to transfer data from the operational to the analytics system and vice-versa. Returning to the aforementioned scenario, in which the information needed to uniquely identify a person does not exist in the analytics system, this information can reside only in the ODS. The ODS is a system in which access can be controlled more tightly, since typically only other systems, and generally only automated processes, access it to transfer the needed information. Since the analytics system holds the correspondence between the natural or business key and the surrogate key, the list the CRM needs, identified in the data warehouse, can be sent to the ODS as a list of keys only, and the ODS takes over, sending the information that actually identifies the customer to the destination system.

Scenario 2 – Non-inclusion of sensitive data in analytical systems
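The key-list hand-off through the ODS described above can be sketched as follows; the identity store, keys and function names are purely illustrative, not a real API:

```python
# Hypothetical ODS-side step: the data warehouse sends only surrogate
# keys; the ODS, which holds the surrogate-to-natural correspondence and
# the identifying attributes, resolves them before forwarding to the CRM.
ods_identity_store = {
    101: {"natural_key": "C-0042", "name": "Maria Silva", "email": "maria@example.com"},
    102: {"natural_key": "C-0077", "name": "João Sousa", "email": "joao@example.com"},
}

def resolve_for_crm(surrogate_keys: list[int]) -> list[dict]:
    """Turn an anonymous key list into an identified list for the CRM."""
    return [ods_identity_store[k] for k in surrogate_keys if k in ods_identity_store]

at_risk = [101, 102]  # e.g. the 5% at-risk segment identified in the warehouse
campaign_list = resolve_for_crm(at_risk)
print([c["name"] for c in campaign_list])
```

The analytics system never stores names or contacts; re-identification happens only in the ODS, whose access is restricted to automated transfers.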

So far we have focused on production environments, in which the business is actually run and analyzed, but many organizations also have non-production environments: usually development environments, where new functionalities are built, and quality environments, where those developments are tested before being migrated to production. One of the big challenges of these developments is that they are based on an organization’s reality, meaning a development is only as good as the quality of the data on which it is based. Here, there are two possible approaches:

> Fictitious data: organizations can choose to provide these environments with data that are not real, even if that puts the quality of the developments at risk, since they are not built against the company’s reality, and creating such data requires time and know-how;

> Real data: organizations can choose to transfer a real data set directly from the production environment, ensuring that developments are based on the organization’s reality. Where no data masking policy is applied in these environments (as in scenario 1, discussed earlier), and where the choice is not to withhold the information (as in scenario 2), sensitive data are being propagated to other environments. It should not be forgotten that these developments are often carried out by external companies hired specifically for those tasks; in that case, beyond exposing personal sensitive data to third parties, even the organization’s own sensitive data can be handed over and, as a result, jeopardized.
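One way to reconcile the two approaches is to mask a production extract before copying it to a non-production environment. A minimal sketch, assuming hypothetical column names and irreversible hashing of the identifying fields:

```python
import hashlib

# Assumed identifying columns; everything else is kept so the extract
# still reflects the organization's reality for development and testing.
PII_COLUMNS = {"name", "email", "tax_id"}

def mask_row(row: dict) -> dict:
    """Irreversibly hash PII columns; pass analytics-relevant columns through."""
    masked = {}
    for column, value in row.items():
        if column in PII_COLUMNS:
            masked[column] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        else:
            masked[column] = value
    return masked

prod_row = {"name": "Maria Silva", "email": "maria@example.com",
            "tax_id": "123456789", "age": 34, "city": "Lisboa"}
dev_row = mask_row(prod_row)
print(dev_row["age"], dev_row["city"])  # non-sensitive fields survive intact
```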

Returning to the start of our discussion, organizations will have to adapt to the GDPR. Several options can be leveraged to ensure compliance with the regulation, and there is no single solution applicable to all organizations. Each case is unique, and within the scope of this article we discussed some ways to reconcile organizations’ analytics systems with this regulation. It is now up to each organization to assess what is best for its particular situation.

      Sérgio Costa
Solutions Development