20 July 2018

How to use Python to compare automatically data between DWH and Reporting systems?

Importance of ensuring data quality

In a Business Intelligence (BI) system, the Data Warehouse (DWH) presentation layer is where data is organized, stored, and made available for direct querying by users, report writers and other analytical BI applications. By default, all BI applications get access to data available in that layer and the Reporting tools are not an exception. Business users use such tools to analyse data available in the presentation layer and, therefore, extract information that can improve their business decision making. In Reporting tools like Qlik Sense (Qlik), data becomes available after a data refresh process. At this point, the data from Data Warehouse are read directly. However, refreshing data not always occurs with success. When this happens, data from Qlik and database become inconsistent, affecting the analysis that a business user intends to do to them. In order to guarantee the quality of data that an end user can see, it is important to test all the visualizations created in Qlik, by comparing the data stored in the Database/DWH with those displayed in this Reporting tool.

How to compare data between Qlik and DWH?

One of the Reporting tools that can be used as an integral part of a Business Intelligence system solution is Qlik Sense (Qlik). Nowadays, there is no out-of-the-box feature that allows, immediately, the comparison of data between Qlik reports and DWH. This type of analysis can be done, manually, as follows:

1) Extract data from Qlik’s report: Qlik provides a feature that allows the user to export data from its report/visualization. When choosing this feature, it is possible to create an Excel file with all the data that feeds this visualization.

2) Extract data from DWH: The extraction of data from the DWH system is done by reading queries created specifically to query the data stored in it. The idea is to create a query that feeds the visualization available at Qlik and export all the data returned as a result to a file where it can be checked (Excel).

3) Use Excel to compare the extracted data: The comparison of the extracted data mentioned in points 1) and 2) can be done using Excel. The data is copied and sorted into 2 separated worksheets: “DWH” and “Qlik”. The “DWH” sheet only has data that comes from the Data Warehouse, while the “Qlik” sheet only has data provided by a Qlik’s visualization/report.

These two spreadsheets can be read by other two (“Comparison” and “Comparison Details”). Each one of them applies a set of formulas to make a direct comparison of the values of each cells in the “DWH” spreadsheet with the respective cell of “Qlik” spreadsheet. The formulas used in the "Comparison" and "Comparison Details" worksheets are, respectively:

=Qlik!A2=DWH!A2 e =IF(Comparison!C2=TRUE;0;CONCATENATE(Qlik!C2;" - ";DWH!C2)). The following images show the use of Excel when comparing data between Qlik and DWH.

python

Figure 1 – Comparison between Qlik and DWH data – Spreadsheet DWH.

python

Figure 2 – Comparison between Qlik and DWH data – Spreadsheet Qlik.

python

Figure 3 – Comparison between Qlik and DWH data – Spreadsheet Comparison.

python

Figure 4 – Comparison between Qlik and DWH data – Spreadsheet Comparison Details.

This way, you can quickly analyse the data and see if there are any difference between the Qlik and DWH data and, if this happens, what the difference is and in which cases it occurs.

Which problems come from a manual analysis of data?

Comparing data manually between Qlik and DWH does not always result in a quick and efficient analysis. If the information extracted from a Qlik’s visualization comes up with a huge amount of data, analysing them may take some time. Applying a set of formulas, manually, to all lines and columns of an Excel file to compare data is not practical and requires some time from the person who is analysing them. Let us suppose that you want to analyse the invoiced sales for 2018, in EMEA countries. This analysis will come up with 50 lines, approximately. In this case, the analysis of data between Qlik and DWH is relatively fast.

However, if the purpose is to make an analysis of invoiced sales for 2018, in EMEA countries, by customer and product, the number of lines that we will get from Qlik and DWH will be much bigger than 50 and may be around thousands. In this case, analysing data between Qlik and DWH is much harder because it takes too much time, not only due to the repetitive task of applying, manually, all formulas to all lines of the Excel file, but also copying and sorting in the same way data that come from Qlik and DWH. Ensuring the same data sort of Qlik and DWH is also important when comparing them, because this will avoid differences caused by incorrect sort of data when they are copied to Excel spreadsheets. In short, the larger the amount of data, the longer the time needed to spend in its analysis.

How to automate data analysis between Qlik and DWH?

Comparison of data between Qlik and DWH can be done automatically using Python and Selenium. The idea is simple and consists of creating a script that receives, as input, the Qlik and DWH data, compare against each other and return an Excel file as output, similar to the one shown in figures 1-4. The created file can be used to analyse data when comparing data between Qlik and DWH. In this way, anyone is able to look at the file and understand how the data were compared to each other and which formulas were applied in its comparison.

As mentioned earlier, the idea is to pass some input data, compare them together, and return the result of their comparison. Usually, Qlik’S data is obtained through Excel files. DWH data can be obtained through a query made to the system without having to export them, previously, to an Excel file to be consulted. With Python, it is possible to define distinct modes about how Qlik and DWH data are passed as input.

Depending on how data is obtained, you can create a script that enables 3 types of input data and therefore 3 distinct modes of data comparison:

4) Excel vs Excel: two Excel files, one with data exported from Qlik and another one with data exported from DWH.
5) DWH vs Excel: an Excel file with data exported from Qlik, and a text file (.txt) with the query to read data stored in the DWH.
6) DWH vs Selenium: a text file (.txt) with the query to read data stored in the DWH and a link, passed in a configuration file, to the Qlik report/visualization.

Excel vs Excel:
If an Excel vs. Excel comparison is chosen, reading the data via Python requires creating a function that is able to read the Qlik and DWH data from the Excel file and store them in variables whose type is dataframe (qlik_df and dwh_df). These variables will then be used for writing the output file.

DWH vs Excel:
If the option is to choose a comparison of type DWH vs Excel, it means that a query in a text file is passed as input to get the DWH data. In Python, it is necessary to create a function that is, not only able to read the Qlik’s data in an Excel file, but also able to read the input query. This query must be passed as an argument to another function that can create a connection to the DWH system and execute it in order to extract the DWH data. As in the previous method, the data will be stored in dataframe variables (qlik_df and dwh_df) so they can be used later when writing the output file.

DWH vs Selenium:
For the case where the option is to make a comparison of type DWH vs Selenium, the solution, like in the previous mode, uses Python to establish a connection to the DWH, run the input query and extract the returned data as a result of the query’s execution. The difference is in how data is obtained from Qlik. An alternative is to use Selenium to access the Qlik visualization/report, apply some filters, if necessary, and export them to an Excel format without the need of doing it manually. Selenium is a framework, available for Python, that provides an API for writing system functional tests. We can use this framework to write code that simulates the behaviour of a business user when he wants to access or view a report in Qlik and export their data for analysis. The functions available in the Selenium API replicate the user’s behaviour, receiving as arguments, HTML or XPath code elements, and running over a webdriver. Once more, data can be stored in dataframe variables (qlik_df and dwh_df) so they can be used later to write the output file.

Data type conversion:

One of the problems that arises when extracting Qlik and DWH data is related with the data type of dimensions’ attributes. Sometimes, the same dimension’s attribute is created with a different data type in Qlik and DWH. Data from Qlik and DWH are sorted differently if, for the same attribute, the data type is different. In fact, their incorrect sorting leads to differences being detected when they do not exist. Take as an example the Customer dimension and the customer code as one of the attributes of this dimension. Let us suppose that there are 3 customers, whose codes are 2000, 3000 and 10000. If the client code, in Qlik, is represented as a string, sorting these 3 customers, in ascending order, generates the following sequence: 10000, 2000, and 3000. If the same attribute is created in DWH as an integer, sorting these 3 customers, in ascending order, generates the following sequence: 2000, 3000 and 10000. A workaround for dealing with differences in data types between Qlik and DWH is to convert all attributes of dimensions to string and all metrics as float when reading data in Python. With this, we can guarantee that Qlik and DWH data is always sorted in the same way.

How to distinguish dimensions from metrics?

A simple way to distinguish dimensions from metrics is to indicate the number of metrics involved in the analysis, in a configuration file, and analyse the data by creating reports/queries that return tables with dimensions in the left side and metrics in the right side. In Python, when reading Qlik and DWH data is possible to know the total number of table’s columns, as well as the total number of metrics’ columns (mentioned in the configuration file). The number of dimensions’ columns (M) is obtained by subtracting the total number of columns by the number of metrics (N). With this, we can do 2 different iterations. The first M columns are dimensions, so its data are converted to strings, while the last N columns are metrics, and, so, its data are converted to floats.

Rounding metrics:

It is very common, when comparing data between Qlik and DWH, to get differences in metrics values. These differences often arise due to the precision of metrics, leading to values that are not equal in the last decimal places of Qlik and DWH metrics. Not being a significant difference, it makes sense to round the metrics values by 2 or 3 decimal places when comparing the data. In Python, we only have to call the function responsible for rounding values in 2 or 3 decimal places in all metrics’ columns. The workaround mentioned, previously, to distinguish dimensions from metrics can also be used to identify the metrics’ columns and iterate over them to round its values.

It is important to note that, regardless of how data is compared (Excel vs Excel; DWH vs Excel or DWH vs Selenium), the data type conversion, as well as the rounding values of metrics is done, in Python, after reading data from Qlik and DWH and before storing them in qlik_df e dwh_df dataframes variables. For the rounding metrics, these 2 variables will also have a few new columns with the data rounded to 2 or 3 decimal places.

Writing the output file - Excel for comparing data:

Before writing the Excel file with the comparison between Qlik and DWH data, it is important to check the number of rows and columns of Qlik and DWH data stored in the qlik_df and dwh_df variables. If the number of lines and/or columns is different in both, it will be returned a log file, as output, with that information, instead of an Excel file. This file is only created if the number of lines and columns is the same in both dataframes.

A possible solution for writing the Excel file created as output can be done by creating a function that takes as arguments the variables qlik_df and dwh_df, as well as the path where the file that will be returned as a result should be created. In short, this function receives, as argument, data from Qlik and DWH and writes this data to an Excel file, as well as the formulas used to its comparison, returning TRUE or FALSE if data are equal or not, respectively. The comparison’s detail is also returned (Qlik’s value versus DWH’s value). The following are the steps of the function, as well as the order of its execution.

1) Sort data: Qlik and DWH data are sorted in the same way, following an ascending or descending order.

2) Write data from DWH to the excel sheet "DWH": Data stored in dwh_df variable are copied to the DWH spreadsheet of the comparison Excel file.

3) Write data from Qlik to the excel sheet "Qlik": Data stored in qlik_df variable are copied to the Qlik spreadsheet of the comparison Excel file.

4) Write Generic Comparison Formulas: A dataframe (comparison_df) is created with a set of comparison formulas to be applied to each Excel cells fulfilled with the Qlik and DWH values from the qlik_df and dwh_df variables, respectively. The formula to be applied can be defined in a configuration file such as ‘=Qlik!??=DWH!??’ and dynamically constructed according to the number of rows and columns from the variables qlik_df and dwh_df, where the symbol ‘??’ is replaced by the cell value of the Excel file where they will be applied (for example, =Qlik!A2=DWH!A2). Figure 5 shows an example of the created dataframe comparison_df if the following variables have 5 rows and 4 columns: qlik_df e dwh_df.

python

Figure 5 – Set of formulas generated and stored in variable comparison_df for 5 rows and 4 columns.

Finally, the data stored in comparison_df variable are copied to the Comparison spreadsheet of the comparison excel file.

5) Write Detailed Comparison Formulas: A dataframe (comparison_details_df) is created with a set of comparison formulas to be applied to each excel cells fulfilled with the Qlik and DWH values from the qlik_df and dwh_df variables, respectively. The formula to be applied can be defined in a configuration file such as ‘=IF(Comparison!??=TRUE,0,CONCATENATE(Qlik!??,\" - \",DWH!??))’ and dynamically constructed according to the number of rows and columns from the variables qlik_df and dwh_df, where the symbol ‘??’ is replaced by the cell value of the Excel file where they will be applied (for example, =IF(Comparison!A2=TRUE,0,CONCATENATE(Qlik!A2,\" - \",DWH!A2))). Figure 6 shows an example of the created dataframe comparison_details_df if the following variables have 5 rows and 4 columns: qlik_df e dwh_df.

python

Figure 6 – Set of formulas generated and stored in variable comparison_details_df for 5 rows and 4 columns.

Finally, the data stored in comparison_detail_df variable are copied to the Comparison Details spreadsheet of the comparison excel file.

6) Apply conditional formatting to the Comparison and Comparison Details worksheets: For these spreadsheets, the cells are filled in green or pink depending on the result of the formulas mentioned in points 4 and 5. If the values displayed in the cells of the "DWH" and "Qlik" worksheets are the same, the respective cells in the "Comparison" and "Comparison Details" spreadsheets are filled in green. Otherwise, they are filled in pink.

python

Figure 7 – Conditional formatting applied to the Comparison and Comparison Details worksheets.

7) Apply filters and freeze the header in all spreadsheets of the output file: In all spreadsheets (DWH, Qlik, Comparison, and Comparison_Details) there is a header with the name of dimensions and metrics extracted for analysis. Headers are frozen and filters are applied to them, automatically.

python

Figure 8 – filters and freeze header applied to all spreadsheets.

After performing these steps, the comparison file is created, similar to the one shown in Figures 1-4, and is available to aid anyone analysing data.

Configuration file:

In the previous paragraphs, it was mentioned the configuration file. In this file, constants can be defined and its values can be configurable, making it easier to change. Here are some examples of configurable parameters:

• Comparison formulas to be applied in the Excel file;
• Path of input and output files;
• Connection data to DWH (Hostname; Port; User and Password);
• Link to access the Qlik report/visualization;
• Filters to be applied in a Qlik visualization;
• Number of metrics to analyze;
• Comparison mode to be used:
1 – Excel vs Excel
2 – DWH vs Excel
3 – DWH vs Selenium

Persist the result returned from comparing Qlik and DWH:

With this solution, the result obtained with the comparison between Qlik and DWH data is only available in the Excel file. If we want to save the result in a database, like Redshift, quickly and without having performance issues due to the amount of generated data, a simple workaround would be:

• Convert the excel file data to .CSV format;
• Add this new file to S3;
• Establish a connection to the Redshift and execute the COPY..FROM command to create a new table with the excel file data.

Since the data is available in Redshift, it would be possible to query them directly or even create a report in Qlik to analyse them.

Blog