30 March 2016

Approach for a simplified custom Data Quality Monitor and Notifications Solution

If you are a Business Intelligence professional, it isn't uncommon for most of you to spend many hours looking at data, trying to identify the reason why it's not showing the expected values or it's missing. When we finally find the root cause, once again, it isn't uncommon to be due to one of the following reasons:

> Issues on Source Data
> Missing configurations
> Jobs that failed in the overall process

Once all this intensive investigation is completed, we then may have to start new actions like adding new configurations, contact business users, contact other IT teams… in order to correct this situation. Then some days, weeks or months later, we may have again a similar issue. When this happen, we may or may not remember what we did before and we may need to repeat the same investigation activities we have done before. In many of these situations, we may not notice the error right away and we will probably be informed by a business user that something strange is happening, with a negative impact to them.

In order to mitigate this kind of situation, we already have on the market Data Quality systems (paid or free) that can help us. From experience, installing new software on the clients machines is not so linear and may requires some approvals that sometimes are not easy to obtain. For this kind of situations, where I need something that can be easily implemented and that can reduce the effort to monitor data and configurations… having a custom solution like this, has proven to be quite handy in my daily work. It may not be a state-of-the-art solution, but it can help us and we can always evolve to a more "Official" tool.

Given this premise, I've worked on an algorithm that can be quickly applied to a project. As it can be fully developed on SQL, it can also work with multiple database technologies. The purpose of this article is to present the high-level algorithm that can be implemented once and then re-used on different projects. This solution consists of two parts:

> Quality engine, where we store the tests logic and the test results
> Notification engine, where we can inform users of an issue. This notification is done by email. We can setup multiple notification rules and also recipients

Simplified Algorithm

The simplified algorithm is show below:

As a summary, we have a manually or schedule execution of the test system on the top left of the image. These results are stored in a central test results table (one per day/time and test). This is then picked by the notification system that may send this information by email with the failing data. This notification, besides any relevant data, may also include a description of the needed steps in order to solve an issue. This is especially useful in the case we send the notifications directly to the business users. Besides the information in the quality system, as these two systems can work independently, we are also able to include additional information coming from external error tables that may be useful for the user.

The notification by email has a limitation on the size of the attachments that change from company to company. Nevertheless, this email solution has also some advantages. One of them is to be able to send notifications directly to business users and tell them which actions they need to take in order to correct a situation. The second one is our ability to receive the results directly in the mailbox, without the need to open additional tools or reports.

Detailed Algorithm

For a more detailed explanation of the idea, we can look at the below diagram.

Quality System

> Tables

> Tests Table - contains the tests to be executed by the quality system. Per each test, there is a column with the SQL that will be executed and will test a certain condition

> Test Results Table - contains the results of the tests executions, one test result per date. Stores the output of the test results. No detailed data is stored here regarding the actual failing data

> Rejected Records Table - May contain actual sets of inconsistent data. When compared with the Test Results Table, this one may contain the actual failing records

Stored Procedures
> Execute Data Quality Tests - This is the main core of the Quality System. This is used to call each test in the Tests Table, run the tests SQL code and store the results in two tables, one for the test result and the other one for the rejected records.

> Quality Tests Manage Error Tables - Having a rejected records table, unless the size of the table is controlled, it can be a problem as the tests are normally executed once a day. In order to avoid the uncontrolled increase of the table, we need to have a Stored Procedure that can delete old records.

Logic
> A test is selected from the Tests Table, manually or by schedule. The tests logic should always be if returns records, it's a failure, if not it's a success

> The test SQL is executed and is tested if the number of returned rows is zero or different from zero

> If it's zero, it means the test was successful. This is stored in the Test_Results table as a success and no record is put into the Rejected Records Table

> If it's not zero, it means the test was unsuccessful. This is stored in the Test_Results table as a failure and we can store the error records in the Rejected Records Table

> After all tests are executed, the Stored Procedure to clean Error Tables is executed in order to delete old records and avoid an excessive size of the error tables

Notifications System

Tables
> Emails Configurations Table - contains the emails addresses to be used for the notifications and also the rules for the notification. The rules can be the email subject, body, attachment rules, recipients...

Stored Procedures
> Quality Tests Notify - This is the main core of the Notifications System. This is used to call each notification in the Notifications Tables. It gets the detail of the notification, collects all the tests results, and send the notification

Logic
A notification is selected from the Notifications Table, manually or by schedule.
According with the Notification parameters, the notification will be sent
For the ones that are sent by email, the user will receive an email with the attachments of the inconsistencies

Conclusion

As stated, the idea of this presentation is to give a High-level solution of a custom Quality and Notifications system, which can be useful for small projects or for well-defined situations. This can be an entry solution to test and validate this kind of tools, and we can always move forward and apply one of the existing market solutions. This can also be useful in project mode, in order to control potential issues with the data and in a pro-active way to detect issues and correct them.

       
.
.
.
.
.
.
       Pedro Nunes
          Manager
Solutions Development
Blog