11 January 2019

Logical Data Warehousing Validations

Introduction

This article demonstrates three logical data warehousing validations that should always be conducted so that the quality of data can be assured in terms of not having duplicates in the tables, not having nulls or/and empty strings and not violating the referential integrity in the data warehouse.

The validation to check for duplicates consists of verifying if there’s more than one record for the same key in a table.

The purpose of the nulls and empty strings validation consists of making sure that no column has nulls or empty strings in the data warehouse because the tables cannot be presented with either.

As for the validation regarding referential integrity, the purpose is to ensure that all the values of the surrogate keys (SK) are correctly referenced between the dimensions and the fact tables in the data warehouse.

Below is the example model that will be used for this article to demonstrate the data issues that can be identified with the three logical tests:

logical data warehousing validations

 

Check Duplicates

The validation for checking duplicates consists of detecting if there is more than one record for the same key in a table. This validation is very important in the context of a data warehouse. Let’s take for instance the example model of this article. In case there’s a duplicated record for a sale, the total value of the sales for a specific product will be incorrectly inflated.

Dimension

To make these validations it is possible to use a set of queries that validate all the dimensions that exist in the data warehouse. These validations are made according to the respective key of the table.

     1. Key with only one column

Below there’s an example of a dimension with a key made of only one column. In this case the column that defines the dimension key is the Product_Code because there can only be one product per record in the dimension D_Product. As can be verified, in the first two records we can find a duplicate record.

To verify the duplicate above, it is possible to do so by using the following query:

The query should return 0 results. If it returns any record, it represents a duplicate that can consist of one or more lines, depending on the quantity of records that it returns. This quantity of records can be observed in the column number_of_duplicates.

In this case, the query will return one record, which represents a duplicate for the product_code PRODUCT1.

     2. Key with multiple columns

To verify if a table has duplicates, where the key is made of multiple columns, let’s take as example the Snowflake table sales_category of our model. The SQL query must consider the set of columns that make up the highest granularity of the table. In this case it is the combination of product and client, since there can only exist a category for each combination of both.

As such, the query for the sales_category is created in function of those two columns. So, to verify the duplicate above, we use the following query:

The query should return 0 results. Same as before, if it returns any record, it represents a duplicate that can consist of one or more lines, depending on the quantity of records that it returns. This quantity of records can be observed in the column number_of_duplicates.

By executing the query above, we will have one record returned, which is the duplicate.

Fact Table

   1.Key by Degenerate Dimension

Using our example model, let’s suppose that the fact table F_Sales has a column named Transaction_ID that is defined by a unique reference for each sale in the fact table. The level of granularity is, as such, defined at that column, which in this case the column represents a degenerate dimension.

The query used to find duplicate records for the example above is the following:

The query should return 0 results. Same as before, if it returns any record, it represents a duplicate. The quantity of records can be observed in the column number_of_duplicates.

     2. Key with multiple columns

The fact table level of granularity is normally defined at the combination of the unique dimension identifiers, which are foreign keys or surrogate keys, attributes and non-additive measures such as unit price, for example.

Example of records in the table F_Sales:

Using once again our multidimensional example model, the query to search for duplicates in the fact table is the following:

The query should return 0 results. Same as before, if it returns any record, it represents a duplicate. The quantity of records can be observed in the column number_of_duplicates.

By executing the query above, we will have one result, which represents a duplicate. Looking closer at this record, it can be a duplicate or a missing aggregation, which translates into 2 records instead of 1. In either case, there is an issue and it has been detected through this validation.

Validate the Referential Integrity

The tables that make up a data warehouse in a multidimensional model must obey the referential integrity rule, meaning that all foreign keys must necessarily have the correspondent primary key in the source table. The same applies to tables that are using artificial keys such as Surrogate Keys.

It is important to make this validation because, for instance, if a fact table has product keys that don’t exist in the dimension, this means that there are products that are being accounted for in an incorrect way.

Looking at our multidimensional model example of this article, a few analysis cases follow:

     1. From Fact Table to Dimension

The objective is to ensure that all the surrogate keys present in the fact table F_Sales exist in the correspondent dimensions. In other words, all the values in the column sk_date present in F_Sales must exist in the column sk_date in the dimension D_Date, and the same logic applies to the other SKs.

The query to validate if there are sk_customer in F_Sales that don’t exist in D_Customer is the following:

The query should return 0 results. If it returns any record, it means that there are sk_customer in F_Sales that don’t have a reference in the dimension D_Customer.

     2. From Snowflake to Dimension

In the same way, we must ensure that a Snowflake table has all the correspondent surrogate keys present in the dimension table.

Going back to our model, the following query validates if all the sk_customer that exist in the sales_category table also exist in its source table (D_Customer):

Check for Nulls and Empty Strings

The purpose of this test is to detect the existence of Nulls and Empty Strings, since neither should exist in any data warehouse table. The validation on Nulls and Empty strings is checked for data types VARCHAR and NVARCHAR, for the remaining data types (e.g. integer, double, …) the validation is done only for nulls.

Analyzing the example below, for the second sale we can verify that a value is missing for the invoiced_sales (null). These cases cannot happen in a data warehouse as we can’t present a record of a sale without knowing its value. Another example is an empty string in an attribute, as can be verified in the third sale of the same table. In this case, without this value, the source of the sale is unknown, which can lead to conflicts in the presentation layer if the source of the sales is being taken into consideration in the calculations presented.

It is possible to make these validations by using a set of queries that validate each column in the existing table. To analyze the cases above the following queries can be used:

  • To validate the column Source_System (data type varchar)

  • To validate the column Invoiced_Sales (data type decimal)

  Ana Russo            
 Consultant     

 

 
Tiago Marques       
   Consultant     

 



					
Blog