Long Running “Read Data Source” – Dimensions
Some time ago, we spotted a cube that took around 3 more hours building in Quality than in Production. This was the log file generated:
Notice that the first row comes in 5 minutes after the request, but then Transformer takes 3 hours and 37 minutes to load 85k rows. This issue would show any time of day, and could be replicated by using the “Generate Categories” on that datasource. Why would this happen?
Isolating the Issue
The first thing you should notice, is how the steps 0.3 , 1.1 , and 1.2 are timed together in the logs. This means that when that one step takes 3 hours, there is no way you can know immediately whether the issue is: the database, the data link (between db and Cognos), the writing of categories on disk, or even the disk itself. Not sure what your first guess would be, but our first logic thought was looking into the database. This was re-enforced by the test below, on which we pointed the Quality Server to the Production Database, and all went smoothly:
Something should be happening on the database, and the rows were being returned at a very slow rate.
Hypothesis 1: Database
In order to test the database as being the source of the issue, you need to isolate its performance. We opted for extracting the query into a CSV file using a Query Tool. Depending on what you find here, if the extract does take the 3h, you may then need to run the extract on the server itself to rule out the Data Link as being the bottleneck. However, in this case that was not necessary, as the extract of the whole query took around 10 minutes using my laptop. This proves that the database is not the issue, but then, what could it be?
Hypothesis 2: Category Creation
All debug till now clearly isolated the issue inside the Category Creation step. The data volume were not the issue here for sure, for these servers 85k rows should be like a walk in the park… furthermore, and having similar cardinalities, why would it behave so differently in Production and Quality? (the infrastructures were extremely similar) At this point I started thinking it had to be from the Data Pattern, and people would ask me what I meant with it. I just knew it was not because of the amount of data, not from the servers, nor from the table layout, so it had to be something with the data itself. In order to test this theory, I needed to eliminate the “data read” performance from the picture. In order to do so, we extracted the data from the Data Source Query into a CSV file from both PRD and QA, and changed the cube model to generate categories from a CSV source present in the local disk. This allowed a test where everything was the same (model, server, “data feed” performance) except for the data itself. Results were as follow.
With this information available and with the help of IBM support, we started suspecting the “Move” option could be playing a part in all this issue. Removing the Move option from this dimension, the following errors were observed:
PRD: TR2318 Transformer has detected 31313 attempts to create a category in more than one path.
QA: TR2318 Transformer has detected 101372 attempts to create a category in more than one path. and, on both tests, the read data took less than 2 minutes.
Worth noticing also, is the fact that more moves were happening in QA than in PRD, and this was coherent with the results we had previously (worst performance in QA). This clearly isolated the issue as being related with the Move option, and the fact that Transformer was being forced into an abnormal amount of moves. (I recommend a quick read through the online documentation on the Move option here).
The “Move” Effect
To explain this behaviour, we will need to focus on both the functionality and on the data patterns we had on the data warehouse. In this client (as in many others), we use denormalized dimensions. In order to represent hierarchical levels that are not granular, a -1 / Not Valid was used for non-assigned levels:
United Kingdom > -1/Not Valid > -1/Not Valid
United Kingdom > England > -1/Not Valid
United Kingdom > England > London
United Kingdom > Scotland > -1/Not Valid
On a first glance, it does not seem to trigger any warning flags. The not valid levels of the hierarchy are identified with a default value, as in so many other hundreds of data warehouses. However, the problem is how this interacts with the Move capability, and how these 4 rows have just created 2 moves:
Moving categories takes time, and even more time when 101.372 moves are needed. We have also identified that this issue goes worse when we increment the number of parallel hierarchies on the offending dimension. The solution implemented was pretty simple: exclude the rows that have city not classified (except for the actual “Not Classified on all levels” row).