21 January 2015

Implementing Analysis Services “Process Add” Functionality – A Simple Approach

I was given the task of improving the processing time of a 60GB SSAS (SQL Server 2008 R2) database. This database presented the following main properties:

  • 30 dimensions;

  • 5 cubes (3 measure groups each) partitioned by month;

  • Approximately 150,000,000 (150 M) rows per measure group;

  • Total size: 60GB;

  • Full Process every 3 days (on average).

Applying a Full Process to this database takes about 1 hour, and the end user needs a data refresh every 3 days (on average). But is a "Full Process" really necessary? In this case, no. Each batch only contains changes for one particular month of data, which means we only need to process one partition (one month) per measure group.
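Processing a single monthly partition can be requested with an XMLA command along these lines (DB_A, CUBE_A, MG_A and PARTITION_201501 are placeholder IDs, not the real object names):

```xml
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>DB_A</DatabaseID>
    <CubeID>CUBE_A</CubeID>
    <MeasureGroupID>MG_A</MeasureGroupID>
    <PartitionID>PARTITION_201501</PartitionID>
  </Object>
  <Type>ProcessFull</Type>
</Process>
```

The SSIS package only has to pick the right PartitionID for the month contained in the batch and submit one of these per measure group.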
So, for this first essential improvement, I have decided to develop an SSIS package to manage and process the partitions. In short, this is how it works in the end (I'll save the details for another post):

What about dimensions?

One should not forget to focus on dimension processing (that's the purpose of this post ☺). It's a critical part of this problem, since we have 4 considerably big dimensions (a few million records each). Clearly, applying a Full Process to every dimension will take some time… The good news is that these are very high cardinality dimensions (almost degenerate) and there are no UPDATE or DELETE operations involved, hence the possibility of applying "Process Add" instead of "Full Process" or "Process Update". Also, if there are no new records to consider, we don't need to process the dimension at all! The other 26 dimensions are much smaller and don't justify any kind of optimization at this point.
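The "no new records, no processing" check can be sketched in T-SQL like this (etl.DimensionWatermark and vw_DIM_A are hypothetical names for the configuration table and the Data Mart view; they are not from the original solution):

```sql
DECLARE @LastProcessedID int;

-- Read the watermark stored after the previous dimension process run
SELECT @LastProcessedID = LastProcessedID
FROM etl.DimensionWatermark
WHERE DimensionName = 'DIM_A';

-- Only trigger the Process Add task when new rows actually exist
IF EXISTS (SELECT 1 FROM dbo.vw_DIM_A WHERE ID > @LastProcessedID)
    PRINT 'New rows found - run Process Add for DIM_A';
ELSE
    PRINT 'No new rows - skip dimension processing';
```

In the SSIS package this test would sit in an Execute SQL Task guarding the Analysis Services Processing Task.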
I will now describe how to set up "Process Add" for a dimension processing task. As stated by Microsoft, there is no UI for this purpose in this version of SQL Server ("Process Add is not available for dimension processing in Management Studio, but you can write XMLA script that performs this action."). OK, so the XMLA script should look like this:
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>DB_A</DatabaseID>
    <DimensionID>DIM_A</DimensionID>
  </Object>
  <Type>ProcessAdd</Type>
  <WriteBackTableCreation>UseExisting</WriteBackTableCreation>
</Process>
We just need to force the task to consider only the new records that need to be processed. The easiest way seems to be making the underlying Data Mart view return only the new rows. We can easily alter the view by adding something like "… WHERE ID > N", where N is the last ID that was processed into the AS dimension.
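As a sketch, with hypothetical names (vw_DIM_A for the Data Mart view, dbo.DIM_A for its source table, etl.DimensionWatermark for the table holding N), the altered view could look like:

```sql
-- Hypothetical Data Mart view, filtered to return only rows not yet
-- processed into the SSAS dimension. Reading N from a configuration
-- table avoids re-altering the view on every run.
ALTER VIEW dbo.vw_DIM_A
AS
SELECT ID, Attribute1, Attribute2
FROM dbo.DIM_A
WHERE ID > (SELECT LastProcessedID
            FROM etl.DimensionWatermark
            WHERE DimensionName = 'DIM_A');
```

Alternatively, the SSIS package could rewrite the view with a literal value for N before each run; the subquery approach just keeps the view static.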
How to find N? I considered 2 possible solutions:
Solution A: Run an MDX query that returns the last member to be processed. Something like this:
WITH MEMBER [Measures].[Max key] AS
    Max(
        {[DIM_A].[DIM_A].MEMBERS},
        StrToValue([DIM_A].[DIM_A].CURRENTMEMBER.PROPERTIES("KEY"))
    )
SELECT [Measures].[Max key] ON 0
FROM [CUBE_A]
Solution B: Query the underlying Data Mart view for the MAX (ID) and store the result in some configuration table to use on the next process task.
Solution A seems good and very direct. It also keeps working if someone decides to process the dimension in Full mode (without somehow storing the maximum ID). The problem is the time it takes to execute the MDX query: the Max() function can get pretty slow when dealing with a very large set! So… solution B is the one to go for!
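Solution B can be sketched like this (again, etl.DimensionWatermark is a hypothetical configuration table and vw_DIM_A stands for the Data Mart view):

```sql
-- After a successful Process Add, store the highest processed ID so the
-- next run knows where to start from. ISNULL keeps the old watermark
-- when the (filtered) view returned no new rows.
UPDATE etl.DimensionWatermark
SET LastProcessedID = ISNULL((SELECT MAX(ID) FROM dbo.vw_DIM_A),
                             LastProcessedID),
    LastProcessedOn = GETDATE()
WHERE DimensionName = 'DIM_A';
```

A cheap MAX(ID) against the relational source replaces the expensive MDX Max() over millions of members.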
The diagram for the entire process:

Here are the final test results:

Dimension Name | Total number of records | Records to add (approx.) | Process Full duration (min) | Process Add duration (min)
Dimension A    | 14.8 M                  | 5%                       | 15.3                        | 5.4 (-65%)
Dimension B    |  9.0 M                  | 5%                       |  8.3                        | 3.4 (-59%)
Dimension C    |  0.2 M                  | 5%                       |  0.3                        | 0.2
Dimension D    |  0.1 M                  | 5%                       |  0.2                        | 0.1

 

You can notice the duration improvements in dimensions A and B. However, I expected a better result, since there were only 5% more rows to process. Looking at the XMLA process outputs, I noticed that most of the "ProcessAdd" time was spent processing hierarchies and indexes for the existing dimension attributes.
As expected, the "reading data" and "writing data" tasks were almost immediate. Dimensions C and D didn't show any improvement, probably due to their relatively small sizes. I decided to keep the "Process Update" tasks for these two (to maintain process simplicity ☺). That's it for now, I hope it helps! I'll try to post some new results/conclusions as the model grows in size.


      Pedro Teixeira
  Associate Manager
Solutions Development