28 June 2016

Building Tableau Datasources

Building Tableau datasources isn’t a trivial task, due to the way the software is designed to work.

Tableau performance is directly impacted by the machine’s hardware: CPU, RAM and disk specifications.

This article only considers connections to a database.

Types of datasources

Tableau allows the creation of two types of datasources:

  • Live – connects directly to the source data. If a database is used, it is queried every time the dashboard changes. This is ideal when the amount of data is small or the query is performant enough;
  • Extract – intended for large volumes of data. It generates a file that can be stored on Tableau Server or directly on the user’s computer. With this option, Tableau also applies an internal optimization (its logic isn’t documented) that increases performance when compared to a Live datasource.

Performance considerations

The first thing a developer should be aware of is that Tableau represents data by combining all the columns as if they belonged to a single table. This means performance issues are more likely to occur as more columns are added to the datasource.

By default, if tables are added without specifying columns, all columns from that table are added.

To avoid the impact of unnecessary columns, there are some best practices worth following when building datasources:

  • Instead of dragging tables to the datasource builder, use custom SQL with specific SELECT statements that include only the columns required in the datasource (see the sketch after this list). When using this method, keep the custom SQL and query logic to a minimum, without unnecessary joins to other tables (ideally, creating views in the database can be an interesting approach);
  • When complex custom SQL is required, it is preferable to create views directly in the database, as they perform better than the SQL execution triggered by Tableau;
  • Hide unused columns – this can be done directly in the datasource configuration or in the column list. This is only relevant for extracts, where the resulting file is smaller because it doesn’t contain unnecessary information.

As an example, the same datasource was extracted with and without hiding unnecessary fields. In this case, around 100 fields were hidden, keeping only a few for the analysis, purely for demonstration purposes.

  • Use filters in Tableau to retrieve only the data that is relevant;

  • When creating an extract, it is possible to optimize it. Tableau provides an option that, when selected, pre-computes the values of calculated fields that would otherwise be evaluated at runtime. If data changes frequently, this option needs to be used with caution, as the data behind calculated fields might not be up to date.
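
As an illustration of the first two practices above, here is a minimal sketch. The table, view and column names are hypothetical; the point is that the custom SQL lists only the required columns, and that heavier logic lives in a database view:

-- Custom SQL in Tableau: select only the columns the datasource needs,
-- instead of dragging the whole table into the builder.
SELECT S.OrderID, S.OrderDate, S.ProductID, S.[Sales Amount]
FROM [Sales] S

-- When more complex logic is required, prefer a view created directly
-- in the database, and let Tableau connect to the view.
CREATE VIEW dbo.vw_SalesForTableau AS
SELECT S.OrderID, S.OrderDate, S.ProductID, S.[Sales Amount]
FROM [Sales] S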

Referential Integrity

Whenever possible, it is an advantage to use Referential Integrity. This ensures that all records present in a table have a corresponding record in the tables it relates to. This option is most useful when the data does have referential integrity, but it isn’t explicitly configured in the database tables.
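
For reference, “configured in the database tables” means an actual foreign key constraint enforced by the database itself. A minimal sketch, with illustrative names:

-- A foreign key constraint makes the database enforce that every
-- Sales row points to an existing Product Catalog row.
ALTER TABLE [Sales]
ADD CONSTRAINT FK_Sales_Product
FOREIGN KEY (ProductID) REFERENCES [Product Catalog] (ProductID)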

As described in the article Assuming Referential Integrity, when two tables relate through a key, the natural operation is an INNER JOIN between both tables:

SELECT SUM([Sales Amount])
FROM [Sales] S
INNER JOIN [Product Catalog] P ON S.ProductID = P.ProductID

However, when Referential Integrity is set, Tableau only queries the table the fields are being pulled from, avoiding the INNER JOIN operation and therefore decreasing the time needed to access the data:

SELECT SUM([Sales Amount]) FROM [Sales]

This is only possible because Tableau assumes that every row on the left side of the join exists and has one and only one matching entry in the table on the right side of the join.
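
Because of this assumption, it is worth checking the data before enabling the option. A simple sanity check (a sketch, using the same illustrative tables) is to count the sales rows that have no matching product; if the count isn’t zero, the two queries above can return different results:

-- Rows in [Sales] that reference a product missing from the catalog.
SELECT COUNT(*)
FROM [Sales] S
LEFT JOIN [Product Catalog] P ON S.ProductID = P.ProductID
WHERE P.ProductID IS NULL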

 

Publishing Datasources to Tableau Server

Publishing a Live datasource is quite straightforward compared to publishing an extract.

In order to publish an extract to the server, it is necessary to first retrieve all the data to the local computer. This generates a file that can then be published to the server.

It isn’t possible to publish a Live datasource to the server and then transform it into an extract directly on the server.

Tricks for publishing large extracts

Recently, it was necessary to work on a large extract (2 GB+) that took several hours to generate on the development computer. As a developer, it isn’t feasible to wait that long every time the extract needs to be published to the server.

Since an extract can be refreshed directly on the server after it has been published, a simple idea was used to decrease the waiting time for this process.

Create a simple parameter table that holds the maximum number of rows the main query is allowed to extract:
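
A minimal sketch, assuming a SQL Server backend; the table and column names are hypothetical:

-- Parameter table holding the maximum number of rows the main
-- extract query is allowed to return.
CREATE TABLE dbo.ExtractParams (MaxRows INT NOT NULL)
INSERT INTO dbo.ExtractParams (MaxRows) VALUES (100)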

This parameter is then used in combination with the main data query:
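
Again a sketch with the same hypothetical names. On SQL Server, TOP accepts a scalar subquery, so the parameter table drives the row limit (other databases would use LIMIT or a similar construct):

-- The main data query returns at most the number of rows configured
-- in the parameter table.
SELECT TOP (SELECT MaxRows FROM dbo.ExtractParams)
       S.OrderID, S.OrderDate, S.[Sales Amount]
FROM [Sales] S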

When it is necessary to create an extract for publishing to the server, this parameter is changed to a low number (say, 100), and the extract can be generated in a few minutes. The second step is to publish it to the server, update the parameter back to a large number, and trigger the refresh process directly on the server.

This operation refreshes the extract already published to the server with all the records that should be there.


Sérgio Lopes
Consultant