17 November 2017

SQL Server R Services

1. R, BEYOND A SIMPLE LANGUAGE

R, in addition to a programming language, is also an environment where it is possible to explore, manipulate and represent data with an essentially statistical approach. As such, this tool is dedicated to the field of Data Mining, since it allows the user to have a vision of how the data relate, behave and tend to behave in the future.

1.1 HOW IT WORKS

The R tool works based on mathematical functions and functions for predictive purposes. These functions are usually stored in packages, which need to be imported beforehand, to be able to enjoy their functions.

R also has an official repository for online packages, known as CRAN (The Comprehensive R Archive Network), where the entire community can share their work and provide new functionalities implemented, assuming that everything complies with the rules of the repository.

2. R SERVICES

SQL Server has a Stored Procedure called by sp_execute_external_script, which allows you to interpret other languages, as is the case with R. With the new update, SQL Server 2017 will also have the ability to interpret the Python language.

In order to use this Stored Procedure, it is necessary to pay attention to the syntax used. The following image shows the structure that the code should take.

 

2.1. CODE STRUCTURE

2.1.1. LANGUAGE

In this first section, the language to be processed in script execution is passed to SQL Server.

2.1.2. SCRIPT

The @Script parameter is where the code is invoked, and users can take advantage of all the functions that R has to enhance their work.

2.1.3 INPUT DATA

This is where we define the data to work. The query defined in this parameter will dictate the DataSet that will be subject to transformations under the R functions executed in @Script.

2.1.4. OUTPUT DATA

This parameter defines which variable is given as the result of @Script, if you want the same result to be Output.

2.1.5. PARAMS

If you want to pass to the @Script variables defined outside the Stored Procedure sp_execute_external_script, you must define the metadata of the same, and transmit to the SP the name that will represent them when the R code is executed.

In this case the external variables are @alpha, @betas and @gamas, and @a, @b and @s are the variables that will represent them in the @script parameter.

3. APPLICATION

For an example of application of this tool, this article will describe a case study that aims to determine the number of units shipped, each product, of a warehouse, in several future time periods.

3.1. ALGORITHM TO BE USED

In the first instance, it is necessary to analyse the dataset in which to work, to know the data and its behaviour. In this case, we are dealing with a seasonal time series, and as such, the method chosen to address the situation was the Holt-Winters algorithm, also known as Triple Exponential Smoothing.

3.2. HOLT-WINTERS

This algorithm applies only to models of Seasonal Time Series, that is, time series that repeat their behaviour between constant periods.

The algorithm receives three parameters, these parameters that define the weight of the value of the components of the current period to the value of the components of the previous period of the series.

The three components of the algorithm are: Level, Trend and Seasonal Component.

And are defined as follows respectively, the latter value being the result of the forecast.

 

3.3. MODEL

Having defined which method to use, it is now necessary to build the model that will generate the intended forecasts. The data will be divided into 2 groups, training group, which will teach the model to behave in the future, and test group, which will allow comparing the predicted data with reality.

It will also be necessary to find out which parameters generate the best possible model for the training data.

This can be a time-consuming process, and as such, an automatic calculation method was created in order to generate models for the 10 products with the most stored units. The method consists of iterating for each product, models that have alpha (level component) and beta (trend component) variables, and gamma (seasonal component) constant. For each iteration, the MAE (Average Absolute Error) of the model is calculated as well as the number of values that are in a 10% error window for the values of the test group.

All of these values, i.e. product number, date, model parameters, MAE and Quantity of acceptable values, are stored in a table that receives these values for each iteration of the process.

The code that follows represents all the reasoning described above.

 

 

 

 

 

 

Once this process is executed, we have a table that contains the best model associated with each of the products.

Therefore, in order to achieve the desired goal, it remains to apply the forecast to the models obtained, in this case a 12-month forecast will be made.

 

 

 

At the end of the process, a final table is obtained that allows the user to cross the expected data with the actual data, present in the test model, and thereby prove the quality of the implemented model.

4. CONCLUSION

As you can see throughout this article, SQL Server R Services is a very useful and powerful tool that allows you to go beyond a simple search in the data. It has a strong interaction with the conventional Transact-SQL and allows manipulating the data contained in the Databases in a very interesting way. It may be that the disadvantage of a first contact is not easy and intuitive, but once the initial difficulties are over, it is a great value to bear in mind.

.
.
.
.
.
.
   Diogo Candeias
 Junior Consultant