4 February 2015

RapidMiner – A First Approach

In this post I'll start with a brief introduction to RapidMiner. After that I'll cover some points that matter to me, such as the analytical engines and the extensions in RapidMiner. Finally, I'll show the environment and present an example process.

Introduction

RapidMiner started in 2001 as a project at the Artificial Intelligence Unit of the Dortmund University of Technology, under the name YALE (Yet Another Learning Environment). Today it has become one of the most prominent analytical tools for business and science.

But what exactly is RapidMiner?

RapidMiner "is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics"[1]. It lets the user apply a wide variety of techniques: ETL, data preprocessing and visualization, a large range of data mining algorithms, evaluation, the creation of web-based reports and dashboards, and more.
Its great advantage over other powerful tools is that it is very intuitive: users can build all the processes they need through a graphical user interface.
In recent years, thanks to its growth and improvements, RapidMiner has become one of the most used tools of its kind. KDnuggets, a data-mining news site, ranked it first among data mining/analytics tools used for real projects in its 2013 software poll.

Processing Data Sets of Different Sizes

With the volume, variety and velocity of data coming into your organization continuing to reach unprecedented levels, the size of a data set can become a problem when you decide to do something with it. To avoid this kind of problem, RapidMiner provides different analytical engines that can be used according to the data set size:

In-Memory:

  • Default engine;

  • The fastest way to build analytical models, when the characteristics of hardware allow it.

In-DataBase:

  • Analysis is performed in the database, where the data stays;

  • Unlimited data set size, that is, the only limit is the external storage capacity;

  • Slower than In-Memory.

In-Hadoop:

  • Has the advantage of allowing a distributed storage engine;

  • If you have a powerful Hadoop cluster you can get interesting runtimes;

  • As with In-DataBase, the limit is the external storage capacity.

It's always recommended to use In-Memory mode if possible, since it is generally the fastest.
When the data size grows so much that the data no longer fits in memory, you can choose between the other two. Because of the higher setup and infrastructure costs, RapidMiner advises using Hadoop only when you have large data sets and runtime is an important issue at the same time. For the rest, and if runtime is not a problem, In-DataBase looks like the best option.
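
As a rough sketch, the rule of thumb above can be written as a tiny decision function. This is plain Python, not anything RapidMiner provides: the function name `choose_engine`, its parameters, and the simplification that data "fits" when it is no larger than the available memory are all my own assumptions for illustration.

```python
def choose_engine(data_size_gb, memory_gb, runtime_critical):
    """Return the engine suggested by the rule of thumb in the text.

    Hypothetical helper: RapidMiner exposes no such function, and
    "fits in memory" is simplified to a plain size comparison.
    """
    if data_size_gb <= memory_gb:
        # Default and generally fastest option
        return "In-Memory"
    if runtime_critical:
        # Large data AND runtime matters: worth the cluster setup cost
        return "In-Hadoop"
    # Large data, but runtime is not a problem
    return "In-DataBase"


print(choose_engine(2, 16, runtime_critical=False))
print(choose_engine(500, 16, runtime_critical=True))
print(choose_engine(500, 16, runtime_critical=False))
```

In practice the decision also depends on the infrastructure you already have, so treat this only as a summary of the advice above.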

Extensions

One interesting point about RapidMiner is the extensions provided by the company, as well as by third-party providers and the community.
An extension is an add-in that you install in the tool to get extra features that you may need and that don't exist in the application out of the box. These are offered via a kind of app store for analytical solutions called Rapid-I Marketplace (http://marketplace.rapid-i.com/UpdateServer/faces/index.xhtml).
Popular extensions currently being downloaded include, for example:

  • R connector: allows the use of the models available in R, the popular data mining tool;

  • Weka: adds more than 100 additional operators from the machine learning library Weka;

  • Text: offers statistical text analysis as well as the possibility of transforming, through different filtering techniques, texts coming from different data sources like plain text, HTML and PDF;

  • Web: gives access to internet sources like web pages, RSS feeds and web services.

A First Look

The image below shows what RapidMiner looks like. The numbered areas are:

  • Number 1 – the operators that can be used to design a process;

  • Number 2 – the parameters that you can configure for each operator;

  • Number 3 – the area where you design the process by dragging and dropping operators;

  • Number 4 – the perspective view of the processes;

  • Number 5 – the buttons to run, stop or pause the processes.

Accessing Data

RapidMiner allows us to connect to a wide variety of data sources, for example:

  • Excel;

  • Access;

  • CSV files;

  • All databases via JDBC or ODBC including Oracle, IBM DB2, Microsoft SQL Server, MySQL, Postgres, Teradata, Ingres, and many more;

  • Connector to SAP;

  • Text documents and web pages, PDF, HTML and XML.

To connect to a source, for example a CSV file, follow these steps:

  • Decide which source type you want to import and select the corresponding operator (sector 1 of Figure 2);

  • Drag and drop the operator into the Main Process (sector 3 of Figure 2);

  • Define the necessary parameters (sector 2 of Figure 2 and Figure 3).

Finally, run the process by clicking the button in sector 5 of Figure 2, and you will be able to see the contents of the selected source, in this case the CSV file (Figure 4).
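
For readers who think in code, the import above does roughly what the following Python sketch does with the standard `csv` module. This is only an analogy, not RapidMiner's implementation: the helper name `read_csv_rows`, the sample data and its column names are invented for illustration, and the `delimiter` argument stands in for the kind of parameter you set on the operator.

```python
import csv
import io


def read_csv_rows(text, delimiter=","):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)


# Invented sample content standing in for a real CSV file
sample = "outlook,temperature,play\nsunny,85,no\nrainy,65,yes\n"

rows = read_csv_rows(sample)
for row in rows:
    print(row["outlook"], row["temperature"], row["play"])
```

Note that every value comes back as text here; defining the attribute types is a separate step, just as it is in the tool.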

Defining Processes

After this brief introduction to RapidMiner, I'm going to describe a small example process in this environment.
This example extracts data from an Excel source and then validates the performance of the Naïve Bayes algorithm. It uses the following operators:

  • Read Excel: Reads the data;

  • Select Attributes: Selects just the attributes needed for the next steps;

  • X-Validation: Trains and applies a learner on the data and validates its performance;

  • Naïve Bayes: The learner algorithm;

  • Apply Model: Applies the trained model, which in this case is Naïve Bayes;

  • Performance: Measures how well the model performs.

Step 1 – Extract and define attributes

Here you select the data source, define the attribute types and keep only the attributes you need.

Step 2 – Choose a validation operator

Before deciding which learning operator to use, you normally test and compare different learners. To do this you need an operator that can evaluate them; in this case I use one of the most common, Cross-Validation.

Step 3 – Choose a learner and evaluate its performance

In this step you decide on the classifier and put it in the training area. After that, you must select an operator to apply the classifier you chose and an operator to measure its performance.

To summarize the figure above: you have a training area, where you train the classifier, and a testing area, where you apply the trained classifier and measure its performance.

Step 4 – Final Result

After running the described process, you'll see on your screen a table like the one presented in Figure 8. This table lets you evaluate things like the accuracy and precision of the classifier, and decide whether or not it is a good classifier to apply in this case.
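
If you want to check such a table by hand, accuracy and precision can be computed from the actual and predicted labels as in the sketch below. The helper name `accuracy_and_precision` and the example labels are invented for illustration; they are not taken from the process above.

```python
def accuracy_and_precision(actual, predicted, positive):
    """Compute overall accuracy and the precision for one class."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    true_pos = sum(a == positive and p == positive for a, p in pairs)
    predicted_pos = sum(p == positive for _, p in pairs)
    accuracy = correct / len(pairs)
    # Precision is undefined when the class is never predicted
    precision = true_pos / predicted_pos if predicted_pos else None
    return accuracy, precision


# Invented labels, only to show the arithmetic
actual = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "yes"]

acc, prec = accuracy_and_precision(actual, predicted, positive="yes")
print(acc, prec)
```

Accuracy is the share of all examples that were predicted correctly, while precision looks only at the examples predicted as the chosen class and asks how many of those were right.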

References

http://en.wikipedia.org/wiki/RapidMiner
http://rapidminer.com/
http://www.kdnuggets.com/2013/06/kdnuggets-annual-software-poll-rapidminer-r-vie-for-first-place.html

André Marques
Manager
Business Development