26 October 2016

Talend Data Integration Studio

What is Talend?

The main goal of this article is to show some of the key capabilities and features of Talend Data Integration, as well as how its complete life cycle is managed as an ETL tool.

Although this tool has been on the market for some years, in the last four it has gained considerable prominence, with demand much higher than that of its direct competitors, and Gartner currently considers it a visionary in the Data Integration category.

One big advantage of this tool is that Talend is an Amazon AWS partner and one of Amazon's main recommendations for use with Redshift and EMR (Big Data).

Another of its main advantages is that it has an open source version with connectors to all the main source systems (https://www.talendforge.org/components/index.php). In addition, the same tool provides, in an integrated way, data management, master data management, data quality, big data and application integration.

All the components are managed in an integrated way in Talend Administration Center (TAC), where it is also possible to schedule executions of jobs created in Talend without the need to deploy them, since TAC is connected to the repositories where the jobs are developed.

Repository Manager is the tool that allows us to manage migrations between environments and to choose which versions of the jobs will be migrated.

The tables below show the available editions and the components that each edition of Talend includes.

How to create jobs in Talend?

The first step is to install the tool (https://www.talend.com/download/talend-open-studio). We can use either Open Studio, where only the development studio is available, or the enterprise version, which includes the studio and the TAC to manage all components of the platform.

In the Open Studio version, we need to use third-party software to schedule the execution of developed and exported jobs. As a result, every time a job changes, we have to export its latest version again so that the scheduled daily execution runs it.

In the enterprise version, the TAC schedules the job execution. Since TAC is connected to the repositories that hold all developed jobs and all their versions, it is not necessary to deploy a job again every time it is changed.

In the enterprise version, Talend should be installed on the server (including TAC), with TAC connected to a repository (SVN, Git and Nexus). Talend Studio is installed on the local computer and connects to the repository via TAC. This way, all the jobs created are immediately available in TAC and ready to be scheduled.

One of the main good practices is to use contexts in Talend: local or global variables for each job, which can be passed down a job tree through each parent-child job connection. To do so, it is enough to export as context the connection details of the data sources or of any external repositories used.

In the job, we should use contexts to create the connections to the source systems. This way, the job can receive from its parent jobs different variable values pointing to different environments.
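
As a hedged illustration of how contexts let the same job point at different environments: the launcher script of a job built in Talend Studio accepts a `--context=<name>` argument and `--context_param name=value` overrides. The sketch below only assembles such a command line in Python and does not run anything. The script name `load_customers_run.bat`, the context name `Prod` and the variable `db_host` are hypothetical and not taken from this article.

```python
from typing import Dict, List, Optional


def build_job_command(
    launcher: str,
    context: str,
    overrides: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the argument list for running an exported Talend job.

    launcher  -- path to the generated launcher script (hypothetical name below)
    context   -- name of the context group to load, e.g. "Dev" or "Prod"
    overrides -- optional context variables to override for this run
    """
    command = [launcher, f"--context={context}"]
    for name, value in sorted((overrides or {}).items()):
        # Each override is passed as: --context_param name=value
        command += ["--context_param", f"{name}={value}"]
    return command


if __name__ == "__main__":
    # Hypothetical job and variable names, for illustration only.
    print(build_job_command(
        "load_customers_run.bat",
        "Prod",
        {"db_host": "prod-db.example.com"},
    ))
```

Keeping the environment-specific values outside the job, as arguments, is what allows one exported build to be reused for development, test and production.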

To create master jobs that execute sub-jobs in a parallel or synchronized way, there is the tRunJob component, which allows contexts to be passed from a master job to its sub-jobs, down to the lowest level of Talend jobs.
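
The way a context travels from a master job down to its sub-jobs can be pictured with a small Python model. This is not Talend code and not how tRunJob is implemented; it only mimics the idea that each sub-job starts from its own default context and receives the values passed by its parent. The `Job` class, the job names and the variables are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Job:
    name: str
    defaults: Dict[str, str] = field(default_factory=dict)  # the job's own context values
    children: List["Job"] = field(default_factory=list)     # sub-jobs it would call

    def run(self, inherited: Dict[str, str], log: Dict[str, Dict[str, str]]) -> None:
        # Values received from the parent override the job's own defaults.
        effective = {**self.defaults, **inherited}
        log[self.name] = effective
        for child in self.children:
            # Pass the whole effective context one level down.
            child.run(effective, log)


def run_tree(master: Job, context: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Run the master job and return the context each job ended up with."""
    log: Dict[str, Dict[str, str]] = {}
    master.run(context, log)
    return log


if __name__ == "__main__":
    leaf = Job("load_table", {"db_host": "localhost", "batch_size": "500"})
    middle = Job("load_schema", {"db_host": "localhost"}, [leaf])
    master = Job("master", children=[middle])
    for job_name, ctx in run_tree(master, {"db_host": "prod-db.example.com"}).items():
        print(job_name, ctx)
```

In this model, changing `db_host` once at the master level is enough to redirect every sub-job, which is the benefit the context-passing approach aims for.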

How to deploy jobs in Talend?

In the enterprise version, the TAC is connected to the repositories where the developed jobs are located, so it is enough to schedule the intended jobs in the Job Conductor of TAC.

In the Open Studio version, we need to use third-party tools to schedule the automatic execution of jobs. One option is Windows Task Scheduler, and the steps are the following:

1 - Create a build and export the job (select the option to extract the zip file)

2 - In Windows Task Scheduler, create a new scheduled task and point it to the ‘.bat’ file generated in the previous step.
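
The scheduled task from step 2 can also be created from the command line with the Windows `schtasks` tool instead of the graphical interface. The sketch below only assembles that command in Python. The task name, the path to the ‘.bat’ file and the 02:00 start time are hypothetical, so adjust them to the folder where the build was extracted.

```python
import subprocess
from typing import List


def build_schtasks_command(task_name: str, bat_path: str, start_time: str = "02:00") -> List[str]:
    """Return a `schtasks` command that runs the exported job's .bat file daily."""
    return [
        "schtasks", "/Create",
        "/TN", task_name,   # name of the scheduled task
        "/TR", bat_path,    # program to run: the launcher generated by the build
        "/SC", "DAILY",     # schedule type
        "/ST", start_time,  # start time in 24-hour HH:mm format
    ]


if __name__ == "__main__":
    command = build_schtasks_command(
        "TalendDailyLoad",
        r"C:\talend\jobs\load_customers\load_customers_run.bat",
    )
    print(" ".join(command))
    # On a Windows machine, uncomment to actually register the task:
    # subprocess.run(command, check=True)
```

Because the task points at the extracted ‘.bat’ file, the job has to be built and extracted again to the same location whenever it changes.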

Marcos Fernandes
Consultant