Most companies that work with (Big) Data face a real problem when it comes to the variety of sources and types of information they must ingest into their systems. One of the main concerns is unstructured data, typically provided by Business Users in worksheets or coming from relational source systems without proper data management rules, among others. Managing this master data in Excel is almost impracticable. Although this issue also arises in a traditional BI model, this article looks at it from a Big Data point of view, using Talend Open Studio for MDM. The greatest change that MDM brings to the Big Data ecosystem is the possibility of integrating the outputs created by MDM tables into HDFS, so that rows can be deleted and updated easily and cleanly. The other big opportunity is that it becomes possible to keep track of the changes users make to the MDM tables.
Much of the data mentioned above is classified as master data. It can refer to multiple core business entities such as Customers, Suppliers, Employees, Products, Assets, etc.
Master Data Management systems were created to help companies manage and consolidate the type of information described above. In general, they should meet some key requirements such as:
• Define and maintain metadata for master data entities in a repository
• Acquire, clean, de-duplicate and integrate master data into a central master data store
• Offer a common set of shared master data services that applications, processes and portals can invoke to access and maintain master data entities, i.e. system of entry (SOE) MDM services
• Manage master data hierarchies including a history of hierarchy changes and hierarchy versions
• Manage the synchronization of changes to master data to all operational and analytical systems that use complete sets or subsets of this data
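The "acquire, clean, de-duplicate and integrate" requirement can be illustrated with a minimal Python sketch. It is not how Talend MDM implements matching; the field names, the sample customers and the "first non-empty value wins" merge rule are all assumptions made for illustration.

```python
def normalize(value):
    """Trim, collapse repeated whitespace and lowercase a value for matching."""
    return " ".join(value.split()).lower()


def consolidate(records, key_fields):
    """Merge records that share the same normalized key.

    Assumed survivorship rule: for each field, the first non-empty
    value encountered is kept in the consolidated ("golden") record.
    """
    store = {}
    for record in records:
        key = tuple(normalize(record.get(f, "")) for f in key_fields)
        golden = store.setdefault(key, {})
        for field_name, value in record.items():
            cleaned = " ".join(value.split())  # basic cleaning step
            if cleaned and not golden.get(field_name):
                golden[field_name] = cleaned
    return list(store.values())


# Hypothetical customer rows coming from two operational systems.
customers = [
    {"name": "Acme  Corp", "country": "PT", "phone": ""},
    {"name": "acme corp ", "country": "PT", "phone": "+351 000 000"},
    {"name": "Globex", "country": "US", "phone": ""},
]
golden_records = consolidate(customers, key_fields=("name", "country"))
for row in golden_records:
    print(row)
```

The two "Acme" rows collapse into one record that carries the phone number from the second source, which is the kind of consolidation a central master data store is meant to provide.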
The MDM systems described in this article are increasingly being embraced by Organizations to control their master data and improve business performance. These companies realize that, without an MDM solution, their master information is more likely to be duplicated and fragmented across multiple operational systems. This makes it difficult to understand which data is the source of truth and if/how the data gets synchronized across systems.
Important vendors such as DataFlux, IBM, Talend, Informatica, and Sypherlink are betting on this type of tool. Some of the tools currently available on the market are:
• Hyperion MDM
• IBM WebSphere Product Center and Customer Center
• Kalido 8M
• Oracle Customer and PIM data hubs and Sunopsis AIP
• SAP NetWeaver MDM
• Talend Open Studio for MDM / Talend MDM Platform
In the rest of this article, we focus on the installation and present a real Use Case for master data management using the Talend Open Studio for MDM tool and the Talend MDM Web User Interface.
1.2. TALEND OPEN STUDIO FOR MDM (INSTALLATION)
Installation instructions for Talend MDM are currently scattered across the internet rather than gathered in one place, which is what makes this installation guide relevant.
Talend has two different MDM tools available:
1) Talend Open Studio for MDM – a free, open source tool developed by Talend, with features such as:
• Design and productivity tools: Eclipse-based developer tooling and job designer, export and execute standalone jobs in runtime environments, embedded data validations, and business rules, automatic data integration with MDM models;
• MDM Web Application: Master data repository, fully functional MDM environment, complete Web UI for master data management, model-driven user interface;
• Connectors: Cloud – Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform; RDBMS – Oracle, Teradata, Microsoft SQL Server; SaaS – Marketo, Salesforce, NetSuite; Packaged Apps – SAP, Microsoft Dynamics, Sugar CRM; Technologies – Dropbox, Box, SMTP, FTP/SFTP, LDAP; Web services – SOAP, REST/HTTP;
• Components: Control and orchestrate data flows and data integration with master jobs; basic matching and grouping of entities; map, aggregate, sort, enrich and merge data;
2) Talend Master Data Management Platform – available under a subscription license. It has all the tools available in Talend Open Studio for MDM plus a few more:
• Data quality and Governance: data profiling and analytics with graphical charts and drill-down data, automate data quality error resolution and enforce rules, data masking;
• Data preparation and Stewardship: import, export and combine data from Excel or CSV file, export to Tableau, self-service on-demand access to sanctioned datasets;
• Master data management: Visual modelling and import/export of data models; integrated workflows for data stewardship and governance; MDM query language to consume REST data access; master data full text search and ad-hoc queries, impact analysis, audit trail and dependency enforcement; MDM activity monitoring dashboard; management of multiple and recursive hierarchies; role-based security and Active Directory integration;
• Advanced Data Profiling: Fraud pattern detection using Benford's Law, column set analysis, advanced matching analysis, time column correlation analysis.
In this article, we explain the installation of Talend Open Studio for MDM and Talend MDM Server, both in version 6.4.1.
This version is available on the Talend website:
1.1.1. STEPS TO INSTALL TALEND OPEN STUDIO FOR MDM
The download has two files. You will need to unpack the ZIP file to a specific location on your PC or server:
The downloaded version is TOS_MDM_Studio 6.4.1, which we unzipped into our main C: drive:
1.1.2. STEPS TO INSTALL TALEND MDM SERVER
1) When you run the exe file, a Java Platform warning is displayed. Allow it (Talend is Java based, and Java is needed to run the Tomcat application server):
2) Click OK to select the installation language:
3) Click Next to start the Talend MDM Server 6.4.1 installation:
4) Click Next to accept the terms of the license agreement:
5) Click Next after you have read the information regarding the Java and MIT licenses:
6) Select the packs to install:
This step is important because, here, you decide if you want to install both Talend MDM application and Apache Tomcat Server. If you already have Apache Tomcat installed on your server/machine, you will not need to install it again. In this case, since we installed Talend MDM on our local machine, we installed the Apache Tomcat for MDM Server as well.
7) Select the installation path for the MDM Server:
8) Define the port for the MDM Server service:
9) Select the database type (H2 Embedded is the only option available):
10) Define the username and password to access the database:
11) Define the database index directory:
12) Finish the installation by confirming the selected packs and the installation path:
Important: for the MDM Server to work, the JAVA_HOME environment variable must point to the correct location of the Java Runtime Environment installation, as shown below:
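A quick way to verify this before starting the server is a small script such as the Python sketch below. It only assumes the conventional Java layout, in which the launcher lives in a bin folder under JAVA_HOME (java.exe on Windows, java elsewhere); the function name check_java_home is made up for this example.

```python
import os
from pathlib import Path


def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to try out with other values.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    bin_dir = Path(java_home) / "bin"
    # Look for the Java launcher under either its Windows or Unix name.
    if any((bin_dir / name).is_file() for name in ("java.exe", "java")):
        return True, f"JAVA_HOME points to {java_home}"
    return False, f"no java executable found under {bin_dir}"


ok, message = check_java_home()
print(message)
```

If the script reports a problem, correct the variable in the system environment settings and open a new command prompt so that the change is picked up.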
To start the MDM server, right-click the catalina.bat file and run it with elevated privileges:
Once the server has started successfully, a "Server startup" message is displayed, as in the image above.
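Besides reading the console output, you can confirm that the server is accepting connections on the port defined during installation. The Python sketch below simply tries to open a TCP connection; the host localhost and the port 8180 are assumptions, so replace them with the values you chose in step 8.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Assumed values: adjust to the port selected in the installer.
    print("MDM Server reachable:", is_listening("localhost", 8180))
```

This only proves that something is listening on the port, not that the MDM application itself has finished deploying, so the console message remains the primary check.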
TALEND OPEN STUDIO FOR MDM – USE CASE
In the following Use Case, we show how a manually maintained table is ingested from an Excel file.
First, it is important to define the key terms used in Talend Open Studio for MDM and the Talend MDM Web User Interface. We focus on the most important ones, which are described below:
1) Data Container: holds data of one or several business entities. Data containers are typically used to separate master data domains.
2) Data Model: defines the attributes, user access rights and relationships of entities mastered by the MDM Hub. The data model is the central component of Talend MDM and maps to one or more entities that can be explicitly defined.
a. Entity: describes the actual data, its nature, its structure and its relationships. A data model can have multiple entities.
b. Record: an instance of data defined by a data model in the MDM Hub. For example, two records that are considered similar, or a close match, may be merged.
3) View: a complete or subset view of a record. A complete view shows all elements or columns in an entity, while a subset view shows some of the elements or columns of an entity. A view may restrict access to attributes of a record depending on who or what is asking for the data.
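To make the relationship between these terms concrete, here is a minimal Python sketch of how they fit together. It is a conceptual model only, not Talend's internal representation; the class names mirror the terms above, while the Product entity, its elements and the sample record are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Structure of one business entity: its elements and its key."""
    name: str
    key: str
    elements: list


@dataclass
class DataModel:
    """A data model groups one or more entities."""
    name: str
    entities: dict = field(default_factory=dict)

    def add_entity(self, entity):
        self.entities[entity.name] = entity


@dataclass
class View:
    """A complete or subset view over the elements of an entity."""
    entity: Entity
    visible: list

    def project(self, record):
        # Keep only the elements this view is allowed to expose.
        return {name: record[name] for name in self.visible if name in record}


@dataclass
class DataContainer:
    """Holds the records, indexed by (entity name, key value)."""
    name: str
    records: dict = field(default_factory=dict)

    def save(self, entity, record):
        self.records[(entity.name, record[entity.key])] = record


# Hypothetical example: one entity, one record, one subset view.
product = Entity("Product", key="id", elements=["id", "name", "price"])
model = DataModel("Catalog")
model.add_entity(product)
container = DataContainer("Catalog")
record = {"id": "P-1", "name": "Widget", "price": "9.90"}
container.save(product, record)
subset_view = View(product, visible=["id", "name"])
print(subset_view.project(record))
```

The subset view hides the price element, which is how a view can restrict what a given user sees without changing the stored record.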
For this Use Case, we defined a Data Model, a Data Container and a View, each named after the ingestion table, jde812_m_route_code.
1) Data Container
2) Data Model
Inside the Data Model, we defined an Entity called jde812_m_route_code with several Business Elements aligned with the data ingested from the Excel file, including:
r) jde812_m_route_code_id (key)
As previously mentioned, when you define a view you can select which Business Elements will be visible in the Talend MDM Web User Interface. For this Use Case, we kept all the Business Elements visible.
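Because the Business Elements of the Entity have to line up with the columns of the file being ingested, it can help to check the file before importing it. The Python sketch below is one possible pre-import check, assuming the worksheet has been saved as CSV so that only the standard library is needed. Apart from jde812_m_route_code_id, the column names and the sample rows are hypothetical.

```python
import csv
import io


def validate_import(stream, expected_columns, key_column):
    """Return a list of problems found in a CSV export of the worksheet."""
    reader = csv.DictReader(stream)
    header = reader.fieldnames or []
    problems = []
    missing = [c for c in expected_columns if c not in header]
    extra = [c for c in header if c not in expected_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if key_column not in header:
        return problems  # cannot check keys without the key column
    seen = set()
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        key = (row[key_column] or "").strip()
        if not key:
            problems.append(f"line {line_no}: empty key")
        elif key in seen:
            problems.append(f"line {line_no}: duplicate key {key!r}")
        seen.add(key)
    return problems


# Hypothetical sample; only the key column name comes from the Use Case.
columns = ["jde812_m_route_code_id", "route_code", "description"]
sample = io.StringIO(
    "jde812_m_route_code_id,route_code,description\n"
    "1,R01,North\n"
    "2,R02,South\n"
    "2,R03,East\n"
)
problems = validate_import(sample, columns, "jde812_m_route_code_id")
print(problems)
```

An empty list means the header matches the expected Business Elements and every row has a unique, non-empty key.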
DEPLOYING OBJECTS TO THE MDM SERVER
After creating the objects in Studio, we need to deploy them to the Talend MDM Server. The steps needed to publish Studio objects to the MDM Server are:
1) Set up the connection from Studio to the server:
This is done in the Server Explorer tab at the bottom of the Studio interface.
2) Publish the objects into Talend MDM Server:
TALEND MDM WEB USER INTERFACE
After publishing the objects created in Talend Studio, we can import the model in the Talend MDM Web User Interface.
First, we log into the Web User Interface:
The three most important views on the Web User Interface are:
1) Welcome: this is the default page when you enter the Web UI
Important: on the right side of this view, we can already see the Data Container and Data Model uploaded to the server.
2) Master Data Browser: in this view we can view, delete and update all the records that belong to each Entity. We can also import and export records for the selected view (explained below).
Note 1 – The image above is blurred since it’s based on real data
IMPORT AN EXCEL FILE INTO TALEND MDM WEB USER INTERFACE
Once the objects have been created in Talend Studio, importing an Excel file is very simple:
a) On Master Data Browser, select Import
b) Browse the file to import:
c) Click Submit after selecting the file:
A success message is displayed.
Data is now available on the Web User Interface, and end users can create new records, update or delete existing ones:
Note 2 – The image above is blurred since it’s based on real data
3) Journal: on this view, we can track all the changes applied to a specific Data Model and/or Entity, filtered by date, operation type, source or key:
Note 3 – The image above is blurred since it’s based on real data
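The filtering offered by the Journal can be pictured with the short Python sketch below. It is only a conceptual illustration of filtering change entries by date, operation type, source or key; the JournalEntry fields and the sample entries are assumptions and do not reflect how Talend MDM stores its journal.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class JournalEntry:
    changed_on: date
    operation: str  # e.g. "CREATE", "UPDATE", "DELETE" (assumed labels)
    source: str
    entity: str
    key: str


def filter_journal(entries, *, start=None, end=None,
                   operation=None, source=None, key=None):
    """Return the entries matching every filter that was provided."""
    def matches(entry):
        return (
            (start is None or entry.changed_on >= start)
            and (end is None or entry.changed_on <= end)
            and (operation is None or entry.operation == operation)
            and (source is None or entry.source == source)
            and (key is None or entry.key == key)
        )
    return [entry for entry in entries if matches(entry)]


# Hypothetical journal content.
journal = [
    JournalEntry(date(2018, 1, 10), "CREATE", "import", "jde812_m_route_code", "1"),
    JournalEntry(date(2018, 1, 12), "UPDATE", "web-ui", "jde812_m_route_code", "1"),
    JournalEntry(date(2018, 2, 1), "DELETE", "web-ui", "jde812_m_route_code", "2"),
]
january_updates = filter_journal(
    journal, start=date(2018, 1, 1), end=date(2018, 1, 31), operation="UPDATE"
)
print(january_updates)
```

Combining filters in this way is what makes it practical to answer questions such as who changed a given key during a given period.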
POTENTIAL / ADVANTAGES OF USING TALEND MDM
(Master) Data is one of the pillars Organizations stand on to achieve success. Exploring tools that give them more control over their data is, or should be, among the priorities of every company. Our experience, working on a project with one of the major pharmaceutical companies in the world, is that the better we control the dispersed information that feeds management reports and dashboards, the better our chances of obtaining good insights from the information provided. Talend MDM is one tool available on the market that provides this type of control in several ways, such as:
1) Security: centralizing the data with Talend MDM tools gives us one central repository with controlled data;
2) Change logs: Talend MDM Web User Interface has a journal that records every change in data;
3) Maintenance: Talend MDM Web User Interface allows users to maintain data (create, update and delete) directly in the web interface, which is a more controlled environment for applying data changes;
4) Import/Export: Talend MDM Web User Interface has several connectors that allow information to be imported from and exported to multiple applications.