23 January 2018

Amazon EMR

1 – What is Amazon EMR?

Amazon EMR (Elastic MapReduce) is a cloud-based Big Data service, delivered as Software as a Service (SaaS) with storage in Amazon's cloud, that makes it easy and economical to process huge amounts of data.
It is built on Apache Hadoop and runs on the scalable virtualization infrastructure of Amazon Elastic Compute Cloud (Amazon EC2).

EMR can start/stop nodes and increase/decrease the number of processing nodes, providing the elasticity needed to follow increases or decreases in the data volume being processed; this way, Amazon invoices only the actual processing time.

The Big Data concept is all about collecting, storing, processing and visualizing huge volumes of data (structured, semi-structured and unstructured) so that organizations can extract the most benefit from this data and add value to their business as quickly as possible to support decision-making.

The main operational challenges in analytical platforms include the setup, operational management and dynamic allocation of processing capacity to accommodate variable data loads, as well as data aggregations from several sources and distinct types.

Apache Hadoop and its tool ecosystem help overcome these constraints thanks to Hadoop's capacity for horizontal scaling to accommodate growing data volumes.

AWS EMR can be configured in minutes and, with no further complexity, the system is ready to use.

AWS was considered a leader in IaaS by Gartner in June 2017.

                                          Gartner Magic Quadrant for IaaS - 2017

2 – Main advantages of AWS EMR

2.1 Elasticity
2.2 Low Cost
2.3 Scalability
2.4 Security
2.5 Easy to configure/use

2.1 Elasticity

• Provision as much capacity as needed at any given moment, in both CPU and memory.
• CPU/memory capacity can be added or removed at any time.


2.2 Low Cost

  • Low cost for each hour of usage/processing (check https://aws.amazon.com/emr/pricing/).
  • Integration with the Amazon EC2 purchase options, choosing the most advantageous in terms of cost:
    • On-demand Instance – Computation capacity paid by the hour, with no upfront payment;
    • Reserved Instance – Payment of an initial (upfront) value in exchange for a discounted hourly rate on computation time;
    • Spot Instance – Possibility of bidding for EC2 computation capacity not used by other customers, obtaining an hourly rate much lower than the on-demand one.
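The cost trade-off between the three purchase options can be sketched with simple arithmetic. The hourly rates and upfront value below are illustrative assumptions only, not real AWS prices; always check the EMR pricing page for current values.

```python
# Compare the three EC2 purchase options over one month of processing.
# All rates are hypothetical placeholders, not real AWS prices.

HOURS_PER_MONTH = 720

def monthly_cost(hourly_rate, hours, upfront=0.0):
    """Total cost for the period: optional upfront fee plus hourly usage."""
    return upfront + hourly_rate * hours

# Assumed example rates for a single instance:
on_demand = monthly_cost(hourly_rate=0.266, hours=HOURS_PER_MONTH)
reserved  = monthly_cost(hourly_rate=0.100, hours=HOURS_PER_MONTH, upfront=100.0)
spot      = monthly_cost(hourly_rate=0.080, hours=HOURS_PER_MONTH)

print(f"on-demand: ${on_demand:.2f}")
print(f"reserved : ${reserved:.2f}")
print(f"spot     : ${spot:.2f}")
```

With these assumed rates, spot comes out cheapest and on-demand most expensive; the reserved option only pays off when the cluster runs enough hours to amortize the upfront value.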


2.3 Scalability

The server type (CPU/memory) defined for the EMR cluster can be changed at any time, which makes it easy to scale the infrastructure to what is necessary for data processing at any given moment.
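Resizing can also be requested programmatically. The sketch below builds the payload for boto3's `modify_instance_groups` call, which asks EMR to grow or shrink one instance group of a running cluster; the cluster and instance-group IDs are hypothetical placeholders.

```python
# Sketch: resize one instance group of a running EMR cluster.
# IDs below are hypothetical placeholders.

def build_resize_request(cluster_id, instance_group_id, new_count):
    """Build the ModifyInstanceGroups payload for boto3's EMR client."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id,
             "InstanceCount": new_count}
        ],
    }

request = build_resize_request("j-EXAMPLE12345", "ig-CORE67890", new_count=8)

# With AWS credentials configured, the request would be sent like this:
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# emr.modify_instance_groups(**request)
```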

2.4 Security

Besides AWS firewalls, all communications between EMR, S3 or other Amazon databases, as well as with external systems, are done using SSL/TLS encryption in transit, which provides better security during data transfer.

The EMR cluster starts with different security groups for the Master Node and the Slave Nodes. The Master Node has a communication port for the EMR service and an SSH port to the EC2 slave instances, which use the security key defined when the cluster is created.

The Slave Nodes interact with the Master Node only over SSH, and Amazon AWS security groups can be created/configured to hold different access rules.
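The rule described above can be sketched as an EC2 ingress permission that accepts SSH (TCP/22) only from the master node's security group. The group IDs are hypothetical; in practice EMR creates equivalent managed security groups automatically.

```python
# Sketch: ingress rule letting slave nodes accept SSH only from the
# master node's security group. Group IDs are hypothetical placeholders.

def ssh_from_group_rule(source_group_id):
    """Ingress permission allowing TCP/22 only from another security group."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }

rule = ssh_from_group_rule("sg-master-example")

# With boto3 and valid credentials this could be applied as:
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(
#     GroupId="sg-slave-example", IpPermissions=[rule])
```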

2.5 Easy to configure/use

The setup of an EMR cluster takes on average 10 minutes: just select the server type for the cluster, define the security groups that will apply and choose the instance type (on-demand, reserved, spot), and the setup is finished.
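Those same choices map directly onto the parameters of boto3's `run_job_flow` call. The sketch below shows the request; the cluster name, key pair, S3 log bucket and release label are placeholder assumptions.

```python
# Sketch: minimal cluster-creation request for boto3's run_job_flow.
# Names, bucket, key pair and release label are hypothetical placeholders.

cluster_request = {
    "Name": "example-emr-cluster",
    "ReleaseLabel": "emr-5.11.0",           # EMR/Hadoop distribution version
    "Applications": [{"Name": "Hive"}, {"Name": "Pig"}, {"Name": "Hue"}],
    "LogUri": "s3://example-bucket/emr-logs/",
    "Instances": {
        "MasterInstanceType": "m4.large",   # server type for the master
        "SlaveInstanceType": "m4.large",    # server type for core/task nodes
        "InstanceCount": 3,                 # one master + two core nodes
        "Ec2KeyName": "example-key",        # SSH key defined at creation
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With AWS credentials configured:
# import boto3
# emr = boto3.client("emr")
# response = emr.run_job_flow(**cluster_request)
# print(response["JobFlowId"])
```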

3 – Flexible data storage in Amazon EMR

EMR allows separating storage capacity from processing capacity, which enables parallel processing across the several nodes the EMR cluster contains; this is not possible on premises or with Hadoop installed directly on EC2.

In the image below, we have the main tools to data storage that are interconnected with EMR.

                                              Data storage tools for EMR

4 – Amazon EMR role in Amazon data pipeline

EMR has a key role in the Amazon data pipeline: its main purpose is to process the inbound data for the Amazon infrastructure so that, after pre-processing and aggregation, the data can flow to relational databases, S3 or Amazon Redshift.

                                                EMR role in AWS pipeline

5 – Amazon EMR Architecture

                                                      EMR Architecture



    • EMR consists of one Master Node and one or more Slave Nodes
      • Master Node
        • Currently, EMR does not support automatic failover of the master node or its recovery in case of failure.
        • If the Master Node becomes unavailable, the cluster will be terminated and the job that is being executed will need to be reprocessed.
      • Slave Nodes – Core nodes and Task nodes
        • Core nodes
          • Core nodes persist data using the Hadoop Distributed File System (HDFS) and execute Hadoop tasks.
          • The number of core nodes can be increased/decreased in an existing or running cluster.
        • Task nodes
          • Only execute Hadoop tasks.
          • The number of task nodes can be increased/decreased in an existing or running cluster.


EMR is highly tolerant to failures in slave node process executions: if a slave node in the cluster fails, processing continues on the remaining nodes. However, EMR does not currently provision a replacement node when a slave node fails.

EMR supports bootstrap actions, which allow users to run customized setup before the cluster's scheduled execution. They can be used for any setup needed before the regular processes run, mainly to fetch execution parameters before a scheduled process starts.
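A bootstrap action is declared as part of the cluster-creation request. The sketch below shows the structure of the `BootstrapActions` parameter; the script path and arguments are hypothetical placeholders.

```python
# Sketch: bootstrap action configuration for an EMR cluster.
# The S3 script path and its arguments are hypothetical placeholders.
# EMR runs the script on every node before Hadoop starts.

bootstrap_actions = [
    {
        "Name": "fetch-execution-parameters",
        "ScriptBootstrapAction": {
            "Path": "s3://example-bucket/bootstrap/setup.sh",
            "Args": ["--env", "production"],
        },
    }
]

# This list would be passed as the BootstrapActions parameter of
# boto3's run_job_flow call when the cluster is created.
```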

6 – Supported Apache Hadoop Tools

EMR supports several tools from the Apache Hadoop framework; when the cluster is created, it is possible to select the tools we want to make available in the cluster.

                                                EMR cluster creation – Tool selection


6.1 – Hive

Hive is an open source data warehouse and analytics package that runs on top of Hadoop. Operations are written in HiveQL, a SQL-based language that gives users an easier way to interact with the data. Its focus is data analysis.

6.2 – Pig

Pig is an open source analytical tool that executes on top of Hadoop. Its programming language is Pig Latin, a SQL-like language that allows the needed structuring/aggregation of the data. Its focus is processing data from complex and unstructured sources such as text documents or log files.

6.3 – Hbase

HBase delivers a more efficient way to store huge volumes of sparse data, using columnar data storage. HBase allows quick lookups because frequently accessed data can be served from memory instead of disk. It is optimized for sequential write operations and is highly efficient at bulk inserts, updates and deletes.

6.4 – Impala

Impala is a tool from the Hadoop ecosystem for interactive queries using SQL syntax.

It uses a massively parallel processing (MPP) analytical engine, similar to those found in traditional RDBMS databases. This allows Impala to perform low-latency analytical tasks over the data, using BI tools through ODBC/JDBC connections.

6.5 – Presto

Presto is a distributed open source SQL engine for running interactive analytical queries against data sources of any size, from gigabytes up to petabytes.

6.6 – Hue

Hadoop User Experience (Hue) is a web interface for Hadoop that facilitates developing and executing Hive queries, managing files in HDFS, developing and executing Pig scripts, and managing the data model (databases, schemas, tables). It also allows executing SQL over files stored in S3.

This way, using S3 in place of HDFS, the user can access raw data in the exact same format in which the EMR system receives it from the source system.

Using Hue, it is possible to grant advanced users with data analysis requirements access to the raw data after the aggregations made by EMR.

That access is granted without compromising the performance of the data warehouse database; data scientists are a traditional group with this kind of need.

Besides that, it also allows data consolidation in EMR jobs to be expressed in SQL instead of MapReduce programs written in a general-purpose language.
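An external table over raw S3 files can also be created as an EMR step, for example by submitting a HiveQL DDL statement. The sketch below builds such a step; the bucket, table and column names are hypothetical, and dropping the external table would remove only the metadata, not the S3 data.

```python
# Sketch: a Hive step that creates an external table over raw files in S3.
# Bucket, table and column names are hypothetical placeholders.

hive_ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS raw_events (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://example-bucket/raw/events/';
"""

hive_step = {
    "Name": "create-external-table",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["hive-script", "--run-hive-script",
                 "--args", "-e", hive_ddl],
    },
}

# With AWS credentials configured, the step could be submitted as:
# import boto3
# emr = boto3.client("emr")
# emr.add_job_flow_steps(JobFlowId="j-EXAMPLE12345", Steps=[hive_step])
```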


                                        External table creation over AWS S3


In summary, EMR is one of the options for data consolidation and persistence; used together with S3, it makes it easy to build a robust data lake with excellent performance, even for near-real-time processing, thanks to its scalability.

Besides that, its cost is significantly lower than that of on-premises Hadoop or plain EC2-based options, which is currently leading more companies to move to EMR.

The ease of setup and the resulting time to market are also very important topics, mainly for IT decision makers, who see EMR as an excellent option to quickly deliver value to their internal business customers.