26 September 2017

What is Big Data?

When we first started talking about Big Data, it was seen as a fashionable technological term that would be talked about only for a little while, until the next great technological success arrived. That was not the case: today, many technological buzzwords have Big Data as their driving force.


Only about 60 years passed between the creation of the first computers in the 1940s and the first release of Hadoop. This period defined what Business Intelligence is: the first personal computer appeared, then Lotus 1-2-3, followed by Excel, with which people began to improve their analyses and collect their own data. Soon afterwards, in the 1990s, dashboards and advanced analytics appeared, and BI tools gave easy, presentable access to information.

Then the Internet happened and everything changed.

The number of computers and internet users has increased, and technology was created that can capture information from the physical world we live in and convert it into digital data (IoT). We have generated amounts of information never seen before: for example, from 16 million internet users in 1995, we moved to 3.8 billion in 2017. We are constantly generating data, whether we carry GPS-equipped smartphones, communicate with our friends through social networks or make purchases. Everything we do leaves more and more of a digital trail. Machines and factories all over the world are increasingly equipped with sensors that gather and transmit data. An airplane engine on a trip from London to Singapore can generate up to 1 PB of sensor data, and an airplane on such a route can have four engines.
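To put the internet user figures above in perspective, here is a small back-of-the-envelope calculation. It uses only the two numbers quoted in the text; the variable names and the assumption of steady compound growth are ours, for illustration only.

```python
# Internet user figures quoted in the text.
users_1995 = 16_000_000
users_2017 = 3_800_000_000
years = 2017 - 1995

# How many times larger the user base became.
growth_factor = users_2017 / users_1995

# Average yearly growth, assuming (as a simplification) steady compound growth.
annual_growth = growth_factor ** (1 / years) - 1

print(f"{growth_factor:.1f}x more users in {years} years")
print(f"about {annual_growth:.0%} growth per year on average")
```

In other words, the user base grew more than two hundredfold, which corresponds to roughly 28% growth per year sustained over two decades.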

Today we live in an "age" in which, every half year, we generate as much data as humanity had created in all of its previous history.

What, then, is Big Data? The term refers both to the absurdly large (Big) amount of data (Data) and to an evolving set of technologies that gives access to information in ways that were not possible even a few years ago.

What can Big Data do?

Big Data's applications go far beyond customer experience: for example, it can reduce costs, streamline processes, forecast maintenance needs or increase security in IT infrastructure.

Having access to key information, such as market trends, before our competitors do can determine success or failure in the corporate world. This is where the secret of success in working with Big Data lies.

According to a recent IDC study, the Big Data market is expected to grow 600% more than the IT market as a whole by 2018.

More and more data is generated in the form of images and videos, from satellite images to photos uploaded to Facebook or Twitter, as well as email, instant messaging and recorded phone calls. This unstructured data cannot easily be put into structured tables with rows and columns, but it still needs to be understood. For that reason, Big Data projects often use cutting-edge analyses involving artificial intelligence and machine learning, for image recognition or natural language processing for example, to learn to detect patterns much faster and more reliably than humans. Several technologies have already been developed and many others are on the way: dealing with the growing need to interpret existing data is a constant process of technological evolution.

The Pão de Açúcar Group started using data analysis tools in 2015 to retain its customers. The system identifies former customers who have stopped going to its stores and then analyses the preferences of each one. This allows the company to run customized campaigns, offering special and distinct promotions to each customer and thus encouraging them to return to its stores.
Each marketing action should be accompanied by social media monitoring tools because, if a campaign does not have the expected effect or, worse, generates negative feedback, that failure must be detected quickly so that the company can take corrective measures.
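The retention idea described above can be sketched in a few lines. This is only an illustration and not the Group's actual system: the purchase records, the `lapsed_customers` function and the 90-day threshold are all hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical purchase records: (customer_id, purchase_date, product_category).
PURCHASES = [
    ("ana", date(2017, 1, 10), "wine"),
    ("ana", date(2017, 2, 3), "wine"),
    ("ana", date(2017, 2, 20), "cheese"),
    ("rui", date(2017, 8, 30), "coffee"),
    ("rui", date(2017, 9, 12), "coffee"),
]

def lapsed_customers(purchases, today, max_idle_days=90):
    """Return {customer_id: favourite_category} for customers idle longer than max_idle_days."""
    last_seen = {}
    categories = {}
    for customer, day, category in purchases:
        # Track the most recent purchase date per customer.
        last_seen[customer] = max(day, last_seen.get(customer, day))
        # Count purchases per category to estimate each customer's preference.
        categories.setdefault(customer, Counter())[category] += 1
    return {
        customer: categories[customer].most_common(1)[0][0]
        for customer, day in last_seen.items()
        if (today - day).days > max_idle_days
    }

if __name__ == "__main__":
    print(lapsed_customers(PURCHASES, today=date(2017, 9, 26)))
```

With this sample data only "ana" is flagged as lapsed, and her most frequent category ("wine") is what a customized promotion could target.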

Monitoring the behaviour of a population on social networks, combined with field survey data and statistical analyses, can for example help anticipate a possible outbreak of an epidemic, giving health institutions time to adjust to an increased demand for medical help or medicines.

Why now?

A Google paper published in 2003 ("Google File System") was the genesis of Hadoop. The first release of the software came in 2006 and, two years later, Yahoo was already loading 10 TB of data per day into its clusters. People and companies believed in the project: companies like Facebook, LinkedIn, eBay and IBM have contributed, and still contribute, thousands of lines of code. Today, for example, Yahoo's cluster has 42,000 nodes and hundreds of PB of storage.

Hadoop is a software framework that enables distributed processing of large volumes of data across multiple computers using simple programming models. It scales easily, with each machine contributing storage space and computing capacity. It was designed to detect and handle machines that fail in the cluster and to guarantee high availability, all at a lower cost than current architectures.
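The best-known of these simple programming models is MapReduce. As a rough, single-machine sketch (plain Python, not the Hadoop API; the function names and sample lines are made up for illustration), a word count splits into a map step, a shuffle step that groups values by key, and a reduce step:

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group all values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # In a real cluster, each line (or block of lines) could be mapped on a different machine.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]
    print(word_count(sample))
```

Because each map call looks at one line only and each reduce call at one key only, the work can be spread over many machines without the steps needing to know about each other, which is what makes the model easy to scale.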

In addition to being open source, it is supported by dozens of large companies and thousands of programmers worldwide who contribute to the development of the project and the emergence of new technologies.

With this storage and processing capability, a company that previously could not even load a full year of events for lack of capacity can now keep years of history and still access the information.

How does it affect me?

When we enter the Big Data ecosystem, we find a set of technologies and not a single product. It is made up of several open source components, each developed for a specific purpose, that together allow the ecosystem to work. It is not an ERP and it is not a Data Warehouse. These are technologies that can complement existing systems or break ground for new ideas and systems.

Comparing Big Data with Business Intelligence, we can see that they are not identical, but that they complement each other:

Business Intelligence (BI)
• It is oriented to the collection, transformation and availability of structured data;
• Analyses what already exists;
• Ideal for when you already know the variables (dimensions) for the questions;
• It is more specific;
• Typically reflected in creating a Data Warehouse.

Big Data
• It is focused on the processing of structured and unstructured data, as well as on the correlations and discoveries that may arise from this processing;
• Analyses what already exists and what is to come, discovering new paths;
• Ideal for exploring new possibilities, discovering new patterns and exploring questions that have not been asked yet;
• Broader, geared not only to business but also to any other area/segment.

We can use this technology for storage purposes only, with the advantage that it is cheaper and more accessible, so that no information has to be discarded. We can use it as a staging area for a Data Warehouse. We can use it to process events in real time or to create statistical models with a large amount of data or variables. The coexistence of these two worlds is the way forward.


This "Big Data" area will allow Data Scientists to dive into the data, look for patterns and create models.

One way to achieve this is a logical data warehouse approach, consisting of an enterprise data warehouse and a Big Data component, with an analytical layer that makes it easier to analyse data across the whole architecture and answer the questions being asked.

The right questions matter. Many Big Data projects fail because the end result made little difference compared with what already existed. Without the right questions, the desired knowledge is not achieved.


Big Data is not a new system or product created to replace something that already exists and is well established. It is a technological evolution: a set of tools which, besides allowing access to information as never before, is open source.

In my opinion, it is another path that we can follow, and one that will shape the years to come. More than replacing systems, Big Data will complement those that already exist.

The amount of data available will only increase, and analytical technology will become increasingly capable. So, if Big Data is capable of all this today, imagine what it will be capable of tomorrow.





Pedro Duran
Business Development