What is Big Data? There is a lot of confusion about what Big Data is nowadays. Different people define it differently, and this has held back Big Data projects, because we first need to understand what it is, what it is good for and when it should be used. Here we describe our view. Starting with some facts*:
90% of the data we currently have was created in the last 2 years;
80% of the world’s data is unstructured (Facebook/Twitter posts, forums, reviews, news, etc.);
Only 20% of available data can be processed by traditional systems.
The main Big Data characteristics are the 4 V’s:
It’s made for big Volumes of data;
Created at a high Velocity;
Collected from a high Variety of sources and formats;
When you analyze all the data you have, instead of only a part of it, you get a higher degree of Veracity in your conclusions.
What is Hadoop? All the components in computers have improved a lot in the last 10-20 years (CPU speeds, RAM, disk capacity, network speeds) except for one thing: disk latency.
Based on this, Doug Cutting (who later joined Yahoo) and Mike Cafarella started working on a model to process data in parallel.
The idea is to have several computers processing and holding data instead of having just one supercomputer doing all the work. Hadoop is an Apache open source software framework for reliable, scalable, distributed computing over massive amounts of data. It consists of 3 sub projects:
Hadoop MapReduce;
Hadoop Distributed File System, a.k.a. HDFS;
Hadoop Common.
MapReduce is how Hadoop understands and assigns work to the nodes (machines). Unlike standard relational databases, here it is the program that is sent to the data, not the other way around. Data is processed and aggregated on the Workers, and the results are sent back to the Master when they are ready.
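The flow described above can be sketched in a few lines of plain Python. This is only a single-process illustration of the idea, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample chunks are hypothetical, and in a real cluster each chunk would be processed on the node that already holds it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Worker side: emit a (word, 1) pair for every word in the local chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_pairs):
    """Group all intermediate values by key so each key is reduced once."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Aggregate the values collected for one key."""
    return key, sum(values)


def run_job(chunks):
    """Master side (simulated): run map on every chunk, then shuffle and reduce."""
    mapped = []
    for chunk in chunks:  # hypothetical stand-in for "send the program to the data"
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two chunks standing in for data stored on two different workers.
chunks = ["big data is big", "data is everywhere"]
counts = run_job(chunks)
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

Only the small aggregated result travels back to the caller; the raw chunks never leave the function that processes them, which mirrors the "program goes to the data" principle.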
HDFS is a distributed, scalable, fault-tolerant, high-throughput file system in which files are split into blocks and replicas of those blocks are distributed among the nodes.
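To make the block-and-replica idea concrete, here is a minimal Python sketch. It rests on several assumptions: the round-robin placement, the node names and the tiny 4-byte block size are purely illustrative, and the helpers `split_into_blocks` and `place_replicas` are hypothetical. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (an assumption)."""
    if replication > len(nodes):
        raise ValueError("cannot place more replicas than there are nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


nodes = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)

# Fault tolerance: if one node fails, every block still has a live replica.
failed = "node3"
survivors = {b: [n for n in holders if n != failed] for b, holders in placement.items()}
print(all(survivors.values()))
```

Because every block lives on several machines, losing a single node does not lose data, and different blocks of the same file can be read from different nodes in parallel.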
Data can be created, deleted or copied but not updated.
Hadoop Common contains common utilities and libraries that support the other Hadoop sub projects.
What is the IBM solution for Big Data? The IBM vision for Big Data is built on four pillars:
BigInsights - Hadoop
BigInsights brings the power of Hadoop to the enterprise and enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research.
Stream Computing is an extreme evolution in data analytics. It is used to analyze data in motion, taking advantage of the speed of RAM and avoiding retrieval of the data from disk.
Programming in MapReduce, or even in one of the main subprojects like Pig, can be tricky as it requires special skills. Since everybody is already used to SQL, and data warehousing augmentation is the leading Hadoop use case, IBM has created BigSQL to make it easier to interact with Hadoop.
Data Explorer enables organizations to unlock, navigate and visualize Big Data from high value sources. It combines enterprise content and unique Big Data sources such as log files, tweets, geospatial coordinates, imagery, etc., and allows the creation of applications that provide a 360° view of any topic (customers, products, employees, projects, etc.).
BigInsights Text Analytics is a powerful information extraction system that gives you a metrics-based understanding of facts from unstructured text. It can also perform sentiment and social media analytics that help companies understand how their products are being perceived in the market.
*Source: GigaOM, Software Group, IBM Institute for Business Value.