27 October 2017

Apache Kafka – Introduction to a Streaming Platform

1. Apache Kafka – What is it?

Apache Kafka is a distributed streaming platform, similar to a message queue or an enterprise messaging system. It was developed to provide real-time data feeds with low latency and high throughput, with the main purpose of building streaming data pipelines that move data between systems. It runs as a cluster on one or more servers and stores streams of records in categories called topics.

2. The Beginning

Kafka was born at LinkedIn and became an open-source Apache project in 2011. It initially took form because LinkedIn engineers needed a way to stream messages at large scale, so they developed the foundation that later gave birth to Apache Kafka.

The team at LinkedIn needed to completely redesign its infrastructure to accommodate the growing membership and the increasing complexity of the site. The first approach was to migrate from a monolithic application infrastructure to one based on microservices, developing different custom data pipelines for the various streaming and queuing needs. They had various use cases across the platform, for example tracking site events like page views, gathering aggregated logs from other services, or providing queuing for the InMail messaging system. These pipelines needed to scale along with the site, so rather than maintaining and scaling each one individually, LinkedIn invested in the development of a single, distributed publish-subscribe platform, and with this, Apache Kafka was born.

3. Some Use Cases

• Website Activity Tracking

A way to track all the activity on a website: every action a user takes, for example a page view or a search, is published to a set of central topics, with one topic per activity type. This involves a high volume of data because of the number of messages generated for each user page view.

• Messaging

Kafka has better throughput, built-in partitioning, replication and fault tolerance than most messaging systems, which makes it a good solution for large-scale message processing applications.

• Log Aggregation

Kafka is a strong choice when we need to collect logs from multiple services across an organization and make them available in a standard format to multiple consumers.

• Stream Processing

Kafka offers strong durability, and in terms of processing it reads data from a topic in a raw state and transforms or processes it, later writing it to new topics for further consumption or follow-up processing (see the sketch after this list).

• Metrics

Kafka can be used for operational monitoring data, meaning that it can aggregate statistics from distributed applications to produce feeds of operational data or alerts.
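To make the stream processing use case concrete, here is a minimal sketch using the Kafka Streams library: it reads raw records from one topic, transforms them, and writes the result to a new topic for follow-up consumers. The topic names raw-events and processed-events, the application id, and the broker address are assumptions made for the example, and the transformation (upper-casing the value) is just a stand-in for real processing logic:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class EnrichEvents {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-enricher"); // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Read raw records, transform each value, and publish the
            // result to a new topic for further consumption.
            builder.stream("raw-events")
                   .mapValues(value -> value.toString().trim().toUpperCase())
                   .to("processed-events");

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
        }
    }

The same read-transform-write pattern can also be built by hand with a plain consumer and producer; Kafka Streams simply packages it as a library.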

4. Architecture and some Keywords

Kafka consists of Topics, Brokers, Partitions, Leaders and Replicas; these are the main keywords in the way Kafka works.

A Topic is a category or feed name to which records are published. Every topic holds different information from the others, and Kafka maintains order within a topic but not between topics. Topics are divided into a number of partitions, which allow a topic to be parallelized by splitting its data across multiple brokers.
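As a small sketch of how topics and partitions are declared, the Java AdminClient can create a topic with a chosen number of partitions and a replication factor. The broker address and the topic name page-views are assumptions made for the example:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // "page-views" is split into 3 partitions, each stored on 2 brokers.
                NewTopic topic = new NewTopic("page-views", 3, (short) 2);
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }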

A Broker, on the other hand, has the responsibility of managing and replicating the messages that enter the cluster. Each broker holds a number of partitions, and each of these partitions can be either a leader or a replica for a topic. Writes and reads go through the leader, which coordinates updating the replicas with the new data; consequently, if a leader fails, a replica takes over as the new leader.

Then we have the major APIs used to communicate with Apache Kafka:

• Producer
Has the job of pushing messages into Kafka topics, allowing applications to publish streams of records. It writes to a single leader, providing load-balanced production so that each write can be serviced by a separate broker and machine (see the producer sketch after this list).

• Consumer
Pulls messages off a Kafka topic, allowing applications to subscribe to topics and process the stream of records. It reads from any single partition, making it possible to scale the throughput of message consumption in a similar fashion to message production. Consumers can also be organized into consumer groups for a given topic: each consumer within the group reads from a unique partition, and the group as a whole consumes all messages from the entire topic (see the consumer sketch after this list).
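Below are minimal sketches of both APIs in Java. The broker address, topic name and group id are illustrative assumptions, not fixed values. First, a producer that publishes a record:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PageViewProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Records with the same key always land in the same partition,
                // so per-user ordering is preserved.
                producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /home"));
            }
        }
    }

And a consumer that joins a group and processes the stream; running several copies with the same group.id spreads the topic's partitions among them:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PageViewConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            // Consumers sharing a group.id divide the topic's partitions among them.
            props.put("group.id", "page-view-processors");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }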

Kafka gives certain guarantees about data consistency and availability when we are producing to one partition and consuming from one partition; they do not apply when reading from the same partition with more than one consumer or writing to the same partition with more than one producer. With these requirements in mind, Kafka guarantees:
- When messages are sent to a topic partition, they will be appended to the commit log in the order they are sent;
- A single consumer instance will see messages in the order they appear in the log;
- A message is committed when all in sync replicas have applied it to their log;
- No committed message will be lost, as long as at least one in sync replica is alive.
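On the producer side, these guarantees are tied to a few configuration properties. As a hedged sketch (the values shown are illustrative choices, not the only valid ones), the following settings could be added to the producer configuration from the sketch above:

    // Wait until all in sync replicas have applied the message
    // before the leader reports it as committed.
    props.put("acks", "all");
    // Retry transient send failures instead of dropping the message.
    props.put("retries", 3);
    // Allow only one in-flight request per connection, so retries
    // cannot reorder messages within a partition.
    props.put("max.in.flight.requests.per.connection", 1);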

• Writing

In the writing process, all messages are sent to the leader of a partition, and the leader is responsible for writing the message to its own log. Once the message has been committed there, the leader is also responsible for propagating it to the additional replicas on different brokers. Each replica acknowledges that it has received the message, at which point the replicas are said to be in sync; keep in mind, though, that leaders and replicas can fail.

• In Failure

If a replica fails, incoming messages will no longer reach it, and it will fall further and further out of sync with the leader.
If all replicas fail, we have two options. First, we can wait until the leader is back up before continuing; once it returns, it will begin receiving messages again and writing them to the replicas as they are brought back online, syncing the messages between them. The second approach is to elect the first broker that comes back up as the new leader; this broker will be out of sync with the previous leader, and all the data written between the moment it went down and the moment it was elected the new leader will be lost. By electing a new leader this way, messages may be dropped, but downtime is minimized, since any new machine can become the leader.
Finally, if a leader fails, the Kafka controller will detect the loss and elect a new leader from the pool of in sync replicas. This may take a few seconds, but no data loss will occur as long as producers and consumers handle the failure by retrying the connection.
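Both behaviors can be tuned through configuration. The server.properties excerpt below is a sketch with illustrative values, not universal recommendations:

    # Never elect an out-of-sync replica as leader: prefer downtime
    # over losing committed messages (set to true to minimize
    # downtime at the risk of data loss instead).
    unclean.leader.election.enable=false

    # A message only counts as committed once at least 2 in sync
    # replicas (including the leader) have applied it.
    min.insync.replicas=2

    # Spread each new topic's partitions across 3 brokers.
    default.replication.factor=3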

5. Conclusion

When choosing a streaming platform, Kafka is a strong and solid choice due to its capabilities for parallelism and for separating data consumers from data producers, which makes our architecture more flexible and adaptable to change. It allows a large number of permanent or ad-hoc consumers, is highly available and is resilient to node failures, supporting automatic recovery.

Kafka brings a solution for Big Data and the Internet of Things: it offers a way to process high volumes of data, making it a strong and highly recommended option as the messaging system for an organization's data pipelines.

Many well-known companies, starting with LinkedIn where it was created, run the Apache Kafka system in production.

Sílvio Moura
Junior Consultant