Apache Kafka’s Role in Big Data Streaming Analytics


The world of Big Data began as a way of storing and querying vastly larger amounts of information than earlier systems could handle. Much of the value in that data, however, lies in the real-time or streaming information available as it first enters the system; as data ages, it grows stale and less useful to many business systems.

Streaming analytics platforms have come a long way, with numerous open source projects such as Flink, Spark Streaming, Samza and Storm each at the forefront of the field in their respective strengths.

While each offers its own benefits, we will explore Apache Kafka as a real-time data pipeline and streaming framework (Kafka Intro, 2018).

Kafka provides three key streaming capabilities:

Publish and Subscribe to Streams of Records

Data is ingested into Kafka using a publish/subscribe (often shortened to pub/sub) model, similar to the message queuing systems found in traditional software architectures.

Large companies often employ enterprise-wide messaging systems that were set up before more recent thinking around message queues and publish/subscribe patterns emerged, so the model is a familiar one.
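
To make the model concrete, below is a minimal sketch of a publisher using the standard Kafka Java client. The broker address, the "page-views" topic name and the record contents are assumptions for illustration only.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PageViewPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Broker address is an assumption for illustration.
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish a record to the hypothetical "page-views" topic;
                // any subscribed consumer group will receive it.
                producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            }
        }
    }

Any number of consumer groups can subscribe to the same topic independently, which is what distinguishes pub/sub from a traditional queue, where each message is consumed once.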

Store Streams of Records in a Fault-Tolerant Way

Kafka organises data into Topics, which are written to server nodes called Kafka Brokers. Each topic is divided into partitions, and each partition is an ordered commit log to which records are continually appended (Jain, 2017). These partitions are replicated across a cluster of brokers in a distributed architecture.
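
As a sketch of how such a partitioned, replicated log is set up, the following uses the Java AdminClient to create a topic. The topic name, partition count and replication factor are illustrative assumptions; a replication factor of 3 requires at least three brokers.

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address

            try (AdminClient admin = AdminClient.create(props)) {
                // Three partitions spread the commit log across brokers;
                // replication factor 3 keeps a copy of each on three servers.
                NewTopic topic = new NewTopic("page-views", 3, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }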

Process Streams of Records as They Occur

As data enters the Kafka cluster, subscribed consumers pick up records immediately and push them downstream to dependent systems.

Each record comprises a key, a value and a timestamp, which together position it in a logical order.

Kafka guarantees that records a producer sends to a given partition are appended in the order they were sent. Moreover, consumers see records in the order they are stored in the log.
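
A minimal consumer sketch along these lines, again using the standard Java client with an assumed broker address, group id and topic name, shows each record's key, value and timestamp arriving in per-partition offset order.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PageViewConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address
            props.put("group.id", "analytics");               // assumed group name
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("page-views"));
                while (true) {
                    // Within each partition, records arrive in offset order,
                    // i.e. the order in which they were stored in the log.
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        System.out.printf("partition=%d offset=%d key=%s value=%s ts=%d%n",
                                r.partition(), r.offset(), r.key(), r.value(), r.timestamp());
                    }
                }
            }
        }
    }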

Fault tolerance is provided by data replication: with a replication factor of N, each record is copied to N servers, and the cluster tolerates up to N-1 server failures without loss or corruption of committed data.
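
To illustrate how a producer can lean on that replication guarantee, the sketch below sets standard client options so the broker acknowledges a write only after all in-sync replicas have stored it. The broker address and serialisers are assumptions carried over from the earlier sketches.

    import java.util.Properties;

    public class DurableProducerConfig {
        public static Properties durableProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            // "all" makes the leader wait for every in-sync replica before
            // acknowledging, so a committed record survives broker failures.
            props.put("acks", "all");
            // Retry transient failures without duplicating or reordering records.
            props.put("enable.idempotence", "true");
            return props;
        }
    }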

Real-time analytics has become a common way of delivering advertisements, tracking abnormal behaviour, generating popularity recommendations and ranking search relevance, to name but a few uses (Edureka!, 2015).

Kafka's architecture is simple: all input systems feed data into producers, which normalise the information and pass it on to the Kafka cluster; consumers subscribe to topics and are sent the data immediately (near real-time).

[Figure: Kafka cluster architecture (TutorialsPoint.com, n.d.)]

An additional component, Apache ZooKeeper (also used across the Hadoop ecosystem), manages and coordinates all cluster nodes. It is primarily used to notify nodes about the status of other nodes and any faults that may have occurred (Cloudurable, 2017).

Being able to act on data in its freshest form, as it is captured and first enters the system, is crucial to triggering the targeted events that drive business value.

References

Kafka Intro (2018) Apache Kafka: Introduction [Online] Kafka.apache.org, Available from: https://kafka.apache.org/intro

Jain, A. (2017) Using Apache Kafka for fault tolerant systems [Online] Medium.com, Available from: https://medium.com/@arpitjay099/using-apache-kafka-for-fault-tolerant-systems-61814e6fab23

Edureka! (2015) Fault Tolerance with Kafka [Online] SlideShare.net, Available from: https://www.slideshare.net/EdurekaIN/fault-tolerance-with-kafka-56429907

TutorialsPoint.com (n.d.) Apache Kafka – Cluster Architecture [Online] TutorialsPoint.com, Available from: https://www.tutorialspoint.com/apache_kafka/apache_kafka_cluster_architecture.htm

Cloudurable (2017) Kafka Architecture: Kafka Zookeeper Coordination [Online] Cloudurable.com, Available from: http://cloudurable.com/blog/kafka-architecture/index.html