Apache Kafka’s Role in Big Data Streaming Analytics

Andrew • Apr 17, 2020 • Big Data


Big Data began as a way of storing and querying volumes of information far beyond what earlier systems could handle. However, much of the value in that data lies in the real-time, streaming information available as it first enters the system. As data ages, it becomes stale and less useful to many business systems.

Streaming analytics platforms have come a long way, and several open source projects, such as Flink, Spark Streaming, Samza and Storm, are at the forefront of the field, each with its own strengths.

While each has its benefits, we will explore Apache Kafka as a real-time data pipeline and streaming framework (Kafka Intro, 2018).

Kafka provides three key capabilities:

Publish and Subscribe to Streams of Records

Data is ingested into Kafka using a Publish/Subscribe model (often shortened to PubSub). This is similar to how message queuing systems work in traditional software architectures.

Large companies often run enterprise-wide messaging systems that were set up before more recent thinking around message queues and publish/subscribe emerged.
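The publish/subscribe idea can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's client API: the InMemoryBroker class, its publish and poll methods, and the "page-views" topic are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a broker: one append-only list of records per topic."""

    def __init__(self):
        self._topics = defaultdict(list)    # topic name -> list of records
        self._positions = defaultdict(int)  # (subscriber, topic) -> next index to read

    def publish(self, topic, record):
        """Append a record to the topic; publishers know nothing about subscribers."""
        self._topics[topic].append(record)

    def poll(self, subscriber, topic):
        """Return the records this subscriber has not seen yet and advance its position."""
        start = self._positions[(subscriber, topic)]
        records = self._topics[topic][start:]
        self._positions[(subscriber, topic)] = len(self._topics[topic])
        return records


broker = InMemoryBroker()
broker.publish("page-views", "user-1 viewed /home")
broker.publish("page-views", "user-2 viewed /pricing")

# Two independent subscribers each receive every record on the topic.
billing_view = broker.poll("billing", "page-views")
analytics_view = broker.poll("analytics", "page-views")
```

Because each subscriber tracks its own read position, adding a new subscriber does not affect the publisher or the other subscribers, which is the main appeal of the model.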

Store Streams of Records in a Fault-Tolerant way

Producers write data to Topics, which are hosted on Kafka Brokers. For each topic, the cluster maintains a partitioned log, where each partition is an ordered sequence of records that are continually appended to a commit log (Jain, 2017). These partitions are distributed and replicated across the servers in the cluster.
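A partitioned, append-only log can be modelled roughly as follows. This is a simplified sketch under stated assumptions: the PartitionedLog class is hypothetical, and zlib.crc32 is used as the hash purely for illustration (Kafka's default partitioner for keyed records uses a different hash, murmur2).

```python
import zlib


class PartitionedLog:
    """Toy topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption for illustration: crc32 of the key, modulo the partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append to the key's partition and return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset


log = PartitionedLog(num_partitions=3)
# Records with the same key always land in the same partition,
# so their relative order is preserved by increasing offsets.
placements = [log.append("sensor-a", reading) for reading in (20.1, 20.4, 20.9)]
```

Records are only ever appended, never modified in place, so an offset within a partition identifies a record permanently.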

Process Streams of Records as they occur

As data enters the Kafka cluster, listening Consumers pick up records immediately, process them, and push the results down the line to dependent systems.
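Processing records as they occur, rather than in a later batch, might look something like the generator below. It is a minimal sketch in plain Python with no Kafka client involved; process_stream and the sample event names are hypothetical.

```python
from collections import Counter


def process_stream(records):
    """Yield an updated running count per key as each record arrives."""
    counts = Counter()
    for key, _value in records:
        counts[key] += 1
        # Emit a result immediately instead of waiting for the stream to end.
        yield key, counts[key]


# Stand-in for records arriving from a topic, in arrival order.
incoming = [("checkout", 19.99), ("search", "kafka"), ("checkout", 5.00)]
downstream = list(process_stream(incoming))
```

Each input record produces an output straight away, so a dependent system sees the running total for "checkout" move from 1 to 2 as the events arrive.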

Each record is formatted as a key, a value and a timestamp; the timestamp helps place the record in a logical order.
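As a rough illustration of that record shape, the sketch below models a record as a small immutable Python type. The Record class and its field names are hypothetical, and string keys and values are a simplification (Kafka itself transports keys and values as bytes).

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """Illustrative record: key, value and a millisecond timestamp."""
    key: str
    value: str
    # Defaults to the creation time when no timestamp is supplied.
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


records = [
    Record("user-2", "logout", timestamp_ms=2000),
    Record("user-1", "login", timestamp_ms=1000),
]
# The timestamp lets records be arranged by when they happened.
by_time = sorted(records, key=lambda r: r.timestamp_ms)
```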

Kafka guarantees that records sent by a producer to a particular topic partition are appended in the order they were sent. Moreover, consumers see records in the order they are stored in the log.

For fault tolerance, each partition is replicated across servers: a topic with a replication factor of N can tolerate up to N-1 server failures without losing any records committed to the log.
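The N-1 rule can be checked with a small worked example. This is a simplified model that assumes every replica holds a full, up-to-date copy of the committed records; the function names and broker names are hypothetical.

```python
def tolerated_failures(replication_factor):
    """Number of server failures a replication factor of N can absorb (N - 1)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor - 1


def survives(replica_brokers, failed_brokers):
    """True if at least one broker holding a replica is still alive."""
    return bool(set(replica_brokers) - set(failed_brokers))


replicas = ["broker-1", "broker-2", "broker-3"]   # replication factor N = 3
max_failures = tolerated_failures(len(replicas))  # N - 1 = 2

two_down = survives(replicas, ["broker-1", "broker-2"])  # one copy remains
all_down = survives(replicas, replicas)                  # no copy remains
```

With three replicas, losing any two brokers still leaves one copy of the data, while losing all three does not.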

Real-time analytics is now frequently used for delivering advertisements, tracking abnormal behaviour, making popularity-based recommendations and ranking search results by relevance, to name but a few (Edureka!, 2015).

Kafka has a simple architecture. Input systems feed data into Producers, which normalise it and pass it on to the Kafka Cluster. There it is written to Topics, and Consumers subscribed to those Topics receive the data in near real time.

(TutorialsPoint.com, n.d.)

Apache ZooKeeper, which began as a Hadoop subproject, runs alongside the cluster to manage and coordinate all its nodes. It is primarily used to notify nodes about the status of other nodes and any faults that may have occurred (Cloudurable, 2017).

Being able to use data the moment it is captured and first enters the system is crucial to triggering targeted actions that drive business value.

References

Kafka Intro (2018) Apache Kafka: Introduction [Online] Kafka.apache.org, Available from: https://kafka.apache.org/intro

Jain, A. (2017) Using Apache Kafka for fault tolerant systems [Online] Medium.com, Available from: https://medium.com/@arpitjay099/using-apache-kafka-for-fault-tolerant-systems-61814e6fab23

Edureka! (2015) Fault Tolerance with Kafka [Online] SlideShare.net, Available from: https://www.slideshare.net/EdurekaIN/fault-tolerance-with-kafka-56429907

TutorialsPoint.com (n.d.) Apache Kafka – Cluster Architecture [Online] TutorialsPoint.com, Available from: https://www.tutorialspoint.com/apache_kafka/apache_kafka_cluster_architecture.htm

Cloudurable (2017) Kafka Architecture: Kafka Zookeeper Coordination [Online] Cloudurable.com, Available from: http://cloudurable.com/blog/kafka-architecture/index.html

