Apache Kafka — The Backbone of Modern Data-Driven Systems


In the world of distributed systems, data moves at lightning speed.
Thousands of microservices generate millions of events — orders, payments, logs, GPS updates, transactions — all needing to be processed, stored, and analyzed in real time.

How do modern giants like Zomato, Uber, Netflix, and LinkedIn handle this constant flow of data without collapsing under the load?

👉 The answer: Apache Kafka, one of the most widely adopted distributed event streaming platforms.


🚀 What is Apache Kafka?

Apache Kafka is a distributed, fault-tolerant, real-time event streaming platform that lets you:

  • Publish (write) data streams

  • Subscribe (read) data streams

  • Store them durably

  • Process them in real-time

In simple words:

Kafka is like a central nervous system for your applications —
continuously moving information between systems and microservices.


⚙️ How Kafka Works — The Core Building Blocks

Let’s understand Kafka’s architecture by breaking it down into its core components:

| Concept | Description |
| --- | --- |
| Producer | Sends (publishes) data to Kafka topics |
| Consumer | Reads (subscribes) data from Kafka topics |
| Topic | A category or stream name where data lives |
| Partition | A subset of a topic used for scaling and ordering |
| Broker | A Kafka server that stores topic data |
| Consumer Group | A group of consumers that share the work of reading data |

🧱 1. Topics

A Topic is a logical channel where messages are stored.

Think of it like a “table” in a database or a “queue” in a messaging system —
for example:

  • orders

  • payments

  • notifications

Each topic contains messages — events written by producers and read by consumers.


🧩 2. Partitions

Each topic is divided into one or more partitions. Each partition is an ordered, append-only log of messages.

Each message gets a unique offset, which is like its line number in that partition.

Example:
Topic orders with 3 partitions

orders-0 → [order#101][order#104][order#107]
orders-1 → [order#102][order#105][order#108]
orders-2 → [order#103][order#106][order#109]

This partitioning allows Kafka to scale horizontally — multiple brokers and consumers can process data in parallel.
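
To make the "line number" idea concrete, here is a minimal sketch of a single partition as an append-only list, where a record's offset is simply its index. The class name PartitionLog and its methods are made up for illustration and are not part of any Kafka API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one partition: an append-only list whose index is the offset.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Appends a record and returns the offset it was assigned.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Reads the record stored at the given offset.
    String read(long offset) {
        return records.get((int) offset);
    }

    // Offset that the next appended record will receive.
    long endOffset() {
        return records.size();
    }

    public static void main(String[] args) {
        PartitionLog orders0 = new PartitionLog();
        System.out.println(orders0.append("order#101")); // 0
        System.out.println(orders0.append("order#104")); // 1
        System.out.println(orders0.read(1));             // order#104
    }
}
```

Records are never modified in place; new ones only go on the end, which is why an offset keeps pointing at the same record for as long as that record is retained.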


🧰 3. Brokers

A Broker is a Kafka server that stores data and handles requests from producers and consumers.

Each broker manages one or more partitions.
A cluster typically has 3 or more brokers for fault tolerance and scalability.

Broker 1 → stores partition 0
Broker 2 → stores partition 1
Broker 3 → stores partition 2

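If one broker fails, Kafka moves leadership of its partitions to replicas on other brokers. Messages that were already replicated survive the failure, provided the topic has a replication factor greater than 1.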


🧩 4. Producers and Consumers

  • Producers write data to Kafka topics.

  • Consumers read that data from topics.

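Kafka producers can control which partition a message lands in by attaching a key.
For example, all messages with the same orderId key go to the same partition, which preserves their relative order (as long as the topic's partition count does not change).

Consumers read messages in order within each partition. Kafka does not guarantee ordering across the partitions of a topic.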
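
The key-to-partition mapping can be sketched as "hash the key, then take it modulo the partition count". The snippet below is a simplified stand-in: Kafka's default partitioner applies a murmur2 hash to the serialized key bytes, while this sketch uses Arrays.hashCode so it runs with the JDK alone. KeyPartitioner and partitionFor are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class KeyPartitioner {
    // Maps a key to a partition index in [0, numPartitions).
    // Assumption: Arrays.hashCode stands in for the murmur2 hash Kafka really uses.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        return Math.floorMod(hash, numPartitions); // floorMod keeps the result non-negative
    }

    public static void main(String[] args) {
        int first = partitionFor("order-101", 3);
        int second = partitionFor("order-101", 3);
        System.out.println(first == second); // true: same key, same partition
        System.out.println(partitionFor("order-102", 3)); // may or may not differ from order-101
    }
}
```

Because the result depends on numPartitions, adding partitions later can send an existing key to a different partition. That is why per-key ordering is only guaranteed while the partition count stays fixed.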


⚙️ 5. Consumer Groups

A Consumer Group is a set of consumers that share the load of reading a topic.

Each partition is read by only one consumer in the group —
but different consumer groups can read the same topic independently.

Example:

Topic: orders (3 partitions)
Consumer Group: order-service (3 consumers)
Consumer Group: analytics-service (3 consumers)

✅ Each group reads the full topic independently, tracking its own offsets
✅ With 3 consumers and 3 partitions, each consumer in a group handles one partition
✅ Perfect for parallel processing
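
Here is a rough sketch of how partitions might be divided among the consumers of one group. It deals partitions out round-robin, which is only one possible strategy; real Kafka clients use pluggable assignors (range, round-robin, sticky). GroupAssignment and assign are illustrative names, and the consumer list is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupAssignment {
    // Deals partitions 0..numPartitions-1 out to the consumers in turn.
    // Assumes `consumers` is non-empty.
    static Map<String, List<Integer>> assign(List<String> consumers, int numPartitions) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String consumer : consumers) {
            assignment.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < numPartitions; p++) {
            String owner = consumers.get(p % consumers.size());
            assignment.get(owner).add(p);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(assign(List.of("c1", "c2", "c3"), 3));       // {c1=[0], c2=[1], c3=[2]}
        System.out.println(assign(List.of("c1", "c2"), 3));             // {c1=[0, 2], c2=[1]}
        System.out.println(assign(List.of("c1", "c2", "c3", "c4"), 3)); // {c1=[0], c2=[1], c3=[2], c4=[]}
    }
}
```

The last case shows why adding more consumers than partitions does not help: the extra consumer sits idle, because a partition is never shared within a group.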


🔢 6. Offsets — Tracking Progress

Kafka tracks each consumer group’s position in every partition using an offset.

An offset is like a bookmark. It tells Kafka:

“This group has read partition 2 up to offset 9, so it should resume at offset 10.”

Committed offsets are stored in an internal topic called __consumer_offsets,
so if your service crashes, it can resume from its last committed position.
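
The bookmark behaviour can be sketched with an in-memory map standing in for __consumer_offsets. Everything here (OffsetStore, read, commit, position) is a hypothetical illustration, not the Kafka consumer API. One simplification to note: a group with no committed offset starts from 0 in this sketch, whereas a real consumer follows its auto.offset.reset setting.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OffsetStore {
    // Next offset to read, per "group/partition"; stands in for __consumer_offsets.
    private final Map<String, Long> committed = new HashMap<>();

    void commit(String group, int partition, long nextOffset) {
        committed.put(group + "/" + partition, nextOffset);
    }

    // Offset to resume from; 0 if the group never committed (a simplification).
    long position(String group, int partition) {
        return committed.getOrDefault(group + "/" + partition, 0L);
    }

    // Returns up to maxRecords starting at the committed position, without committing.
    List<String> read(String group, int partition, List<String> log, int maxRecords) {
        int start = (int) Math.min(log.size(), position(group, partition));
        int end = Math.min(log.size(), start + maxRecords);
        return new ArrayList<>(log.subList(start, end));
    }

    public static void main(String[] args) {
        OffsetStore store = new OffsetStore();
        List<String> partition2 = List.of("a", "b", "c", "d");
        System.out.println(store.read("order-service", 2, partition2, 2)); // [a, b]
        store.commit("order-service", 2, 2); // processed offsets 0 and 1, resume at 2
        // A restarted consumer in the same group picks up where the last one left off.
        System.out.println(store.read("order-service", 2, partition2, 2)); // [c, d]
    }
}
```

Committing only after a batch has been processed means a crash can cause some records to be re-read, but not skipped.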


🔁 7. Replication: Protecting Against Data Loss

Kafka ensures data durability using replication.

Each partition has:

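  • One Leader (handles all writes and, by default, all reads)

  • Zero or more Followers (replicas that copy the leader’s log)

If a leader fails, Kafka automatically promotes an in-sync follower to leader, so clients typically see only a brief interruption. How much data survives a failure depends on settings such as the replication factor, the producer’s acks, and min.insync.replicas.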

Example:

| Partition | Leader | Followers |
| --- | --- | --- |
| orders-0 | Broker 1 | Broker 2, Broker 3 |
| orders-1 | Broker 2 | Broker 1, Broker 3 |
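
Leader failover for one partition can be sketched as "promote the first follower that is still in sync". This is a toy model under that assumption; the real election is coordinated by the cluster controller. PartitionReplicas and brokerFailed are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

class PartitionReplicas {
    private int leader;
    private final List<Integer> inSyncFollowers;

    PartitionReplicas(int leader, List<Integer> followers) {
        this.leader = leader;
        this.inSyncFollowers = new ArrayList<>(followers);
    }

    int leader() {
        return leader;
    }

    // Handles a broker failure. A failed follower is dropped; a failed leader is
    // replaced by the first in-sync follower. Returns false when the leader failed
    // and no in-sync follower is left to take over.
    boolean brokerFailed(int brokerId) {
        if (brokerId != leader) {
            inSyncFollowers.remove(Integer.valueOf(brokerId));
            return true;
        }
        if (inSyncFollowers.isEmpty()) {
            return false; // partition stays offline until a replica comes back
        }
        leader = inSyncFollowers.remove(0);
        return true;
    }

    public static void main(String[] args) {
        PartitionReplicas orders0 = new PartitionReplicas(1, List.of(2, 3));
        orders0.brokerFailed(1);
        System.out.println(orders0.leader());        // 2
        orders0.brokerFailed(2);
        System.out.println(orders0.leader());        // 3
        System.out.println(orders0.brokerFailed(3)); // false
    }
}
```

The final call shows the limit of replication: once every replica is gone, the partition is unavailable.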

⚡ Real-Time Example — Zomato’s Event Flow

Let’s visualize how a food delivery app like Zomato might use Kafka:

Scenario:

A user places an order on the Zomato app.

Flow:

User → Order Service → Kafka Topic: "orders"
             ↓
     Kafka brokers store and replicate the message
             ↓
Consumers:
  - Payment Service → reads "orders" → starts payment
  - Restaurant Service → prepares food
  - Delivery Service → assigns rider
  - Notification Service → sends SMS/email

All these services run independently, asynchronously, and in parallel
thanks to Kafka’s event-driven architecture.
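
The decoupling in this flow comes from each service reading the same "orders" log at its own pace. Below is a minimal in-memory sketch of that fan-out. OrderFanOut and its methods are hypothetical, and each service is assumed to call subscribe before it polls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OrderFanOut {
    private final List<String> ordersTopic = new ArrayList<>();
    private final Map<String, Integer> nextIndexByService = new LinkedHashMap<>();

    // Registers a service, starting from the beginning of the log.
    void subscribe(String service) {
        nextIndexByService.put(service, 0);
    }

    void publish(String orderEvent) {
        ordersTopic.add(orderEvent);
    }

    // Returns the events this service has not seen yet and advances only its own position.
    List<String> poll(String service) {
        int from = nextIndexByService.get(service);
        List<String> unseen = new ArrayList<>(ordersTopic.subList(from, ordersTopic.size()));
        nextIndexByService.put(service, ordersTopic.size());
        return unseen;
    }

    public static void main(String[] args) {
        OrderFanOut kafka = new OrderFanOut();
        kafka.subscribe("payment");
        kafka.subscribe("delivery");
        kafka.publish("order#101");
        System.out.println(kafka.poll("payment"));  // [order#101]
        kafka.publish("order#102");
        System.out.println(kafka.poll("payment"));  // [order#102]
        System.out.println(kafka.poll("delivery")); // [order#101, order#102]
    }
}
```

A slow Delivery Service does not hold up the Payment Service: it catches up on both orders whenever it polls.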


🧮 Kafka in Numbers (Why It Scales)

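| Metric | Typical Value |
| --- | --- |
| Throughput | Up to millions of messages per second on large clusters |
| Latency | Often under 10 milliseconds, depending on configuration and load |
| Retention | Configurable (hours, days, weeks) |
| Fault tolerance | Automatic leader election |
| Cluster size | 3 to 1000+ brokers |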

This is why companies use Kafka for mission-critical data pipelines.


🧰 Kafka Ecosystem — Beyond the Basics

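Kafka’s ecosystem extends its capabilities far beyond simple messaging.
Let’s look at 4 commonly used components. Kafka Connect and Kafka Streams ship with Apache Kafka; Schema Registry and ksqlDB are separate projects from Confluent.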


🧩 1. Kafka Connect

A framework to move data between Kafka and external systems like MySQL, MongoDB, Elasticsearch, and S3.

Example:

MySQL → (Source Connector) → Kafka → (Sink Connector) → ClickHouse

You can stream database changes into Kafka and push processed data into warehouses or dashboards.


🧠 2. Schema Registry

Ensures data consistency between producers and consumers.

It stores message schemas (Avro, JSON Schema, Protobuf), validates them, and rejects schema changes that violate the configured compatibility rules.

Example:

  • Old schema → { orderId, status }

  • New schema → { orderId, status, paymentMode }

  • Schema Registry accepts the new schema only if it passes the compatibility check, so existing consumers don’t break.
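
A compatibility check boils down to asking "can a reader using schema R decode data written with schema W?". The sketch below uses a heavily simplified version of Avro's resolution rule: every field the reader expects must either be present in the written data or have a default in the reader's schema. It ignores field types and aliases, and SchemaCompat and canRead are hypothetical names, not Schema Registry APIs.

```java
import java.util.Map;
import java.util.Set;

class SchemaCompat {
    // readerFields maps field name -> whether the reader's schema declares a default for it.
    // writerFields is the set of field names present in the written data.
    static boolean canRead(Map<String, Boolean> readerFields, Set<String> writerFields) {
        for (Map.Entry<String, Boolean> field : readerFields.entrySet()) {
            boolean presentInData = writerFields.contains(field.getKey());
            boolean hasDefault = field.getValue();
            if (!presentInData && !hasDefault) {
                return false; // the reader needs a field it cannot obtain
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> oldData = Set.of("orderId", "status");
        Set<String> newData = Set.of("orderId", "status", "paymentMode");

        // Old consumer reading new data: the extra paymentMode field is ignored.
        System.out.println(canRead(Map.of("orderId", false, "status", false), newData)); // true

        // New consumer reading old data: fails unless paymentMode has a default.
        System.out.println(canRead(
                Map.of("orderId", false, "status", false, "paymentMode", false), oldData)); // false
        System.out.println(canRead(
                Map.of("orderId", false, "status", false, "paymentMode", true), oldData)); // true
    }
}
```

This is why new fields are usually added with a default value: it keeps the change safe for both old and new consumers.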


⚙️ 3. Kafka Streams

A Java library for building real-time processing applications that consume from Kafka topics, transform the data, and write the results to other topics.

Example:

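StreamsBuilder builder = new StreamsBuilder();
// Order is a hypothetical value type with a numeric amount field
builder.<String, Order>stream("orders")
       .filter((orderId, order) -> order.amount > 500)
       .to("high-value-orders");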

Used for:

  • Fraud detection

  • Real-time analytics

  • Monitoring


💬 4. ksqlDB

A SQL layer built on top of Kafka Streams. You process streams using SQL instead of code.

Example:

CREATE STREAM high_value_orders AS
SELECT orderId, amount FROM orders
WHERE amount > 500;

Perfect for analysts or operations teams who prefer SQL over programming.


🧩 Putting It All Together — Real-Time Data Pipeline Example

Imagine a food delivery company’s architecture:

[ MySQL Orders Table ]
        ↓ (Source)
   Kafka Connect
        ↓
Kafka Topic: orders
        ↓
+-----------------------------+
| Kafka Streams / ksqlDB      |
|  → Filter high-value orders |
+-----------------------------+
        ↓
Kafka Topic: high-value-orders
        ↓
Kafka Connect (Sink)
        ↓
[ Elasticsearch / S3 / Data Warehouse ]

✅ Fully automated
✅ Real-time
✅ Scalable
✅ Fault-tolerant


⚡ Why Companies Love Kafka

| Feature | Benefit |
| --- | --- |
| Scalable | Add brokers & partitions easily |
| Durable | Stores data on disk with replication |
| Real-time | Stream processing with low latency |
| Reliable | Handles broker or service failures |
| Flexible | Works with microservices, analytics, IoT, etc. |
| Replayable | Can re-read old data anytime |
| Decoupled | Services don’t depend directly on each other |

🧱 Real-World Use Cases

| Industry | Use Case | Example |
| --- | --- | --- |
| 🚗 Ride-sharing | Live trip & driver updates | Uber, Ola |
| 🍕 Food delivery | Orders, payments, tracking | Zomato, Swiggy |
| 💳 Fintech | Fraud detection, transactions | Paytm, Razorpay |
| 🎬 Streaming | Real-time viewer analytics | Netflix, YouTube |
| 🛒 E-commerce | Inventory & order pipelines | Flipkart, Amazon |

🧠 Final Thoughts

Apache Kafka isn’t “just a queue.”
It’s a real-time distributed data backbone that powers some of the biggest systems in the world.

It enables:

  • Real-time communication between microservices

  • Stream processing for analytics

  • Scalable event-driven architecture

Whether you’re building a handful of microservices or an enterprise-scale data platform,
Kafka is a strong candidate to sit at its core.


💬 In one line:

Kafka is not just about messaging —
it’s about real-time, fault-tolerant, scalable data pipelines that connect everything in your ecosystem.