System Design for Software Engineers

Apache Kafka: Event Streaming at Scale

While RabbitMQ and SQS are great for traditional message queuing, they struggle with high-throughput event streaming (millions of events per second) and data retention. Apache Kafka is a distributed event streaming platform designed for high throughput, fault tolerance, and "replayability."

The key difference: In a traditional queue, messages are deleted after they are consumed. In Kafka, messages (events) are stored in an Immutable Log and can be kept for days, weeks, or even forever.

How Kafka Works

Topic: A category or feed name to which records are published.
Partition: Topics are divided into partitions for scalability. Each partition is an ordered, immutable sequence of records.
Producer: Sends records to one or more Kafka topics.
Consumer: Reads records from one or more Kafka topics.
Consumer Group: A group of consumers that work together to consume data from a topic. Each partition is consumed by exactly one consumer in the group.
Offset: A unique identifier for each record within a partition. Consumers track their progress using this offset.

Why Kafka?

High Throughput: Kafka can handle millions of messages per second by using sequential disk I/O and zero-copy data transfer.
Persistence (Replayability): Because Kafka stores events on disk, you can "rewind" your consumers to re-process old data (e.g., if you find a bug in your analytics code).
Scalability: You can scale a topic by simply adding more partitions.
Fault Tolerance: Partitions are replicated across multiple brokers. If one broker fails, another takes over.

The Lifecycle of a Kafka Record

1
Step 1
The producer sends a message (Key: user_123, Value: click_event) to the web_logs topic. Kafka hashes the key to decide which Partition the message goes to (ensuring all events for user_123 go to the same partition).
2
Step 2
The broker receives the message and appends it to the end of the partition's log file on disk. It assigns the message a new Offset (e.g., 501).
3
Step 3
The leader broker for that partition sends the message to its followers. Once enough followers acknowledge, the message is considered 'committed'.
4
Step 4
A consumer in a consumer group asks Kafka for the next message in Partition 0. Kafka looks up the consumer's current Offset (e.g., 500) and returns message 501.
5
Step 5
After processing the message, the consumer tells Kafka: 'I have successfully finished up to offset 501'. Kafka stores this information so that if the consumer restarts, it knows exactly where to resume.

Kafka vs. Traditional MQ (RabbitMQ)

Feature	RabbitMQ	Apache Kafka
Model	Push-based	Pull-based
Persistence	Messages deleted after ACK	Messages stored (Log-based)
Throughput	High	Massive (Scale of TBs/sec)
Ordering	Guaranteed per queue	Guaranteed per partition
Use Case	Complex routing, task queues	Logging, stream processing, real-time analytics

Common Mistakes

Too Few Partitions: Setting a topic to have only 1 partition. This means only 1 consumer can read from it at a time, limiting your horizontal scale.
Massive Messages: Trying to send 10MB messages through Kafka. Kafka is optimized for small messages (kilobytes). Use S3 for the large payload.
Ignoring Retention Policies: Setting your retention to 7 days but running out of disk space in 2 days because of unexpected traffic spikes.

Recap

Kafka is a Distributed Log, not a traditional queue.
Partitions are the unit of parallelism and scalability.
Offsets allow consumers to track their own progress and "replay" history.
Kafka is the backbone of modern Event-Driven Architectures.

Knowledge Check

Question 1 of 3

Q1Single choice

In Kafka, what happens to a message after it is successfully read by a consumer?

It is immediately deleted from the broker

It remains in the log until the retention period (time or size) is reached

It is moved to a 'dead letter partition'

It is sent back to the producer

Pub/Sub vs. Message Queuing: Patterns of Distribution

Designing for Reliability in Async Systems