Apache Kafka - Part 1
In this post, I share my notes on Apache Kafka, based on Stéphane Maarek's course. Enjoy!
Topics, partitions and offsets
- A topic is a particular stream of data.
- A Kafka cluster may have many topics.
- The sequence of messages in a topic is called a data stream.
- Topics are split into partitions. Messages within each partition are ordered and indexed with incremental ids called offsets. The offsets keep increasing as new messages arrive.

- Kafka topics are immutable: once data is written to a partition, it cannot be changed.
- Data is kept only for a limited time (the default is one week, but it's configurable).
- Order is guaranteed only within a partition, not across partitions.
- Data is assigned randomly to a partition unless a key is provided.
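The key-to-partition logic above can be sketched as follows. This is illustrative only: Kafka's real default partitioner uses a murmur2 hash, and CRC32 is used here just to keep the example self-contained.

```python
# Illustrative sketch only: the real default partitioner uses a
# murmur2 hash; CRC32 is used here to keep the example self-contained.
import zlib

NUM_PARTITIONS = 3

def partition_for(key, round_robin_state):
    """Pick a partition: hash the key if present, else round robin."""
    if key is None:
        round_robin_state[0] += 1
        return round_robin_state[0] % NUM_PARTITIONS
    # A stable hash means the same key always lands on the same partition.
    return zlib.crc32(key) % NUM_PARTITIONS

state = [-1]
# Same key -> same partition, every time.
assert partition_for(b"user-42", state) == partition_for(b"user-42", state)
# Null keys cycle through partitions 0, 1, 2, 0, ...
print([partition_for(None, state) for _ in range(4)])  # [0, 1, 2, 0]
```

This is why keyed messages preserve per-key ordering: every message for `user-42` ends up in the same partition, and ordering is guaranteed within a partition.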
Producers
- Producers write data to topics.
- Producers know which partition to write to (and which Kafka broker holds it).
- Producers can choose to send a key with the message (string, number, binary, etc.).
- If the key is null, the partition is chosen by round robin.
- If the key is not null, all messages with the same key go to the same partition (key hashing).
- What does a Kafka message look like?

- The key and the value are serialized into bytes before they are sent to Kafka; the consumer deserializes them after consuming the message (messages travel through Kafka as bytes).
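A minimal sketch of this round trip, assuming a string key and an integer value (mimicking the idea behind Kafka's StringSerializer and IntegerSerializer, but without any client library):

```python
# Producer-side serialization and consumer-side deserialization, sketched
# without a Kafka client. Kafka itself only ever sees raw bytes.
import struct

def serialize_key(key):
    return key.encode("utf-8")

def serialize_value(value):
    # 4-byte big-endian int, like the idea behind Kafka's IntegerSerializer.
    return struct.pack(">i", value)

def deserialize_key(raw):
    return raw.decode("utf-8")

def deserialize_value(raw):
    return struct.unpack(">i", raw)[0]

key_bytes, value_bytes = serialize_key("user-42"), serialize_value(1234)
# On the wire, both are plain bytes:
assert isinstance(key_bytes, bytes) and isinstance(value_bytes, bytes)
# The consumer reverses the transformation:
assert deserialize_key(key_bytes) == "user-42"
assert deserialize_value(value_bytes) == 1234
```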

Consumers
- Consumers read data from a topic (pull model).
- Consumers know which broker and partition to read from.
- In case of broker failures, consumers know how to recover.
Consumer Deserializer
- Deserializes binary data to its original form.

- The consumer must know what kind of data is being sent through the topic. This means a topic can only carry a single type of data, which can't change during the topic's lifetime.
Consumer Groups
- All the consumers in an application read data as a consumer group.
- Each consumer within a group reads from exclusive partitions.

- If you have more consumers than partitions, some consumers will be inactive.

- Multiple consumer groups can read from the same topic.

- Consumer groups can be thought of as services.
- Consumers know which group they belong to via their group.id property.
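A toy version of how a group might spread partitions across its consumers (roughly the idea of a round-robin assignor; in reality the assignment is negotiated through a group coordinator in the cluster):

```python
# Toy partition assignment for a consumer group: each partition goes to
# exactly one consumer; extra consumers end up with nothing to do.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 2 consumers: every partition has exactly one reader.
print(assign([0, 1, 2], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1]}

# More consumers than partitions: c4 gets no partition and stays idle.
print(assign([0, 1, 2], ["c1", "c2", "c3", "c4"]))
# {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
```

Two different groups would each get their own full copy of this assignment, which is how several services can independently read the same topic.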
Consumer Offsets
- Kafka stores the offsets at which a consumer group has been reading.
- When a consumer in a group is processing data received from Kafka, it should periodically commit the offsets.
- If a consumer dies, it will be able to resume from where it left off thanks to the committed consumer offsets.
- By default, Java consumers automatically commit offsets (at-least-once).
- There are 3 delivery semantics if you choose to commit manually:
- At least once (usually preferred):
- Offsets are committed after the message is processed.
- If the processing goes wrong, the message will be read again.
- This can result in duplicate processing, so the user should make sure the system won't be impacted by messages being processed again (idempotent processing).
- At most once:
- Offsets are committed as soon as messages are received.
- If the processing goes wrong, some messages will be lost (they won’t be read again)
- Exactly once:
- Messages are processed only once. For Kafka-to-Kafka workflows this is possible with the Transactional API; for Kafka-to-external-system workflows, use an idempotent consumer.
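The difference between at-least-once and at-most-once boils down to when the offset is committed relative to processing. A toy simulation, with a plain integer standing in for the committed offset (no real broker or client involved):

```python
# Toy simulation of at-least-once vs at-most-once delivery. The only
# difference is whether the offset is committed before or after processing.
def consume(messages, commit_before_processing, fail_at=None):
    committed, processed = 0, []
    try:
        for offset, msg in enumerate(messages):
            if commit_before_processing:      # at-most-once
                committed = offset + 1
            if offset == fail_at:
                raise RuntimeError("crash while processing")
            processed.append(msg)
            if not commit_before_processing:  # at-least-once
                committed = offset + 1
    except RuntimeError:
        pass
    return committed, processed

# At-least-once: we crash before committing offset 1, so a restarted
# consumer resumes at offset 1 and "b" is retried (possible duplicate).
print(consume(["a", "b", "c"], commit_before_processing=False, fail_at=1))
# (1, ['a'])

# At-most-once: offset 1 was already committed when we crashed, so a
# restarted consumer resumes at offset 2 and "b" is lost forever.
print(consume(["a", "b", "c"], commit_before_processing=True, fail_at=1))
# (2, ['a'])
```

In both runs only "a" was actually processed; what differs is the committed offset, which decides whether "b" gets retried or skipped after a restart.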