18/05/2024 - KAFKA
Apache Kafka is a distributed event store and streaming platform. In this post I'll do my best to explain Kafka, its components and some tradeoffs without getting into too much details.
Zookeeper is used for coordination and tracking the status of Kafka cluster nodes. It also keeps track of Kafka topics, partitions, offsets so on.
This is not unique to Kafka hence it is rather a common terminology in the distributed computing system. It is just a group of servers that work towards a common purpose. Given Kafka is a distributed system, its cluster can have a group of servers which are called brokers. There are two types of clusters which are Single Broker Cluster and Multi-Broker Cluster.
A Kafka broker is a server that runs the Kafka software and stores data. In a Kafka cluster, there can be one or more brokers. It is responsibility for event exchange between a producer and a consumer. For Kafka producer, it acts as a event receiver and for Kafka consumer, it acts as a event sender.
Events are a significant change in state. They are triggered when "something" happens in the system. When you read or write data to Kafka, you do this in the form of events. Conceptually, an event has a key, value, timestamp and optional metadata headers.
They represent "categories/groups" under which data or events are stored. A Kafka Topic is a unique name given to a data stream or event stream. For example, "alarms".
A Producer is an entity whose responsibility is to publish/send (write) events to Kafka. This is your program. The event is always sent to selected topic.
A Consumer is an entity whose responsibility is to subscribe/receive (read) events from Kafka and process them as per their business logic. This is your program. A consumer can subscribe to one or more topics to receive the events.
It represents group of consumers where multiple consumers are combined to share the workload between. It is possible that multiple consumer groups subscribing to the same or different topics. Two or more consumers in same consumer group do not receive the same message. Each consumer always receive a different message so no duplicate processing. This is because the offset moves to the next number as soon as the message is consumed by any of the consumers in that consumer group.
Each topic in Kafka can be divided into partitions. Short version, partitions as segments within a topic which enable Kafka to manage data more efficiently. When a producer writes a data to the broker using a topic, broker stores the data linked to that topic. In case of a massive set of data, broker would struggle to store it in a single machine. Given Kafka is distributes system, we can benefit from it by breaking the topic in partitions and distribute the partitions on a different machine to store data. The number of partitions for a topic is defined at topic creation.
In each partition of a Kafka topic, there is a sequence number (Offset) is assigned to each message. As soon as a new message arrives in a partition, a sequence number is assigned to the message. For any given topic, each partitions would have different offsets. This offset number is local to the topic partition so no concept of global applies here. Based on this, if you wish to locate/identify a message, you can construct name using "Topic Name - Partition No - Offset No".
In Kafka, replication means that data is written down not just to one broker, but many. The replication factor is a topic setting and is specified at topic creation. A replication factor of 1 means no replication.
This is just a very very short list but contains some "must-know" ones. You can use this visualisation tool if you wish.
retention.ms
and delete.retention.ms
settings are configured at topic creation. Actual deletion kicks in slightly longer than the specified duration. e.g. (retention ms + ~5 minutes)sarama.ProducerMessage.Key
field. Which partition the message would go is decided based on the hashing (HashPartitioner
) of the message key. Just bear in mind, if the key variation is low but the partition count is high, it is highly likely be possible that the most of the partitions won't get any messages. e.g. Given there are 6 key variations and 30 partitions, only 6 partitions would get messages which beats the purpose of partitions. When not sure, avoid setting sarama.ProducerMessage.Key
to allow RandomPartitioner
handle message allocation on partitions. This would at least evenly distribute messages and improve consumption performance.A message will definitely be written once but may possibly written more than once too.
A message will definitely be written once at max but may possibly not written at all too.
A message will definitely be written once.
isolation_level=1
- ref. If you don't do this, consumer will consume uncommitted and aborted messages.acks=all
- ref.enable.idempotence=true
- ref.auto.create.topics.enable=false
- ref. Handle the process manually by a dedicated service.// List all topics
$ kafka-topics.sh --bootstrap-server localhost:9092 --list
errors-topic
warnings-topic
// List details of a topic
$ kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic errors-topic
Topic: errors-topic TopicId: EMvnZ6AXQ1O50G4yh4dLDA PartitionCount: 3 ReplicationFactor: 1 Configs: retention.ms=3600000,delete.retention.ms=3600000
Topic: errors-topic Partition: 0 Leader: 1 Replicas: 1 Isr: 1
Topic: errors-topic Partition: 1 Leader: 1 Replicas: 1 Isr: 1
Topic: errors-topic Partition: 2 Leader: 1 Replicas: 1 Isr: 1
// List all consumer groups
$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
errors-topic-group
warnings-topic-group
// List details of consumer group
$ kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group errors-topic-group
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
errors-topic-group errors-topic 0 1 1 0 inanzzz-server-2e6992aa-107b-4eb4-82df-32478a7b9478 /172.31.0.1 inanzzz-server
errors-topic-group errors-topic 1 - 0 - inanzzz-server-2e6992aa-107b-4eb4-82df-32478a7b9478 /172.31.0.1 inanzzz-server
errors-topic-group errors-topic 2 - 0 - inanzzz-server-2e6992aa-107b-4eb4-82df-32478a7b9478 /172.31.0.1 inanzzz-server
// Purge messages in a topic by changing retention from 1 hour (3600000) to 1 second (1000). You should revert it afterwards!
$ kafka-configs.sh --bootstrap-server localhost:9092 --topic errors-topic --alter --add-config retention.ms=1000
Completed updating config for topic errors-topic.