Kafka, Absolute Basics


Kafka is Apache’s open-source platform for distributed stream processing in high throughput environments. The following post will briefly go through Kafka and messaging concepts in general.

Kafka is a distributed publish/subscribe fault tolerant messaging system with partitioning and replication abilities that is based on distributed commit logs. It can handle high volumes of message streams (more than 1.5 million messages per second). Kafka heavily depends on the Apache Zookeeper maintain its internal states.

Topic

Topic is a logical concept; a named feed of messages to be exchanged.

Partition

A partition is an immutable sequence of topic messages in a physical log file.

  • A broker could have one or more partitions.
  • One partition cannot be split across machines.

Kafka Message

is an entity with a arrival timestamp, a unique id and the binary data payload.
Technically, there is no such type as ‘Message‘ within the entire API; it is called ProducerRecord which can be instantiated using the required ‘topic‘ and ‘value‘ properties.

  • Topic: The topic to which this record belongs.
  • Key: A convenient property that producer’s default-partitioner can use its hash form to decide and target the dispatching partition.
  • Value: Essentially the message content to be serialized using the specified serializer.
  • Partition: The specific partition (in a broker) to which the record will be dispatched.
  • Timestamp: The 8 bytes timestamp data transmitted with the message.

Producer & consumer

components interested in sending and receiving messages to and from topics.

  • Kafka won’t allow more than 2 consumers to read from the same partition simultaneously.

Broker

A broker is a software process or node by which Kafka manages topics. They can be scaled out to achieve a higher load distribution, throughput and fault tolerance.

Cluster

is a group of brokers (a leader and followers) on one or more hosts

Message Offset

is a non-global zero-based immutable sequence-number of a message within a partition that is maintained by consumer.
enables consumers to operate independently by representing the last read position the consumer has read from a partition within a topic

current position:

the last committed offset: the last record that the consumer has confirmed to have processed
uncommitted offsets: the records between the current position and the last committed offset
log-end offset:

for any given topic, a consumer may have multiple offsets being tracked
one for each partition within a topic

Retention Policy

is Kafka’s configurable period during which the messages are retained by Kafka cluster for a topic.

Replication Factor

is the data redundancy factor by which Kafka replicates messages among brokers.

In-Sync Replicas (ISR)

is the number of in-sync brokers in a topic’s replica-set or quorum.

Controller

is an elected broker within a cluster for carrying out administrative tasks (e.g. managing the states of partitions and replicas).

Partition Leader

is a broker responsible for recruiting workers for replicating and propagating messages.

Follower/Worker

is a broker that only replicates the leader. It’s good to mention that a broker can simultaneously be the leader for one topic a worker for another topic.

Message Serializers/Deserializers

Message serializers define how the key/value pair in a message are to be encoded & decoded by producer and consumer respectively. The producer will then be bound to only dispatching records that match with the configured serializer types.

Deserializers must correspond to the producer’s serializers type

Delivery Acknowledgement

The level of acknowledgement demanded by the producer for its sent messages.

  • Fire and Forget: producer asks for no acknowledgement from any broker (common for lossy messaging scenarios)
  • Leader acknowledged: producer asks only the leader broker to confirm instead of waiting for all quorum members confirmations.
  • Quorum acknowledged: producer asks all in-sync replicas to confirm the receipt (higher performance cost)

Delivery Semantics

There are 3 basic message delivery semantics that can be configured in Kafka.

  • At-most once: for each handed message, it could either be lost or delivered at most one time.
  • At-least once: for each handed message, it shall be delivered at least one time (probable duplicates and multiple attempts)
  • Exactly once: for each handed message, it shall be delivered to the consumer exactly once.

Micro-batching

Micro-batching is the technique of treating a stream as a sequence of small data chunks. Its intention is to reduce processing cost though amortizing the t broker and read-requests at consumer side.

Message Ordering

In Kafka, the message ordering is only maintained within the partition. In order to achieve ordering on the topic (across partitions), the ordering logic needs to be provided at the consumer side since ordering mistakes are inevitable due to unpredictable network issues.

Routing Strategies

This is the strategy based on which, the producer determines to which partition the record shall be routed to.

  • Direct: Direct message dispatch to a partition in case a valid one is targeted.
  • Round-Robin: Equally distributed message dispatch using default partitioner and no key.
  • Key Mod-Hash: A hash-based strategy applied when using a key together with the default partitioner.
  • Custom: A custom based strategy applied when the key and custom partitioner are defined.

Zookeeper

In a nutshell, Zookeeper is a distributed key-value store used for maintaining configuration information, distributed synchronization, group services and etc. for various distributed applications.

Zookeeper Ensemble

Share Comments