About Apache Kafka


Apache Kafka is a distributed event streaming platform used for high-throughput messaging and stream processing.

The basic components of Kafka are:

  • Kafka Broker – the server that clients interact with. It listens on a TCP port for connections; the default port is 9092
  • Producer – publishes content to the broker
  • Consumer – consumes content from the broker
  • Connections – the TCP connections established between Producers and the Kafka Broker, and between the Broker and Consumers
  • Topics – named logical channels that Producers write to. When a Producer writes data, it has to specify which Topic to write to; likewise, a Consumer has to specify which Topic to consume data from. As Topics grow large, Kafka shards their data by dividing each Topic into Partitions
  • Partitions – can physically reside on different servers, which provides horizontal scalability. In these terms, Producers and Consumers address not only a Topic but also the Partition they are publishing/reading data to/from. A Kafka Consumer therefore deals with Partitions, and the records within a Partition are stored in a strict sequential order
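To make partitioning concrete, here is a minimal sketch of how a producer can map a record key to a partition. Note the assumptions: real Kafka hashes keys with murmur2 inside the Java client; the `hashlib.md5`-based `choose_partition` below is an illustrative stand-in for that idea, not Kafka's actual algorithm.

```python
import hashlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Deterministically map a record key to a partition index.

    Illustrative only: Kafka's Java client uses murmur2, not MD5.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records with the same key always land in the same partition,
# which is what preserves per-key ordering in Kafka.
p1 = choose_partition("user-42", 3)
p2 = choose_partition("user-42", 3)
assert p1 == p2
assert 0 <= p1 < 3
```

The design point this illustrates: because the mapping is deterministic, all records for one key form a single ordered sequence inside one partition, while different keys spread across partitions for scalability.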

Kafka supports both the Queue model (a message is published once and consumed once) and the Pub/Sub model (a message is published once and consumed many times). For this, Kafka provides:

  • Consumer Group – Consumer Groups provide parallel processing across Partitions. For instance, if a Topic consists of two Partitions and there is one Consumer, that one Consumer has to work with both Partitions. If there are two Consumers in a Consumer Group, then each of them can deal with its own Partition (Consumer 1 with Partition 1, Consumer 2 with Partition 2) in parallel. In general, one Consumer can consume data from more than one Partition, but each Partition feeds at most one Consumer within a given Consumer Group (so with multiple Consumer Groups, a Partition can feed at most one Consumer from each Group). To act like a Queue, all Consumers are put in one Group: each Partition is tied to only one Consumer, and as data is read, the Group's committed offset for that Partition advances, so no other Consumer in the Group reprocesses it. To act like Pub/Sub, each Consumer is put in its own unique Group; that way every Partition can be consumed by multiple Consumers from different Groups.
  • Distributed System – Kafka usually runs on more than one Kafka Broker, and Partitions are replicated across them, which makes Kafka a durable system for storing data. This is where ZooKeeper comes on stage. Each Partition has one Leader replica and one or more Follower replicas. ZooKeeper supports the election of the Leader for each Partition; a Producer publishes data to the Leader first, and the data is then copied to the Followers. Consumers likewise fetch from the Leader.
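The queue-vs-pub/sub behavior above comes down to how partitions are assigned within a group. The sketch below simulates that assignment rule in plain Python (it is not the Kafka API; the round-robin `assign_partitions` helper is a hypothetical stand-in for the broker-side group coordination):

```python
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Assign each partition to exactly one consumer in a group
    (round-robin), mirroring the one-consumer-per-partition rule."""
    assignment = defaultdict(list)
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return dict(assignment)

partitions = ["p0", "p1"]

# Queue: two consumers in ONE group split the partitions between them.
queue = assign_partitions(partitions, ["c1", "c2"])
# -> {"c1": ["p0"], "c2": ["p1"]}

# Pub/Sub: each consumer sits in its OWN group, so each one
# independently receives every partition's data.
pubsub = {group: assign_partitions(partitions, [group])
          for group in ["g1", "g2"]}
# -> {"g1": {"g1": ["p0", "p1"]}, "g2": {"g2": ["p0", "p1"]}}
```

Note a consequence of this rule: consumers in a group beyond the partition count sit idle, since a partition never feeds two consumers of the same group.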

Kafka Streams API is a higher level of abstraction for working with Kafka. Developers often want to modify data in flight (so-called Extract, Transform and Load, ETL), and Kafka Streams is a very useful framework for this: data from multiple streams can be combined, joined, and enriched, and the results are written out to other streams. A Kafka stream can be thought of as an unbounded, continuous, real-time flow of records. The Kafka Streams API is part of the open-source Apache Kafka project: it can be used from standard Java applications and microservices, called from Java or Scala, and is supplied as a jar file that can be declared as a Maven or Gradle dependency. The Kafka Streams API interacts with a Kafka Cluster.
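The "unbounded flow of records transformed step by step" idea can be shown with a toy analogue in plain Python. To be clear about assumptions: the real Streams API is a Java/Scala library; the `word_count` generator below only mirrors the shape of a map/aggregate topology, with each input record split into words and a running count emitted downstream.

```python
from collections import Counter

def word_count(stream):
    """Consume a (potentially unbounded) stream of text records and
    emit a running (word, count) pair for every word seen."""
    counts = Counter()
    for record in stream:                      # records arrive continuously
        for word in record.lower().split():    # transform: split into words
            counts[word] += 1                  # aggregate: running count
            yield word, counts[word]           # write to the output stream

events = ["Kafka streams", "kafka topics"]
out = list(word_count(events))
# -> [("kafka", 1), ("streams", 1), ("kafka", 2), ("topics", 1)]
```

Because `word_count` is a generator, it never needs the whole input at once, which is the essential property of stream processing: output is produced record by record as data flows in.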
