
Apache Kafka® is a distributed streaming platform. It is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, wicked fast, and runs in production in thousands of companies.
Kafka has three key capabilities:
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
In this post, I will walk through setting up a 3-node Kafka cluster.
Download Distribution
Download the 1.1.0 release and un-tar it.
$ tar -xzf kafka_2.11-1.1.0.tgz
$ cd kafka_2.11-1.1.0
Configuration
Pre-requisite
Both Kafka and ZooKeeper run on Java (JDK 8 or greater). I’m using Java 8. Have your preferred Java version installed and set JAVA_HOME correctly. You can do this easily using SDKMAN! Read my post here about SDKMAN to learn how to manage SDKs with it.
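Before going further, it can help to confirm which Java the shell actually picks up. The following is a minimal sketch, not part of the Kafka distribution; java_major is a hypothetical helper name, and the parsing assumes the usual `java -version` banner with the version in double quotes.

```shell
#!/bin/sh
# java_major is a hypothetical helper: it maps a Java version string
# to its major version (1.8.0_171 -> 8, 11.0.2 -> 11).
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;  # legacy "1.x" numbering
        *)   echo "$1" | cut -d. -f1 ;;  # numbering used since Java 9
    esac
}

if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr; the version is the quoted part of the banner.
    version=$(java -version 2>&1 | grep 'version "' | head -n 1 | cut -d'"' -f2)
    echo "JAVA_HOME=${JAVA_HOME:-<not set>}"
    echo "Java major version: $(java_major "$version")"
else
    echo "java not found on PATH"
fi
```

If the reported major version is below 8, or JAVA_HOME is not set, fix that first before starting ZooKeeper or Kafka.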
Zookeeper
Kafka uses ZooKeeper, so you need to start a ZooKeeper server first if you don’t already have one. I have already discussed how to set up a ZooKeeper cluster in my previous post; you can follow that to set up your own cluster, or you can use the ZooKeeper that ships with the Apache Kafka distribution. Here I will set up the ZooKeeper cluster from the Kafka distribution.
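Set the hostnames zk01, zk02, and zk03 on the three nodes. On each node, set an environment variable ZK_HOME to the path where you have extracted the Kafka distribution.
Create a file named zk-cluster.sh with the following content:
#!/bin/sh
# Usage: ./zk-cluster.sh <id>   (1, 2 or 3; must match this node's server.N entry)
MY_ID=$1
mkdir -p "$ZK_HOME/var/zk/data"
mkdir -p "$ZK_HOME/var/zk/log"
mkdir -p "$ZK_HOME/conf"
# The heredoc is unquoted, so $ZK_HOME is expanded when the file is written.
cat > "$ZK_HOME/conf/zk.properties" << EOF
dataDir=$ZK_HOME/var/zk/data
clientPort=2181
maxClientCnxns=0
server.1=zk01:2888:3888
server.2=zk02:2888:3888
server.3=zk03:2888:3888
initLimit=5
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
EOF
# ZooKeeper reads this node's id from the myid file in dataDir.
cat > "$ZK_HOME/var/zk/data/myid" << EOF
${MY_ID}
EOF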
Make the script executable (chmod +x zk-cluster.sh). Then, on node zk01, run:
./zk-cluster.sh 1
On node zk02, run:
./zk-cluster.sh 2
On node zk03, run:
./zk-cluster.sh 3
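Each node's myid has to match one of the server.N entries in zk.properties, otherwise the node cannot join the ensemble. A quick sanity check along these lines can catch a wrong argument early. This is only a sketch: check_myid is a hypothetical helper, not something shipped with ZooKeeper or Kafka.

```shell
#!/bin/sh
# check_myid is a hypothetical helper, not part of ZooKeeper or Kafka.
# Usage: check_myid <properties file> <myid file>
check_myid() {
    conf=$1
    id=$(tr -d '[:space:]' < "$2")
    if grep -q "^server\.${id}=" "$conf"; then
        echo "ok: myid ${id} -> $(grep "^server\.${id}=" "$conf" | cut -d= -f2)"
    else
        echo "error: no server.${id} entry in ${conf}"
        return 1
    fi
}

# Check the files generated by zk-cluster.sh, if they exist on this node.
if [ -n "${ZK_HOME:-}" ] && [ -f "$ZK_HOME/conf/zk.properties" ]; then
    check_myid "$ZK_HOME/conf/zk.properties" "$ZK_HOME/var/zk/data/myid"
fi
```

On zk02, for example, this should report that myid 2 maps to zk02:2888:3888.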
Now start ZooKeeper on each node by running:
$ screen -S zk
$ cd $ZK_HOME
$ ./bin/zookeeper-server-start.sh conf/zk.properties
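Once all three nodes are up, you may want to confirm that they formed a quorum. One way, sketched below, is to send ZooKeeper's `srvr` four-letter command to each node and read the reported mode. The assumptions here: nc (netcat) is installed, the client port is 2181 as configured above, and the ZooKeeper version bundled with this Kafka release answers `srvr` (newer ZooKeeper versions may restrict four-letter commands). zk_mode and the script name are hypothetical.

```shell
#!/bin/sh
# zk_mode is a hypothetical helper: it reads the reply to ZooKeeper's
# `srvr` command on stdin and prints the node's role.
zk_mode() {
    sed -n 's/^Mode: //p'
}

# Usage (script name is an assumption): ./zk-modes.sh zk01 zk02 zk03
for host in "$@"; do
    mode=$(echo srvr | nc -w 2 "$host" 2181 2>/dev/null | zk_mode)
    echo "$host: ${mode:-not responding}"
done
```

In a healthy 3-node ensemble, one node should report leader and the other two follower. Run it from a second terminal, or detach from the screen session first.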
Kafka
Set the hostnames kafka01, kafka02, and kafka03 on the three nodes. On each node, set an environment variable KAFKA_HOME to the path where you have extracted the Kafka distribution.
Create a file named kafka-cluster.sh with the following content:
#!/bin/sh
# Usage: ./kafka-cluster.sh <broker id> <hostname>
BROKER_ID=$1
HOST=$2
mkdir -p "$KAFKA_HOME/var/kafka/data"
mkdir -p "$KAFKA_HOME/var/kafka/log"
mkdir -p "$KAFKA_HOME/conf"
# The heredoc is unquoted, so the variables are expanded when the file is written.
cat > "$KAFKA_HOME/conf/kafka.properties" << EOF
broker.id=${BROKER_ID}
listeners=PLAINTEXT://${HOST}:9092
advertised.listeners=PLAINTEXT://${HOST}:9092
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=$KAFKA_HOME/var/kafka/log
num.partitions=1
num.recovery.threads.per.data.dir=1
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zk01:2181,zk02:2181,zk03:2181
zookeeper.connection.timeout.ms=6000
EOF
Make the script executable (chmod +x kafka-cluster.sh). Then, on node kafka01, run:
./kafka-cluster.sh 1 kafka01
On node kafka02, run:
./kafka-cluster.sh 2 kafka02
On node kafka03, run:
./kafka-cluster.sh 3 kafka03
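Because the script writes the properties file through an unquoted heredoc, values such as broker.id and log.dirs are filled in at generation time. If KAFKA_HOME was not set, or an argument was missed, the file will silently contain wrong values. A small read-back like the following can confirm what was written. It is a sketch only; prop_get is a hypothetical helper name.

```shell
#!/bin/sh
# prop_get is a hypothetical helper: print the value of an exact key
# from a Java-style properties file. Usage: prop_get <file> <key>
prop_get() {
    awk -F= -v key="$2" '$1 == key { print substr($0, length(key) + 2) }' "$1"
}

# Print the per-node settings from the generated file, if it exists.
conf="${KAFKA_HOME:-}/conf/kafka.properties"
if [ -f "$conf" ]; then
    for key in broker.id listeners advertised.listeners log.dirs zookeeper.connect; do
        echo "$key = $(prop_get "$conf" "$key")"
    done
fi
```

Each node should show its own broker id and hostname, and log.dirs should be an absolute path under your KAFKA_HOME.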
Now start Kafka on each node by running:
$ screen -S kafka
$ cd $KAFKA_HOME
$ ./bin/kafka-server-start.sh conf/kafka.properties
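To check that all three brokers registered with ZooKeeper and can replicate data, something like the sketch below can be run from any broker node. It uses the zookeeper-shell.sh and kafka-topics.sh tools from the distribution with the `--zookeeper` option that this Kafka release accepts. The topic name smoke-test and the helper broker_count are made up for this example, and the helper assumes that the `ls` output ends with a bracketed list such as `[1, 2, 3]`.

```shell
#!/bin/sh
# broker_count is a hypothetical helper: it reads the output of
# `zookeeper-shell.sh <host:port> ls /brokers/ids` on stdin and counts
# the ids in the final bracketed list (assumed to look like "[1, 2, 3]").
broker_count() {
    grep '^\[.*\]$' | tail -n 1 | sed 's/[][ ]//g' | tr ',' '\n' | grep -c .
}

ZK=zk01:2181   # any ZooKeeper node from this post
if [ -x "${KAFKA_HOME:-}/bin/kafka-topics.sh" ]; then
    cd "$KAFKA_HOME" || exit 1
    echo "registered brokers: $(./bin/zookeeper-shell.sh "$ZK" ls /brokers/ids 2>/dev/null | broker_count)"
    # "smoke-test" is an arbitrary topic name chosen for this example.
    ./bin/kafka-topics.sh --create --zookeeper "$ZK" \
        --replication-factor 3 --partitions 3 --topic smoke-test
    ./bin/kafka-topics.sh --describe --zookeeper "$ZK" --topic smoke-test
fi
```

A replication factor of 3 only succeeds when three brokers are available, so a successful create followed by a describe that lists three replicas per partition is a reasonable sign that the cluster is working.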
Congratulations! You have set up your own 3-node Kafka cluster.
Conclusion
Kafka is one of the most popular streaming platforms and is widely used across organizations. In this post, I have walked through setting up a 3-node Kafka cluster backed by a 3-node ZooKeeper ensemble. Thanks for reading, and feel free to comment or share the article.