Druid stores data in segments. Each segment is a single file, typically comprising up to a few million rows of data. Because each segment carries some memory and processing overhead, it can sometimes be beneficial to reduce the total number of segments. This tutorial demonstrates how to compact existing segments into fewer but larger segments using a Druid compaction task.
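As a rough illustration of what such a compaction task can look like, here is a minimal Python sketch that builds a task spec as JSON. The dataSource name, interval, and row limit are placeholder values, and the field layout is an assumption based on the general shape of a compaction spec: it differs between Druid versions (newer releases, for example, nest the interval under an `ioConfig` block), so check the documentation for your version before using it.

```python
import json


def build_compaction_task(data_source, interval, max_rows_per_segment=5_000_000):
    """Build a Druid compaction task spec as a plain dict.

    The field names here are an assumption about the spec layout and
    vary between Druid versions; all values are placeholders.
    """
    return {
        "type": "compact",
        "dataSource": data_source,
        # ISO-8601 interval covering the segments to merge.
        "interval": interval,
        "tuningConfig": {
            "type": "index_parallel",
            # Upper bound on rows per compacted segment.
            "maxRowsPerSegment": max_rows_per_segment,
        },
    }


# Hypothetical dataSource name and interval, for illustration only.
task = build_compaction_task("compaction-tutorial", "2015-09-12/2015-09-13")
print(json.dumps(task, indent=2))
```

The printed JSON would then be submitted to Druid as a task, typically by posting it to the Overlord's task endpoint.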
Once you have ingested data into a dataSource for an interval and created Druid segments, you might want to change that data. For example, to add or remove columns from existing segments, or to change their rollup granularity, you have to reindex the data. The Kafka Indexing Service can also produce a large number of segments, depending on topic partition and granularity configurations, in which case reindexing lets you reduce the segment count. All of these can be done by reindexing the data using Hadoop batch ingestion or native batch ingestion. In this article, I will demonstrate how to reindex data in Druid using native batch ingestion.
One of the most popular trends in the data world is stream analytics. Organizations are increasingly striving to build solutions that provide immediate access to key business intelligence insights through real-time data exploration. Using Apache Kafka and Druid, we can easily build an analytics stack that enables immediate exploration and visualization of event data. This tutorial demonstrates how to load data streams from a Kafka topic into Druid using the Druid Kafka indexing service.
Kafka provides a number of utility tools inside the distribution. The partition reassignment tool can be used to increase the replication factor of a topic. In this article, I am going to discuss how to increase a topic's replication factor using the partition reassignment tool. The purpose of replication in Kafka is stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures.
Increasing the replication factor of an existing partition is easy. We need to create a custom reassignment JSON file and use it with the --execute option to increase the replication factor of the specified partitions.
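To make the shape of that reassignment file concrete, here is a minimal Python sketch that generates one. The topic name `demo-topic` and the broker ids 1 to 3 are hypothetical; the helper `build_reassignment` is introduced only for illustration and is not part of Kafka. Each partition lists the full set of brokers that should hold a replica, so listing more brokers than the partition currently has is what raises its replication factor.

```python
import json


def build_reassignment(topic, assignments):
    """Build a partition reassignment plan for one topic.

    `assignments` maps a partition number to the list of broker ids that
    should hold its replicas. The first broker in each list is the
    preferred leader.
    """
    partitions = []
    for partition, replicas in sorted(assignments.items()):
        # A broker can hold at most one replica of a given partition.
        if len(set(replicas)) != len(replicas):
            raise ValueError(f"duplicate broker id for partition {partition}")
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": list(replicas)}
        )
    return {"version": 1, "partitions": partitions}


# Hypothetical topic and broker ids: both partitions end up with three replicas.
plan = build_reassignment("demo-topic", {0: [1, 2, 3], 1: [2, 3, 1]})
reassignment_json = json.dumps(plan, indent=2)
print(reassignment_json)
```

Saving the printed JSON to a file and passing it to the reassignment tool through its --reassignment-json-file option, together with --execute, applies the plan. Running the same command with --verify instead of --execute reports whether the reassignment has completed. How the tool connects to the cluster depends on the Kafka version.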
Kafka provides a number of utility tools inside the distribution. In this article, I am going to discuss some of the most frequently used commands related to Kafka producers and consumers. This tutorial requires you to have Kafka installed. See my previous posts to set up Kafka and Zookeeper if you haven't installed them yet. It also requires an existing topic. See my previous post on how to create and manage Kafka topics from the command line.