Druid is designed to be deployed as a horizontally scalable, fault-tolerant cluster. In this post, we’ll set up a simple cluster and discuss how it can be further configured to meet your needs. This simple cluster will feature scalable, fault-tolerant Data servers for ingesting and storing data, a single Query server, and a single Master server. Later, we’ll discuss how this cluster can be configured for high availability and how to scale out all server types. We will use the Imply Druid distribution.
An Imply cluster has three server types:
- Master server(s), running a Druid Coordinator, a Druid Overlord, metadata storage (Derby), and ZooKeeper.
- Data servers, running Druid Historical Nodes and Druid MiddleManagers.
- Query servers, running Druid Brokers and Imply Pivot.
The Master server coordinates data ingestion and storage in your Druid cluster. It is not involved in queries. It is responsible for starting new ingestion jobs and for handling failover of the Druid Historical Node and Druid MiddleManager processes running on your Data servers. Master servers can be deployed standalone, or in a highly-available configuration with failover. For failover-based configurations, it is recommended to separate ZooKeeper and the metadata store onto their own hardware.
Data servers store and ingest data. Data servers run Druid Historical Nodes for storage and processing of large amounts of immutable data, Druid MiddleManagers for ingestion and processing of data, and optionally Tranquility components to assist in streaming data ingestion. For clusters with complex resource allocation needs, you can break apart the pre-packaged Data server and scale the components individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, as well as eliminate the possibility of resource contention between historical workloads and real-time workloads.
Query servers are the endpoints that users and client applications interact with. Query servers run a Druid Broker that routes queries to the appropriate data nodes. They also include an Imply Pivot server as a way to directly explore and visualize your data.
For this simple cluster, you will need one Master server, one Query server, and as many Data servers as necessary to index and store your data.
Download the distribution
First, download Imply 2.5.16 from imply.io/get-started and unpack the release archive. It’s best to do this on a single machine at first, since you will be editing the configurations and then copying the modified distribution out to all of your servers.
tar -xzf imply-2.5.16.tar.gz
cd imply-2.5.16
In this package, you’ll find:
bin/ – run scripts for included software.
conf/ – template configurations for a clustered setup.
conf-quickstart/ – configurations for the single-machine quickstart.
dist/ – all included software.
quickstart/ – files related to the single-machine quickstart.
We’ll be editing the files in conf/ in order to get things running.
Configure Master server address
In this simple cluster, you will deploy a single Master server running a Druid Coordinator, a Druid Overlord, a ZooKeeper server, and an embedded Derby metadata store.
In conf/druid/_common/common.runtime.properties, update these properties by replacing “master.example.com” with the IP address of the machine that you will use as your Master server:
Alternatively, you can replace “master.example.com” with a short hostname such as “master” in these properties, and then add that hostname and its aliases to the /etc/hosts file on all three server types.
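For reference, the edited lines might end up looking like the following sketch. The property names below follow the standard Druid configuration; your template may differ slightly:

```properties
# ZooKeeper (runs on the Master server in this setup)
druid.zk.service.host=master.example.com

# Metadata storage (embedded Derby, hosted on the Master server)
druid.metadata.storage.type=derby
druid.metadata.storage.connector.connectURI=jdbc:derby://master.example.com:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=master.example.com
druid.metadata.storage.connector.port=1527
```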
Configure deep storage
Druid relies on a distributed filesystem or binary object store for data storage. The most commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if you already have a Hadoop deployment).
Add “druid-s3-extensions” to druid.extensions.loadList. If, for example, the list already contains “druid-parser-route”, the final property should look like:
Comment out the configurations for local storage under “Deep Storage” and “Indexing service logs”.
Uncomment and configure appropriate values in the “For S3” sections of “Deep Storage” and “Indexing service logs”.
After this, you should have made the following changes:
druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs
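If you are using HDFS rather than S3, the analogous changes would be along these lines (a sketch using the standard Druid HDFS storage properties; the directory paths are placeholders for your Hadoop deployment):

```properties
druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=/druid/indexing-logs
```

You would also need your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml) on the Druid classpath so the nodes can reach your Hadoop cluster.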
Start Master server
Copy the Imply distribution and your edited configurations to your new Master server. If you have been editing the configurations on your local machine, you can use rsync to copy them:
rsync -az imply-2.5.16/ MASTER_SERVER:imply-2.5.16/
On your Master server, cd into the distribution and run this command to start a Master:
bin/supervise -c conf/supervise/master-with-zk.conf
You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory from another terminal.
Start Query server
Copy the Imply distribution and your edited configurations to your Query servers. On each one, cd into the distribution and run this command to start a Query server:
bin/supervise -c conf/supervise/query.conf
The default Query server configuration launches a Druid Broker and Imply Pivot.
Start Data servers
Copy the Imply distribution and your edited configurations to your Data servers. On each one, cd into the distribution and run this command to start a Data server:
bin/supervise -c conf/supervise/data.conf
The default Data server configuration launches a Druid Historical Node, a Druid MiddleManager, and optionally Tranquility components. New Data servers will automatically join the existing cluster. These services can be scaled out as much as necessary simply by starting more Data servers.
High availability & fault tolerance
The cluster you just created runs a single Master server and a single Query server, and can run multiple Data servers. This supports scale-out, fault-tolerant Data servers out of the box, but some additional configuration is necessary to achieve the same for the Master and Query servers.
For the Master server, which runs Derby (metadata storage), ZooKeeper, and the Druid Coordinator and Overlord:
- For highly-available ZooKeeper, you will need a cluster of 3 or 5 ZooKeeper nodes. It is recommended to install ZooKeeper on its own hardware.
- For highly-available metadata storage it is recommended to use PostgreSQL or MySQL with replication and failover enabled. Sample, commented-out Druid configurations for both are included in common.runtime.properties in the Imply distribution.
- Configuring highly-available Druid Coordinators and Overlords is simple: just start up multiple servers. If they are all configured to use the same ZooKeeper cluster and metadata storage, then they will automatically failover between each other as necessary. Only one will be active at a time, but inactive servers will redirect to the currently active server.
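As a sketch, pointing every Master server at a shared MySQL instance and ZooKeeper ensemble might look like the following (hostnames, database name, and credentials are placeholders; property names follow the standard Druid configuration):

```properties
# Shared ZooKeeper ensemble
druid.zk.service.host=zk1.example.com,zk2.example.com,zk3.example.com

# Shared MySQL metadata storage (requires the mysql-metadata-storage extension)
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://db.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=your-password
```

With identical settings on each Master server, the Coordinators and Overlords coordinate leadership through ZooKeeper and fail over automatically.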
For the Query server:
- Druid Brokers can be scaled out and all running servers will be active and queryable. It is recommended to place them behind a load balancer.
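For instance, a minimal nginx configuration fronting two Brokers might look like the following sketch (hostnames are hypothetical; 8082 is the Druid Broker’s default port):

```nginx
# Pool of active Druid Brokers
upstream druid_brokers {
    server query1.example.com:8082;
    server query2.example.com:8082;
}

server {
    listen 80;
    location / {
        # Forward queries to the Broker pool
        proxy_pass http://druid_brokers;
    }
}
```

Clients then send queries to the load balancer, which distributes them across the active Brokers (round-robin by default in nginx).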
Data servers can be scaled out without any additional configuration.
Congratulations, you now have an Imply Druid cluster! Once things are up and running, you can visit the Druid Coordinator console at “http://MASTER_SERVER:8081/” and the Overlord console at “http://MASTER_SERVER:8090/console.html” in your browser.
See the documentation for more details.