Druid is an open-source data store designed for sub-second queries on real-time and historical data. It is primarily used for business intelligence (OLAP) queries on event data. Druid enables arbitrary data exploration, low-latency data ingestion, and fast aggregations at scale: it can scale to store trillions of events and ingest millions of events per second. Druid is best used to power user-facing data applications. In this post, we will download Druid, set it up on a single machine, load some data, and query the data. To install Druid, we will use the Druid distribution provided by Imply.
The easiest way to evaluate Imply is to install it on a single machine. In this post, we’ll set up the platform locally, load some example data, and visualize the data. Installing Imply on-premise offers several advantages over stock Druid:
- Imply includes a tested, stable release of Druid.
- Imply includes scripts to easily start and supervise servers and assist with ingesting data.
- Imply includes an interactive interface for exploring data and a SQL workbench for issuing Druid SQL queries.
- Imply includes a data loader interface for easily adding new datasets to Druid.
- Imply packaging requires fewer resources for POCs and small production deployments.
You will need:
- Java 8
- Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
- At least 4GB of RAM
First, download the latest stable version of Imply (currently 2.5.16) from imply.io/get-started and unpack the release archive.
tar -xzf imply-2.5.16.tar.gz
cd imply-2.5.16
In this package, you’ll find:
- bin/* – run scripts for included software.
- conf/* – template configurations for a clustered setup.
- conf-quickstart/* – configurations for this quickstart.
- dist/* – all included software.
- quickstart/* – files useful for this quickstart.
Start up services
bin/supervise -c conf/supervise/quickstart.conf
You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
Later on, if you’d like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var directory before starting them again.
Congratulations, now it’s time to load data!
The quickstart directory includes a sample dataset and an ingestion spec to process the data, named wikipedia-index.json.
To submit an indexing job to Druid for this ingestion spec, run the following command from your Imply directory:
bin/post-index-task --file quickstart/wikipedia-index.json
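A Druid native batch ingestion spec of this kind generally has three parts: a dataSchema (what the data looks like), an ioConfig (where to read it from), and a tuningConfig (how to index it). The sketch below shows that overall shape as a Python dict; the field values are illustrative assumptions, not the actual contents of wikipedia-index.json.

```python
# Sketch of the shape of a Druid native batch ingestion spec ("index" task).
# Field values are illustrative assumptions, not the real
# quickstart/wikipedia-index.json contents.
index_task = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "wikipedia",        # name of the resulting datasource
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "day",  # one segment per day
                "queryGranularity": "none",   # keep full timestamp precision
            },
        },
        "ioConfig": {
            "type": "index",
            # where the task reads its input from (hypothetical local file)
            "firehose": {
                "type": "local",
                "baseDir": "quickstart",
                "filter": "wikipedia-sampled.json",
            },
        },
        "tuningConfig": {"type": "index"},
    },
}

# The three sections answer: what schema? (dataSchema), read from where?
# (ioConfig), and how to index it? (tuningConfig).
assert set(index_task["spec"]) == {"dataSchema", "ioConfig", "tuningConfig"}
```

Because the quickstart ships a ready-made spec, you never need to write this by hand here, but knowing the shape helps when you load your own datasets later.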
A successful run will generate logs similar to the following:
Beginning indexing data for wikipedia
Task started: index_wikipedia_2017-12-05T03:22:28.612Z
Task log: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/log
Task status: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/status
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia loading complete! You may now query your data
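Under the hood, post-index-task submits the spec to the Druid overlord (on port 8090 by default) and then polls the task status endpoint until the task reaches a terminal state, which is what produces the "still running..." lines above. Here is a minimal sketch of that decision logic with the HTTP call stubbed out; the helper names are hypothetical, and the response shape is a simplified assumption based on the overlord's status API.

```python
# Sketch of the polling logic behind bin/post-index-task (helper names are
# hypothetical). The overlord's status endpoint returns JSON shaped roughly
# like {"task": "<id>", "status": {"status": "RUNNING" | "SUCCESS" | "FAILED"}}.

def is_done(status_response: dict) -> bool:
    """True once the task has reached a terminal state."""
    return status_response["status"]["status"] in ("SUCCESS", "FAILED")

def succeeded(status_response: dict) -> bool:
    """True if the task finished and finished successfully."""
    return status_response["status"]["status"] == "SUCCESS"

# A real implementation would GET
#   http://localhost:8090/druid/indexer/v1/task/<task_id>/status
# in a loop, sleeping a couple of seconds between polls.

running = {"task": "index_wikipedia", "status": {"status": "RUNNING"}}
done = {"task": "index_wikipedia", "status": {"status": "SUCCESS"}}
assert not is_done(running)
assert is_done(done) and succeeded(done)
```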
After the dataset has been created, you can move on to the next step to visualize data.
To access Imply, go to http://localhost:9095.
Imply’s data cubes are highly configurable and give you the flexibility to represent your dataset, as well as derived and custom columns, in many different ways.
Now switch to the Visualize section of Imply by clicking on the corresponding button on the top bar. From here, you can create data cubes to model your data, explore these cubes, and organize views into dashboards. Start by clicking + Create new data cube.
In the dialog that comes up, make sure that
wikipedia is the selected Source and that Auto-fill dimensions and measures is selected. Continue by clicking Next: Create data cube.
From here you can configure the various aspects of your data cube including defining and customizing the cube’s dimensions and measures. The data cube creation flow can intelligently inspect the columns in your data source and determine possible dimensions and measures automatically. We enabled this when we selected Auto-fill dimensions and measures on the previous screen and you can see that the cube’s settings have been largely pre-populated. In our case, the suggestions are appropriate so we can continue by clicking on the Save button in the top-right corner.
After clicking Save, the data cube view for this new data cube is automatically loaded. In the future, this view can also be loaded by clicking on the name of the data cube (in this example ‘Wikipedia’) from the Visualize screen.
Here, you can explore a dataset by filtering and splitting it across any dimension. For each filtered split of your data, you will see the aggregate value of your selected measures.
The data cube view suggests different visualizations based on how you split your data. If you split on a string column, your data will initially be presented as a table. If you split on time, the data cube view will recommend a timeseries plot, and if you split on a numeric column you will get a bar chart. You can also change the visualization manually by choosing your preferred visualization from the dropdown. If the shown dimensions are not appropriate for a particular visualization, the data cube view will recommend alternative dimensions you can show.
Imply includes an easy-to-use interface for issuing Druid SQL queries. To access the SQL editor, go to the Run SQL section. If you are in the visualization view, you can navigate to this screen by selecting Run SQL from the hamburger menu in the top-left corner of the page. Once there, try running the following query, which will return the most edited Wikipedia pages:
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 5
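To make the query's semantics concrete, here is the same filter / GROUP BY / ORDER BY / LIMIT logic expressed over a small in-memory list of rows. The sample rows are made up for illustration; they are not the actual Wikipedia data.

```python
from collections import Counter
from datetime import datetime

# Made-up sample rows standing in for the wikipedia datasource.
rows = [
    {"__time": datetime(2016, 6, 27, 10, 0), "page": "Copa America"},
    {"__time": datetime(2016, 6, 27, 11, 0), "page": "Copa America"},
    {"__time": datetime(2016, 6, 27, 12, 0), "page": "Brexit"},
    {"__time": datetime(2016, 6, 28, 9, 0), "page": "Too late"},  # outside range
]

start = datetime(2016, 6, 27)
end = datetime(2016, 6, 28)

# WHERE "__time" BETWEEN ... (SQL BETWEEN is inclusive on both ends)
in_range = [r for r in rows if start <= r["__time"] <= end]

# GROUP BY page, COUNT(*) AS Edits, ORDER BY Edits DESC, LIMIT 5
edits = Counter(r["page"] for r in in_range)
top5 = edits.most_common(5)
print(top5)  # [('Copa America', 2), ('Brexit', 1)]
```

In Druid, the `__time` filter also lets the broker prune to just the segments covering that interval, which is why time-bounded queries like this one are fast.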
You should see results like the following:
Congratulations! You have now installed and run Imply on a single machine, loaded a sample dataset into Druid, defined a data cube, explored some simple visualizations, and executed queries using Druid SQL.
For more information see: