In this article, I am going to demonstrate how to load data into Druid from a TSV file, using Druid's native batch ingestion with a TSV ParseSpec. I assume you already have a good understanding of Druid's architecture and have Druid installed and running. If not, see my previous post to quickly install and run Druid.

TSV ParseSpec

The TSV ParseSpec has the following components: format is a string and its value should be "tsv". timestampSpec is a JSON object that specifies the column and format of the timestamp. dimensionsSpec is also a JSON object and specifies the dimensions of the data. delimiter is […]
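For illustration, here is a minimal sketch of such a parseSpec; the column names and timestamp format are hypothetical, not taken from the actual walkthrough:

    # column names and timestamp format below are hypothetical
    {
      "type": "string",
      "parseSpec": {
        "format": "tsv",
        "timestampSpec": {"column": "timestamp", "format": "iso"},
        "dimensionsSpec": {"dimensions": ["page", "user", "country"]},
        "delimiter": "\t",
        "columns": ["timestamp", "page", "user", "country", "added"]
      }
    }

Since a TSV file carries no field names in the data itself (unless it has a header row), columns must list every field in the order it appears in the file.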
Continue Reading

Loading data into Druid from CSV file
In this article, I am going to demonstrate how to load data into Druid from a CSV file, using Druid's native batch ingestion with a CSV ParseSpec. I assume you already have a good understanding of Druid's architecture and have Druid installed and running. If not, see my previous post to quickly install and run Druid using the Imply distribution.

CSV ParseSpec

Ingesting data into Druid from a CSV file is quite similar to JSON data loading. We need to use a CSV ParseSpec with the String Parser to load CSV. Strings are parsed using the com.opencsv library. The CSV ParseSpec has the following components: […]
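As a rough sketch (column names again hypothetical), the CSV version differs from the TSV one mainly in the format value and in not needing a custom delimiter:

    # column names below are hypothetical
    {
      "type": "string",
      "parseSpec": {
        "format": "csv",
        "timestampSpec": {"column": "timestamp", "format": "auto"},
        "dimensionsSpec": {"dimensions": ["channel", "page", "user"]},
        "columns": ["timestamp", "channel", "page", "user", "added"]
      }
    }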
Continue Reading

Ingesting and Querying Multi-Value Dimensions in Druid
Druid supports "multi-value" string dimensions. These are generated when an input field contains an array of values instead of a single value. topN and groupBy queries can group on multi-value dimensions. When grouping on a multi-value dimension, all values from matching rows will be used to generate one group per value, so it's possible for a query to return more groups than there are rows. For example, suppose you have a dataSource with a segment that contains the following rows, with a multi-value dimension called tags:
    {"timestamp": "2018-11-01T00:00:00.000Z", "tags": ["t1","t2","t3"]}  #row1
    {"timestamp": "2018-11-02T00:00:00.000Z", "tags": ["t3","t4","t5"]}  #row2
A topN query on the dimension tags, with a filter of "t1" AND "t3", would match only row1 and generate a result with three groups: t1, t2, […]
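For illustration, a minimal sketch of that topN query (the dataSource name, interval, and threshold are hypothetical):

    # dataSource, intervals, and threshold below are illustrative
    {
      "queryType": "topN",
      "dataSource": "events",
      "intervals": ["2018-11-01/2018-11-03"],
      "granularity": "all",
      "dimension": "tags",
      "metric": "count",
      "threshold": 5,
      "filter": {
        "type": "and",
        "fields": [
          {"type": "selector", "dimension": "tags", "value": "t1"},
          {"type": "selector", "dimension": "tags", "value": "t3"}
        ]
      },
      "aggregations": [{"type": "count", "name": "count"}]
    }

A selector filter matches any row whose tags array contains the given value, so only row1 satisfies both conditions; the grouping then fans out over every value in the matched row, which is how a single matching row yields three groups.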
Continue Reading

Handling nested json during ingestion into Druid
In my previous article, I demonstrated how to perform a batch file load using Druid's native batch ingestion. I only showed handling of root-level JSON elements and intentionally skipped nested elements. That's because nested JSON needs special handling for ingestion into Druid: it has to be flattened first. For this, you can either convert your nested JSON to a flat structure with a custom application before ingesting it into Druid, or you can use a json-flatten-spec. In this post, I will demonstrate the latter.

Json Flatten Spec

Defining the […]
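As a sketch of the idea (the field names and JSON paths here are made up for illustration), a flatten spec pulls nested values up into top-level columns at parse time:

    # field names and paths below are hypothetical
    "parseSpec": {
      "format": "json",
      "flattenSpec": {
        "useFieldDiscovery": true,
        "fields": [
          {"type": "path", "name": "userCity", "expr": "$.user.address.city"},
          {"type": "path", "name": "firstTag", "expr": "$.tags[0]"}
        ]
      }
    }

Each path expression is evaluated against the raw JSON record, and the result is exposed to the timestampSpec and dimensionsSpec under the given name, just as if it had been a root-level field.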
Continue Reading

Loading Json data from batch file into Druid
In the case of time-series event data stored in a relational database, one event per row, if we need to calculate the number of events per hour, we'd select all rows within an overall interval, group those rows by hour, and count the rows in each hour group. If we have to run this query many times on the same data, that's a lot of repeated counting. For immutable events, a better approach is to do the counting once and store the results, so the work is already done when querying the data. This is much like […]
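In Druid terms, that do-the-counting-once approach is rollup, configured at ingestion time. A rough sketch of the relevant dataSchema pieces (the granularities and interval are illustrative): timestamps are truncated to the hour, and a count aggregator stores one pre-counted row per hour group instead of one row per event.

    # granularities and interval below are illustrative
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "day",
      "queryGranularity": "hour",
      "rollup": true,
      "intervals": ["2018-11-01/2018-11-02"]
    },
    "metricsSpec": [{"type": "count", "name": "count"}]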
Continue Reading