At our company we are using Druid to power our time-based (as well as our non-time-based) analytics dashboards. We looked into several options, such as a relational database (Postgres), ClickHouse, and Elasticsearch. Weighing the pros and cons of each, we finally decided on Druid, as it led most query benchmarks. In fact, those benchmarks are worth a future post of their own.
One of the features we needed to support was finding the top absolute changes in a given metric between two different time frames. For example: which country had the biggest change (meaning, the biggest increase or drop) in traffic when comparing today to yesterday? Supposedly an elementary feature, yet not supported natively by Druid, despite it being a time-based datasource.
In this post I would like to share how this was achieved. The idea is simple: create two new synthetic metrics, each a conditional sum. The first is the sum of the metric over all rows falling within the first period, and the second is the sum of the metric over all rows falling within the second period. Then add a post-aggregator that creates a third synthetic metric: the absolute difference between the two new metrics. We can now run a topN query on a metric that is the difference between the metric's sums in the two time intervals, as per the original requirement.
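Putting the pieces together, a sketch of such a query might look as follows. The datasource name, the aggregator names, the metric field "revenue", and the threshold are illustrative assumptions; the numeric constants are the Unix-millisecond bounds of 2018-09-06, 2018-09-07, and 2018-09-08 UTC:

```json
{
  "queryType": "topN",
  "dataSource": "some_datasource",
  "dimension": "some_dim",
  "metric": "diff",
  "threshold": 10,
  "granularity": "all",
  "aggregations": [
    {
      "type": "javascript",
      "name": "first_interval_sum",
      "fieldNames": ["__time", "revenue"],
      "fnAggregate": "function(current, t, revenue) { return (t >= 1536192000000 && t < 1536278400000) ? current + revenue : current; }",
      "fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
      "fnReset": "function() { return 0; }"
    },
    {
      "type": "javascript",
      "name": "second_interval_sum",
      "fieldNames": ["__time", "revenue"],
      "fnAggregate": "function(current, t, revenue) { return (t >= 1536278400000 && t < 1536364800000) ? current + revenue : current; }",
      "fnCombine": "function(partialA, partialB) { return partialA + partialB; }",
      "fnReset": "function() { return 0; }"
    }
  ],
  "postAggregations": [
    {
      "type": "javascript",
      "name": "diff",
      "fieldNames": ["first_interval_sum", "second_interval_sum"],
      "function": "function(a, b) { return Math.abs(a - b); }"
    }
  ],
  "intervals": ["2018-09-06/2018-09-07", "2018-09-07/2018-09-08"]
}
```

Note that JavaScript aggregators are disabled by default in Druid and must be enabled in the cluster configuration (`druid.javascript.enabled`). Let's walk through the query's fields: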
- queryType: “topN” is the type of query that breaks down our data by a single dimension and returns the top results by its values.
- dimension: “some_dim” is the dimension by which we want to break down the results.
- metric: “diff” tells Druid to sort the top results by the post-aggregator “diff”, which will be created in the postAggregations object.
- granularity: “all” is important, as we want the two intervals’ metrics returned in a single interval bucket spanning the whole time frame, since post-aggregators can only operate on metrics returned within the same interval bucket.
- “fieldNames”: the dimensions/metrics from the datasource that will be exposed to the function
- “__time” is the dimension corresponding to the row’s timestamp
- “fnAggregate”: this is the meat of the query. It implements the conditional logic for summing up each row’s revenue. ‘current’ is injected automatically by Druid and holds the value accumulated so far. Then comes the conditional clause: if the row’s time falls within the millisecond representation of the first interval, add the metric’s value to ‘current’; otherwise add zero. Notice that we converted 2018-09-06/2018-09-07/2018-09-08 to their Unix millisecond representations.
- “fnCombine”: simple arithmetic addition
- “fnReset”: the initial value, in our case zero.
- “fieldNames”: the names of the two previously created aggregators.
- “intervals”: we pass both intervals, corresponding to the periods we want to see the changes for.
- This solution works easily with any sum aggregator, but supporting it for HyperUnique would take more implementation work, as merging sketches is not as simple as adding numbers.
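To make the aggregator lifecycle concrete, here is a plain-JavaScript simulation (with hypothetical rows, not Druid code) of how fnReset, fnAggregate, and fnCombine compose: each partition of rows is folded with fnAggregate starting from fnReset's value, and the partial sums are merged with fnCombine:

```javascript
// Interval bounds in Unix milliseconds: 2018-09-06 and 2018-09-07 UTC.
const START = 1536192000000;
const END = 1536278400000;

// The three functions of a Druid JavaScript aggregator:
const fnReset = () => 0;
const fnAggregate = (current, t, revenue) =>
  (t >= START && t < END) ? current + revenue : current;
const fnCombine = (partialA, partialB) => partialA + partialB;

// Hypothetical data partitions; each row is [timestamp, revenue].
const partition1 = [[1536192000000, 10], [1536278400000, 99]]; // 2nd row is outside the interval
const partition2 = [[1536200000000, 5]];

// Fold one partition: start from fnReset(), apply fnAggregate per row.
const sumPartition = (rows) =>
  rows.reduce((acc, [t, rev]) => fnAggregate(acc, t, rev), fnReset());

// Merge the per-partition partial sums, as Druid does across segments.
const total = fnCombine(sumPartition(partition1), sumPartition(partition2));
console.log(total); // 15: only the rows inside the first interval are summed
```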