Druid, what the diff? Using Apache Druid to report top changes in a metric between two intervals

At our company we are using Druid to power our time-based (as well as our non-time-based) analytics dashboards. We looked into several options, such as a relational database (Postgres), ClickHouse, and Elasticsearch. Taking into account the pros and cons of each, we finally decided on Druid, as it was the leader in most query benchmarks. In fact,…
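The title names the core task: given per-dimension metric totals from two Druid queries (one per interval), rank the largest changes. A minimal sketch of that diff step in plain Python, assuming the two interval results have already been fetched (e.g. from two Druid groupBy or topN queries) and flattened into dicts — the data and field names here are illustrative, not from the post:

```python
def top_changes(prev, curr, n=5):
    """Rank dimension values by absolute change in a metric between
    two intervals. `prev` and `curr` map dimension value -> metric total,
    e.g. flattened results of two Druid groupBy queries."""
    keys = set(prev) | set(curr)
    deltas = {k: curr.get(k, 0) - prev.get(k, 0) for k in keys}
    # Sort by magnitude of change, largest first.
    return sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Example: page views per country for two adjacent weeks.
week1 = {"US": 1200, "DE": 800, "FR": 300}
week2 = {"US": 1500, "DE": 550, "FR": 310, "BR": 90}
print(top_changes(week1, week2, n=3))
# → [('US', 300), ('DE', -250), ('BR', 90)]
```

Druid itself answers each interval's query; the diff and ranking happen client-side in this sketch.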

Reliably utilizing Spark, S3 and Parquet: Everybody says ‘I love you’; not sure they know what that entails

Post after post has been written about the wonders of Spark and Parquet: how one can simply save RDDs/DataFrames in Parquet format to HDFS or S3. In many cases the job output is persisted to HDFS volumes located on the same machines as the Spark cluster. However, HDFS comes with a price: disk volume resources…
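The "simply save to S3" step the excerpt mentions usually amounts to one `DataFrameWriter` call through the `s3a://` connector. A hedged PySpark sketch — the bucket layout and the `events` dataset name are hypothetical, and the post's actual pipeline may differ:

```python
def write_events_parquet(df, bucket, dt):
    """Write a Spark DataFrame as Parquet straight to S3 via the s3a
    connector, instead of to cluster-local HDFS. The bucket layout and
    the 'events' dataset name are illustrative assumptions."""
    path = f"s3a://{bucket}/events/dt={dt}"
    df.write.mode("overwrite").parquet(path)
    return path

# Typical use from a PySpark job (requires hadoop-aws on the classpath
# and S3 credentials configured on the cluster):
#   df = spark.read.json("hdfs:///logs/2024-01-01")
#   write_events_parquet(df, "my-bucket", "2024-01-01")
```

Writing to S3 trades the local-disk pressure of HDFS for object-store semantics, which is exactly the tension the post's title hints at.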

Reporting Kafka Offsets To Datadog with Quantifind’s KafkaOffsetMonitor

There are multiple solutions for reporting Kafka offset information, such as the Kafka distribution's ConsumerOffsetChecker or LinkedIn's Burrow (which is more advanced and has built-in alerting and recovery logic). At my company we were already using Quantifind's KafkaOffsetMonitor, a very useful visualization tool for Kafka topics and their consumers. It visualizes the relation between…
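Getting an offset or lag figure into Datadog typically means emitting a gauge to the local Datadog agent over the DogStatsD UDP protocol. A minimal sketch of that last hop, assuming the lag value has already been obtained (e.g. from KafkaOffsetMonitor's data) — the metric and tag names are made up for illustration:

```python
import socket

def format_gauge(metric, value, tags):
    """Render a DogStatsD gauge line, e.g. consumer lag per topic/group.
    Wire format: <metric>:<value>|g|#<tag1>:<v1>,<tag2>:<v2>"""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return f"{metric}:{value}|g|#{tag_str}"

def send_lag(lag, topic, group, host="127.0.0.1", port=8125):
    """Ship one lag reading to the local Datadog agent over UDP."""
    line = format_gauge("kafka.consumer.lag", lag,
                        {"topic": topic, "consumer_group": group})
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(line.encode("ascii"), (host, port))
    sock.close()
```

UDP is fire-and-forget, which suits periodic metric reporting: a dropped datagram costs one data point, not a blocked consumer.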

Using MySQL tables for aggregations: Initialize your counters before you go-go

So far I have had a positive experience using MySQL for holding the results of aggregations. Firstly, it provides flexible and efficient indexing on the dimension columns. For me, this was a great improvement over Cassandra's built-in counter tables (we were using v2.0.7), which do not offer much flexibility for querying dimensions independently due to their key-based partitioning. Secondly, MySQL provides…
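The "initialize your counters" pattern in the title can be satisfied in one statement: an upsert that creates the counter row on first sight and increments it thereafter, so an increment never hits a missing row. In MySQL this is spelled `INSERT ... ON DUPLICATE KEY UPDATE`; the sketch below uses SQLite's equivalent `ON CONFLICT` clause only so it runs without a MySQL server, and the table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_counts (
        day  TEXT,
        dim  TEXT,
        hits INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (day, dim)
    )
""")

def increment(day, dim, n=1):
    # One statement both initializes the counter row and increments it.
    # MySQL equivalent:
    #   INSERT INTO daily_counts (day, dim, hits) VALUES (%s, %s, %s)
    #   ON DUPLICATE KEY UPDATE hits = hits + VALUES(hits)
    conn.execute(
        """INSERT INTO daily_counts (day, dim, hits) VALUES (?, ?, ?)
           ON CONFLICT(day, dim) DO UPDATE SET hits = hits + excluded.hits""",
        (day, dim, n),
    )

increment("2024-01-01", "mobile")
increment("2024-01-01", "mobile", 4)
row = conn.execute(
    "SELECT hits FROM daily_counts WHERE day = ? AND dim = ?",
    ("2024-01-01", "mobile"),
).fetchone()
print(row[0])  # 5
```

The composite primary key on the dimension columns is also what gives MySQL the flexible per-dimension indexing the paragraph contrasts with Cassandra's counters.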