Reliably utilizing Spark, S3 and Parquet: Everybody says ‘I love you’; not sure they know what that entails

Post after post has been written about the wonders of Spark and Parquet, and about how one can simply save RDDs/DataFrames in Parquet format to HDFS or S3. In many cases the job output is persisted to HDFS volumes located on the same machines as the Spark cluster. However, HDFS comes with a price: disk volume resources…
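As a minimal sketch of the "simply save" step the teaser refers to (assuming an already-configured SparkSession, the `hadoop-aws` S3A connector on the classpath, and a placeholder bucket name), writing a DataFrame out as Parquet might look like:

```scala
import org.apache.spark.sql.SparkSession

object ParquetToS3Sketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-to-s3-sketch")
      .getOrCreate()

    import spark.implicits._

    // Toy DataFrame standing in for real job output
    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // "my-bucket/path" is illustrative; the s3a:// scheme requires the
    // hadoop-aws connector plus AWS credentials in the environment.
    // Swap the URI for an hdfs:// path to target HDFS instead.
    df.write
      .mode("overwrite")
      .parquet("s3a://my-bucket/path/output")

    spark.stop()
  }
}
```

The one-liner simplicity is real, but as the post goes on to discuss, where that data lands (co-located HDFS vs. S3) carries operational trade-offs.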

Reporting Kafka Offsets To Datadog with Quantifind’s KafkaOffsetMonitor

There are multiple solutions for reporting Kafka offset information, such as the Kafka distribution's ConsumerOffsetChecker or LinkedIn's Burrow (which is more advanced, with built-in alerting and recovery logic). In my company we were already using Quantifind's KafkaOffsetMonitor. It's a very useful visualization tool for Kafka topics and their consumers. It visualizes the relation between…