Reliably utilizing Spark, S3 and Parquet: Everybody says ‘I love you’; not sure they know what that entails

Post after post has been written about the wonders of Spark and Parquet, and how one can simply save RDDs/DataFrames in Parquet format to HDFS or S3. In many cases the job output is persisted to HDFS volumes located on the same machines as the Spark cluster. However, HDFS comes with a price: Disk volume resources…
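By way of illustration, here is a minimal sketch of such a Parquet write in the Spark 1.x Scala API. The paths and bucket name are placeholders, and writing to an s3a:// URL assumes the hadoop-aws dependencies and AWS credentials are already configured on the cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch of "save as Parquet" (Spark 1.x API).
// Paths and bucket name are placeholders; writing to s3a:// assumes the
// hadoop-aws jars and AWS credentials are available to the cluster.
object ParquetWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-write-sketch"))
    val sqlContext = new SQLContext(sc)

    // Read some job output (JSON here, purely as an example) ...
    val df = sqlContext.read.json("hdfs:///data/events/")

    // ... and persist it as Parquet, either to HDFS or to S3.
    df.write.parquet("hdfs:///output/events-parquet/")
    df.write.parquet("s3a://my-bucket/events-parquet/")

    sc.stop()
  }
}
```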

Submit Spark jobs easily and programmatically via REST: Apache Spark’s little secret is now being served to you gift-wrapped

In our production environment we are currently running a Spark cluster in standalone mode (version 1.5.2). Coming from the world of Hadoop MapReduce, I find it a big improvement. However, one of the surprising things about this fast-growing framework is what is, in my opinion, an unclear design choice when it comes to submitting…
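As a taste of what the post covers, here is a hedged sketch of submitting a job to the standalone master’s (undocumented) REST endpoint. The host, port, jar path and class names are placeholders, and the request fields follow the CreateSubmissionRequest message used by Spark’s own RestSubmissionClient, so they may differ between versions:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Sketch: POST a submission request to the standalone master's REST port.
// Host, port, jar path, class name and arguments below are placeholders.
object RestSubmitSketch {
  def main(args: Array[String]): Unit = {
    val payload =
      """{
        |  "action": "CreateSubmissionRequest",
        |  "appResource": "hdfs:///jars/my-job.jar",
        |  "mainClass": "com.example.MyJob",
        |  "appArgs": ["--input", "hdfs:///data/in"],
        |  "clientSparkVersion": "1.5.2",
        |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
        |  "sparkProperties": {
        |    "spark.app.name": "my-job",
        |    "spark.master": "spark://spark-master:7077",
        |    "spark.jars": "hdfs:///jars/my-job.jar"
        |  }
        |}""".stripMargin

    val conn = new URL("http://spark-master:6066/v1/submissions/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes("UTF-8"))

    // The master answers with a JSON document containing the submission id and status.
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}
```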