Reliably utilizing Spark, S3 and Parquet: Everybody says ‘I love you’; not sure they know what that entails

Post after post has been written about the wonders of Spark and Parquet, and how one can simply save RDDs/DataFrames in Parquet format to HDFS or S3. In many cases the job output is persisted to HDFS volumes located on the same machines as the Spark cluster. However, HDFS comes with a price: Disk volume resources…
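By way of illustration, here is a minimal sketch of such a Parquet write in the Spark 1.x Scala API. The paths and bucket name are placeholders, and writing to an s3a:// URL assumes the hadoop-aws dependencies and AWS credentials are already configured on the cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch of "save as Parquet" (Spark 1.x API).
// Paths and bucket name are placeholders; writing to s3a:// assumes the
// hadoop-aws jars and AWS credentials are available to the cluster.
object ParquetWriteSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-write-sketch"))
    val sqlContext = new SQLContext(sc)

    // Read some job output (JSON here, purely as an example) ...
    val df = sqlContext.read.json("hdfs:///data/events/")

    // ... and persist it as Parquet, either to HDFS or to S3.
    df.write.parquet("hdfs:///output/events-parquet/")
    df.write.parquet("s3a://my-bucket/events-parquet/")

    sc.stop()
  }
}
```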

Submit Spark jobs easily and programmatically via REST: Apache Spark’s little secret is now being served to you gift-wrapped

In our production environment we are currently running a Spark cluster in standalone mode (version 1.5.2). Coming from the world of Hadoop MapReduce, I find it a big improvement. However, one of the surprising things about this fast-growing framework is what is, in my opinion, an unclear design choice when it comes to submitting…
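As a taste of what the post covers, here is a hedged sketch of submitting a job to the standalone master’s (undocumented) REST endpoint. The host, port, jar path and class names are placeholders, and the request fields follow the CreateSubmissionRequest message used by Spark’s own RestSubmissionClient, so they may differ between versions:

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

// Sketch: POST a submission request to the standalone master's REST port.
// Host, port, jar path, class name and arguments below are placeholders.
object RestSubmitSketch {
  def main(args: Array[String]): Unit = {
    val payload =
      """{
        |  "action": "CreateSubmissionRequest",
        |  "appResource": "hdfs:///jars/my-job.jar",
        |  "mainClass": "com.example.MyJob",
        |  "appArgs": ["--input", "hdfs:///data/in"],
        |  "clientSparkVersion": "1.5.2",
        |  "environmentVariables": { "SPARK_ENV_LOADED": "1" },
        |  "sparkProperties": {
        |    "spark.app.name": "my-job",
        |    "spark.master": "spark://spark-master:7077",
        |    "spark.jars": "hdfs:///jars/my-job.jar"
        |  }
        |}""".stripMargin

    val conn = new URL("http://spark-master:6066/v1/submissions/create")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes("UTF-8"))

    // The master answers with a JSON document containing the submission id and status.
    println(Source.fromInputStream(conn.getInputStream).mkString)
  }
}
```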