Much has been said about the difficulty of combining Apache Spark, the Hadoop libraries, and Amazon's AWS SDK. Take, as an example, reading from S3 using the s3a:// file system. If you have ever tried this setup, you know it is not a straightforward task: it seems like it should work out of the box, but it doesn't. Why should you care about using S3a at all? For two reasons. First, S3n is deprecated and will eventually be dropped – nobody wants to depend on deprecated software. Second, and more importantly, S3n has proven to be slow for reading from and writing to S3, while S3a should be considerably faster. So simply by upgrading to the new file system you can shorten the runtime of your Spark jobs dramatically. The Hadoop documentation explains this in more detail.
So, I hope that by now you are convinced of the importance of the switch from S3n to S3a.
In this thread and this post it is discussed how one can achieve the golden combination of libraries that allows using the S3a file system without Spark throwing all kinds of unexpected runtime errors – a rather colorful and annoying variety of NoClassDefFoundError and NoSuchMethodError.
However, the solutions suggested above force a compromise: either use Hadoop 3 (still in alpha) or a very old AWS SDK jar. I wanted a stable Hadoop version with a more modern SDK dependency, and I was able to achieve a working installation of Spark (2.0.2), Hadoop (2.7.3) and AWS SDK (1.10.77).
Steps to achieve this:
– Package a hadoop-aws version with an updated AWS SDK dependency – this requires signature changes in the original code – and add the resulting jar to the classpath. I uploaded my version to GitHub; it can simply be cloned and packaged with sbt. This jar does not include the S3 and S3n file systems, which, as mentioned before, are deprecated and will eventually be dropped.
– Download aws-java-sdk-s3 and aws-java-sdk-core and add them to the classpath – they are necessary for interacting with AWS.
– Remove the old AWS SDK jar from the classpath so it does not conflict with the jars added in the previous step (it's bundled with Spark in the jars/ folder).
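For the packaging step, the dependency section of the forked hadoop-aws build might look roughly like this in sbt. This is only a sketch – the version numbers follow the combination above, and the actual build file in the repository may differ:

```scala
// build.sbt sketch – versions are assumptions matching the post
// (Spark 2.0.2, Hadoop 2.7.3, AWS SDK 1.10.77)
name := "hadoop-aws-patched"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Hadoop itself is provided by the Spark distribution at runtime
  "org.apache.hadoop" % "hadoop-common"     % "2.7.3"   % "provided",
  // the modular SDK artifacts replacing the old monolithic aws-java-sdk
  "com.amazonaws"     % "aws-java-sdk-core" % "1.10.77",
  "com.amazonaws"     % "aws-java-sdk-s3"   % "1.10.77"
)
```

The "provided" scope keeps Hadoop classes out of the packaged jar, since Spark already ships them.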
I usually add the jars to the jars/ folder of the Spark distribution, as this folder is scanned automatically by Spark on startup.
The following Spark configurations should be set either by a property file or directly in code:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

val sc = new SparkContext(conf)
sc.hadoopConfiguration.set("fs.s3a.access.key", key)
sc.hadoopConfiguration.set("fs.s3a.secret.key", secret)
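Alternatively, the same settings can live in a property file. Keys prefixed with spark.hadoop. are copied by Spark into the Hadoop configuration, so an equivalent conf/spark-defaults.conf could look like this (the key values are placeholders):

```
spark.hadoop.fs.s3a.impl        org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key  YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key  YOUR_SECRET_KEY
```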
It’s worth noting that I found that when executing jobs on an EC2 instance, the S3a access key and secret key settings are ignored and the machine’s IAM profile is used instead, while on a regular non-EC2 machine the settings are actually picked up.
If all the setup steps were executed correctly, you should be able to use Spark’s native textFile() method with an s3a:// path rather than s3n://. I usually work with Parquet-formatted data on S3, and I can confirm that it works as well.
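As a quick sanity check, reading both plain text and Parquet over s3a can be sketched as follows. The bucket name and paths are hypothetical, and the jars and s3a settings described above are assumed to be in place:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is available from Spark 2.0 onward and wraps the SparkContext.
val spark = SparkSession.builder()
  .appName("s3a-smoke-test")
  .getOrCreate()
val sc = spark.sparkContext

// Plain text through the s3a file system – note the scheme in the URI.
val lines = sc.textFile("s3a://my-bucket/input/data.txt")
println(lines.count())

// Parquet goes through the same file system implementation.
val df = spark.read.parquet("s3a://my-bucket/input/events/")
df.printSchema()
```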
My next step would be to get Hadoop 2.8 and possibly a newer AWS SDK version working with Spark, as each brings many essential fixes and performance improvements that could have a significant effect on one’s batch processing pipeline.