In our production environment we currently run a Spark cluster in standalone mode (version 1.5.2). We reached a point where we needed to add the Amazon S3 Java SDK (1.10.39), following the advice given in this post. What should have been an easy task proved problematic: the SDK requires a newer version of the Jackson library (2.5.3) than the one supplied with the Spark distribution (2.4.4). At runtime this caused the job to fail with a MethodNotFoundException from the Jackson library. Although both versions were on the classpath, Spark's outdated Jackson version was picked up rather than the one I supplied in the job's uber JAR.
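One way to confirm which copy of a library wins at runtime is to ask the JVM where a class was loaded from. The following is only a minimal sketch, not something from the original setup: the object name `WhichJar` and the helper `locationOf` are hypothetical, and it assumes Jackson's `ObjectMapper` is on the classpath (any other fully qualified class name can be passed as the first argument).

```scala
object WhichJar {
  /** Returns the location (usually a JAR path) that a class was loaded from. */
  def locationOf(className: String): String = {
    val cls = Class.forName(className)
    // getCodeSource is null for classes loaded by the bootstrap class loader
    Option(cls.getProtectionDomain.getCodeSource)
      .flatMap(cs => Option(cs.getLocation))
      .map(_.toString)
      .getOrElse("(no code source, e.g. a JDK class)")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: default to Jackson's ObjectMapper when no class name is given
    val name = args.headOption.getOrElse("com.fasterxml.jackson.databind.ObjectMapper")
    println(s"$name loaded from ${locationOf(name)}")
  }
}
```

Running the same check from inside the job itself should show whether the class came from the Spark distribution or from the uber JAR.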
That basically left me with a few options:
- Downgrade to an older SDK version that uses Jackson 2.4.4: that did not work because of a bug with the Java version we were using, which is fixed only in a later version.
- Create our own Spark distribution that uses the needed Jackson version: possible but tiresome, and I assumed our DevOps guys would not be thrilled about having to support our own Spark distribution.
- Set the experimental configuration property spark.driver.userClassPathFirst to true: this gives the JARs provided by the job priority over the ones supplied with Spark. I stumbled upon this feature in the documentation, and it sounded like a good fit for the problem, so I tried it. Alas, my uber JAR was providing more dependencies than I intended it to, and new exceptions started popping up from collisions with other dependencies. That experience, together with the ominous 'experimental' label, convinced me this was not the way to go.
Then I researched dependency conflicts a bit more and learned about shading dependencies during the assembly stage. I had heard about it before but had never given it much thought. For those of you who have not heard about it either, you can read about it in the official Maven documentation. Fortunately, SBT now has a shading feature as well.
So by upgrading our assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
and adding this rule to our SBT build
// There is a conflict between the fasterxml version needed by aws (2.5.3)
// and the one that comes with the spark JAR (2.4)
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.**" -> "shadeio.@1").inAll
)
I was able to shade the version of Jackson our job needs while Spark kept using its own unshaded version, so each can live in peace with its own requirements. This solution obviously generalizes to any dependency collision. As Spark moves forward (while leaving its dependencies behind), it is quite likely that other projects will end up using shading as well.
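With the rule above, a class such as com.fasterxml.jackson.databind.ObjectMapper should end up in the assembled JAR as shadeio.jackson.databind.ObjectMapper, because `@1` stands for whatever `**` matched. A quick way to check that the rename actually happened is to count the class entries under each prefix. This is only a sketch under stated assumptions: the object name `ShadeCheck` is hypothetical, and the path to the assembled JAR is expected as the first argument.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

object ShadeCheck {
  def main(args: Array[String]): Unit = {
    // Assumption: args(0) is the path to the JAR produced by `sbt assembly`
    val jar = new JarFile(args(0))
    try {
      // Collect the names of all class files packed into the JAR
      val classNames = jar.entries().asScala
        .map(_.getName)
        .filter(_.endsWith(".class"))
        .toList

      // Classes renamed by the shade rule live under shadeio/
      val shaded = classNames.count(_.startsWith("shadeio/"))
      // Any classes left here were not renamed
      val unshaded = classNames.count(_.startsWith("com/fasterxml/"))

      println(s"classes under shadeio/: $shaded")
      println(s"classes under com/fasterxml/: $unshaded")
    } finally {
      jar.close()
    }
  }
}
```

If the shading worked as intended, the first count should be positive and the second should be zero.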
By the way, for those of you using Holdenk's neat spark-testing-base: you must also exclude the colliding dependencies manually from the testing base dependency. This is easy to do, and it is needed because the shading rules only take effect at assembly time:
libraryDependencies += ("com.holdenkarau" % "spark-testing-base_2.10" % "1.5.1_0.2.1" % "test")
  .exclude("com.fasterxml.jackson.core", "jackson-annotations")
  .exclude("com.fasterxml.jackson.core", "jackson-core")
  .exclude("com.fasterxml.jackson.core", "jackson-databind")