In our production environment we are currently running a Spark cluster in standalone mode (version 1.5.2). Coming from the world of Hadoop MapReduce, I find it a big improvement over the former.
However, one surprising thing about this fast-growing framework is what I consider an unclear design choice when it comes to submitting actual jobs to the cluster. In short, you submit a job by executing a shell script supplied with the Spark bundle, passing it the arguments the run needs, such as the location of the JAR containing the job code, the cluster master's host address, etc.
So, on the plus side, this shell script can be run from any machine to submit to any cluster; on the minus side, the script must be installed somewhere, possibly on a machine outside the Spark cluster that otherwise has nothing to do with Spark.
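For reference, such a submission can also be triggered programmatically by wrapping the shell script. A minimal sketch in Python, where the Spark home, master address, JAR path, and main class are all placeholders you would adjust to your installation:

```python
import subprocess

# Placeholder paths and addresses -- adjust to your installation.
SPARK_HOME = "/opt/spark-1.5.2"
MASTER = "spark://spark-master.example.com:7077"
JOB_JAR = "/path/to/my-spark-job.jar"
MAIN_CLASS = "com.example.MySparkJob"

def build_submit_command():
    """Assemble the spark-submit invocation as an argument list."""
    return [
        SPARK_HOME + "/bin/spark-submit",
        "--master", MASTER,          # the cluster master's address
        "--deploy-mode", "cluster",  # run the driver inside the cluster
        "--class", MAIN_CLASS,       # entry point of the job
        JOB_JAR,                     # JAR containing the job code
    ]

if __name__ == "__main__":
    print(" ".join(build_submit_command()))
    # To actually submit (requires spark-submit on this machine):
    # subprocess.check_call(build_submit_command())
```

Note that this still requires the Spark bundle to be installed on the submitting machine, which is exactly the drawback described above.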
In our project we needed to trigger Spark jobs on various cron schedules fired from our central management platform (which has nothing to do with our Spark data-processing cluster). We were looking for an easy way to integrate with our Spark cluster and submit our jobs periodically. In comes Spark's internal REST API, which I read about here for the first time. In a nutshell, Spark comes with an internal (yet interactable) REST server on your Spark master.
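To give a feel for what talking to that server involves, here is a sketch of a submission request. The master's REST server listens on port 6066 by default, and a job is submitted by POSTing a CreateSubmissionRequest to /v1/submissions/create; the host, JAR path, and class name below are placeholders, and since the API is unofficial the exact fields may differ between Spark versions:

```python
import json
from urllib import request

# Placeholder addresses and paths -- adjust to your cluster.
MASTER_REST_URL = "http://spark-master.example.com:6066"  # REST port, not the 7077 RPC port
JOB_JAR = "file:/path/to/my-spark-job.jar"
MAIN_CLASS = "com.example.MySparkJob"

def build_submission(app_args=()):
    """Payload for POST /v1/submissions/create on the master's REST server."""
    return {
        "action": "CreateSubmissionRequest",
        "appResource": JOB_JAR,          # JAR containing the job code
        "mainClass": MAIN_CLASS,
        "appArgs": list(app_args),
        "clientSparkVersion": "1.5.2",
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "sparkProperties": {
            "spark.master": "spark://spark-master.example.com:6066",
            "spark.app.name": "MySparkJob",
            "spark.jars": JOB_JAR,
            "spark.submit.deployMode": "cluster",
            "spark.driver.supervise": "false",
        },
    }

def submit(app_args=()):
    """POST the submission; a successful response carries a submissionId."""
    req = request.Request(
        MASTER_REST_URL + "/v1/submissions/create",
        data=json.dumps(build_submission(app_args)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return json.load(request.urlopen(req))
```

The submissionId returned here is what you later use to poll or kill the job.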
In order to simplify its usage I wrote a small client around the calls to this server. The documentation explains how to use it. In short, you create a single client that the whole application can (potentially) share while issuing multiple requests in parallel. Once a client has been created, it can be used to issue many requests: to submit jobs, check their status, and kill them. One thing to remember is that the REST API might be subject to changes by the Spark team, since it's still not officially released; I will incorporate changes as necessary. The client can be downloaded from Maven Central.
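The status and kill operations the client wraps map onto two further endpoints on the same REST server. A rough sketch of those two calls, where the master address is a placeholder and the paths, being unofficial, may change between Spark versions:

```python
import json
from urllib import request

# Placeholder master address -- adjust to your cluster's REST endpoint.
MASTER_REST_URL = "http://spark-master.example.com:6066"

def status_url(submission_id):
    """URL to GET the state of a submitted driver (e.g. RUNNING, FINISHED)."""
    return MASTER_REST_URL + "/v1/submissions/status/" + submission_id

def kill_url(submission_id):
    """URL to POST to in order to ask the master to kill the driver."""
    return MASTER_REST_URL + "/v1/submissions/kill/" + submission_id

def submission_status(submission_id):
    return json.load(request.urlopen(status_url(submission_id)))

def kill_submission(submission_id):
    # An empty POST body is enough for the kill request.
    return json.load(request.urlopen(request.Request(kill_url(submission_id), data=b"")))
```

A periodic scheduler then only needs HTTP access to the master, with no Spark installation on the submitting machine, which is what made this approach attractive for our cron-driven setup.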