Submitting to a Spark Cluster

Configuration

Let's start by altering the sbt configuration so that you can submit to the cluster. Note that some of these changes will prevent you from running your program with runMain in sbt, though you can still run it in Eclipse with a local master (for reasons that I don't understand).

    1. Add % "provided" to the end of all of your library dependencies for Spark. This prevents conflicts between the Spark you compile with the Spark on the master. Note, this is what breaks the use of runMain, so you can undo it if you want to use runMain.

    2. Include sbt-assembly. This just requires putting one line in one file in the project directory (see the sketch after this list).
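
As a concrete illustration, here is roughly what those two changes look like. This is a sketch, not the exact files from class: the Spark version, the set of Spark modules, and the sbt-assembly version are assumptions, so match them to whatever your project already uses.

// build.sbt (sketch): mark the Spark dependencies as "provided"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.2.0" % "provided"
)

// project/assembly.sbt (sketch): the one line that pulls in sbt-assembly
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")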

Handling Master

Previously, we set the master in our code with a call to setMaster. This locks you into a specific master. When submitting, you often want to leave that off and specify the master from the command line when you run. I have started a cluster running on the Pandora machines with the master on Pandora00. You can go to http://pandora00:8080/ to see the status of the cluster.

To specify that you want to use this cluster, make the master equal to spark://pandora00:7077. You can specify this in your code, or with the --master option of spark-shell or spark-submit.
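
For example, here is a minimal sketch of setting the master in code, assuming the SparkConf/SparkContext API (the application name is arbitrary):

import org.apache.spark.{SparkConf, SparkContext}

// Leave off the setMaster call when you want to choose the
// master with --master at submit time instead.
val conf = new SparkConf().setAppName("HelloSpark").setMaster("spark://pandora00:7077")
val sc = new SparkContext(conf)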

Making a Fat JAR

In order to submit to a server, you need to package all of your code and the libraries it depends on into a "fat JAR". The normal package command makes a JAR file that just contains the compiled class files for your code. The reason for including sbt-assembly is that it provides another command, assembly, that produces a JAR file that includes not only your compiled code, but also the compiled code for all the libraries you use. That is why it is called a fat JAR. For submitting to a server, you actually don't want this to include the Spark libraries, which is why we added "provided" to those libraries in the build.sbt.

So in sbt, run the "assembly" command and you will see it print messages saying it is building the fat JAR for the project. It will also tell you the name of the fat JAR, which you need to know for the next step.
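
If you would rather run it from the regular command line than from the sbt prompt, the equivalent is:

sbt assembly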

spark-submit

Finally, you are ready to submit. spark-submit is the main program that you use to submit jobs to clusters in Spark. You can see a full page of documentation at https://spark.apache.org/docs/latest/submitting-applications.html. You will need to tell it the object to run with the --class option. If you didn't specify the master in your code, you need to specify it here with --master. The last argument is the name of the JAR your code is in. For example, I submitted my in-class code project with the following command.

spark-submit --class HelloSpark --master spark://pandora00:7077 target/scala-2.11/CSCI3395-F17-InClass-assembly-0.1.0-SNAPSHOT.jar
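
Conversely, if you did set the master in your code (as in the sketch above), you should be able to leave off --master and just give the class and the JAR:

spark-submit --class HelloSpark target/scala-2.11/CSCI3395-F17-InClass-assembly-0.1.0-SNAPSHOT.jar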

Graphics/Plotting

In my experiments, the ScalaFX-based graphics just work. The SwiftVis2 plots I have in the in-class code popped up on my screen when I ran the above command on Xena00.