Running Multi-Node Spark with Singularity on Greene

Make a directory within your home directory for your analysis (mkdir ~/myanalysis). Spark will also use this directory to write its configuration files and logs.

Navigate to that directory using Linux's cd command.
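For example, assuming the directory name myanalysis from the step above (substitute your own name):

mkdir ~/myanalysis
cd ~/myanalysis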

Copy the example Spark submission script to your analysis directory:

cp /scratch/work/public/apps/pyspark/3.1.2/examples/spark/cluster/run-spark.bash .

Copy the example SLURM submission script to your analysis directory:

cp /scratch/work/public/apps/pyspark/3.1.2/examples/spark/cluster/run-spark-singularity.sbatch .

By default, run-spark.bash looks like the following:

#!/bin/bash
source /scratch/work/public/apps/pyspark/3.1.2/scripts/spark-setup-slurm.bash
start_all
spark-submit \
    --master $SPARK_URL \
    --executor-memory $MEMORY \
    wordcount.py /scratch/work/public/apps/pyspark/3.1.2/examples/shakespeare-8G.txt
stop_all

The script first sources spark-setup-slurm.bash, which provides the commands used to start and stop a standalone Spark cluster. It then starts the cluster with start_all, submits a Spark job with spark-submit, and shuts the cluster down with stop_all once the job finishes. In the spark-submit command, the SPARK_URL variable points to your Spark cluster's master node.

You will now need to edit this script so that the spark-submit command runs your own Spark application instead of the example wordcount application.
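For example, an edited run-spark.bash might look like the sketch below. Here myapp.py and its input path are placeholders for your own application and data (not files provided by the cluster); everything else is unchanged from the example script:

#!/bin/bash
# Load the helper commands for starting/stopping the Spark standalone cluster
source /scratch/work/public/apps/pyspark/3.1.2/scripts/spark-setup-slurm.bash
# Start the Spark cluster on the allocated nodes
start_all
# Submit your own application (myapp.py and the input path are placeholders)
spark-submit \
    --master $SPARK_URL \
    --executor-memory $MEMORY \
    myapp.py /scratch/$USER/mydata/input.txt
# Shut the cluster down once the job finishes
stop_all

Once run-spark.bash points at your application, the job can be submitted to SLURM using the run-spark-singularity.sbatch script copied earlier.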