Running Multi-Node Spark with Singularity on Greene
Make a directory within your home directory for your analysis (mkdir ~/myanalysis). Spark will also write its configuration files and logs to this directory.
Navigate to that directory using Linux's cd command.
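In other words (the directory name myanalysis is just an example; any name works):

mkdir ~/myanalysis   # create the analysis directory in your home directory
cd ~/myanalysis      # work from inside it for the remaining steps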
Copy the example Spark submission script to your analysis directory:
cp /scratch/work/public/apps/pyspark/3.1.2/examples/spark/cluster/run-spark.bash .

Copy the example SLURM submission script to your analysis directory:

cp /scratch/work/public/apps/pyspark/3.1.2/examples/spark/cluster/run-spark-singularity.sbatch .

By default, run-spark.bash looks like the following:
#!/bin/bash

source /scratch/work/public/apps/pyspark/3.1.2/scripts/spark-setup-slurm.bash
start_all
spark-submit \
    --master $SPARK_URL \
    --executor-memory $MEMORY \
    wordcount.py /scratch/work/public/apps/pyspark/3.1.2/examples/shakespeare-8G.txt
stop_all
The script first sources spark-setup-slurm.bash, which provides the commands used to start and stop a Spark standalone cluster. The cluster is then started with start_all, the Spark job is submitted with spark-submit, and the cluster is shut down with stop_all once the job finishes. In the spark-submit command, the SPARK_URL variable holds the address of your Spark cluster's master node.
Next, edit this script so that the spark-submit command runs your own Spark application rather than the wordcount example.
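For illustration only, suppose your application is a file named myapp.py in your analysis directory and reads an input file from your scratch space (both names are placeholders, not files that exist on Greene); the edited run-spark.bash might then look roughly like this:

#!/bin/bash

# the sourced script provides start_all, stop_all, and cluster variables used below
source /scratch/work/public/apps/pyspark/3.1.2/scripts/spark-setup-slurm.bash

# start the standalone Spark cluster on the allocated nodes
start_all

# run your own application instead of the wordcount example;
# myapp.py and the input path are placeholders for your own files
spark-submit \
    --master $SPARK_URL \
    --executor-memory $MEMORY \
    myapp.py /scratch/$USER/myanalysis/input-data.csv

# shut the cluster down once the job finishes
stop_all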
Edit the copy of run-spark-singularity.sbatch within your directory so that the #SBATCH options reflect the resources that you wish to request. Note that currently only whole node allocations are supported, so the line starting with #SBATCH --nodes= must be present.
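The resource-related directives will vary with your job; the example file that ships with pyspark/3.1.2 is the authoritative starting point, but a whole-node request for two nodes might use a header along these lines (all values are illustrative, not the actual contents of the example file):

#!/bin/bash
#SBATCH --job-name=spark-job     # any name you like
#SBATCH --nodes=2                # must be present; only whole-node allocations are supported
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48       # assumption: request all cores on each node; match Greene's node specs
#SBATCH --mem=0                  # request all memory on each node
#SBATCH --time=04:00:00          # wall-clock limit for the whole Spark cluster
# ...followed by the command that launches run-spark.bash inside the Singularity
# container; keep that part as it appears in the example file.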
Type sbatch run-spark-singularity.sbatch from within your analysis directory to start your Spark cluster and run your job.
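Submission and basic monitoring from the analysis directory might look like the following (the job ID shown is made up; your output file name depends on your #SBATCH settings):

cd ~/myanalysis
sbatch run-spark-singularity.sbatch   # SLURM replies with something like: Submitted batch job 1234567
squeue -u $USER                       # confirm the job is pending (PD) or running (R)
tail -f slurm-1234567.out             # follow the job output; slurm-<jobid>.out is SLURM's default name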