Running Multi-Node Spark with Singularity on Greene
Make a directory within your home directory for your analysis (mkdir ~/myanalysis). Spark will also write its configuration files and logs to this directory.
Navigate to that directory using Linux's cd command.
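In other words (the directory name myanalysis is just an example; any name works):

mkdir ~/myanalysis   # create the analysis directory in your home directory
cd ~/myanalysis      # work from inside it for the remaining steps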
Copy the example Spark submission script to your analysis directory:
cp /scratch/work/public/apps/pyspark/3.1.2/examples/spark/cluster/run-spark.bash .

Copy the example SLURM submission script to your analysis directory:

cp /scratch/work/public/apps/pyspark/3.1.2/examples/spark/cluster/run-spark-singularity.sbatch .

By default, run-spark.bash looks like the following:
#!/bin/bash

source /scratch/work/public/apps/pyspark/3.1.2/scripts/spark-setup-slurm.bash
start_all
spark-submit \
    --master $SPARK_URL \
    --executor-memory $MEMORY \
    wordcount.py /scratch/work/public/apps/pyspark/3.1.2/examples/shakespeare-8G.txt
stop_all
The script first sources spark-setup-slurm.bash, which provides the commands used to start and stop a Spark standalone cluster. The cluster is then started with start_all, the Spark job is submitted with spark-submit, and the cluster is shut down with stop_all once the job finishes. In the spark-submit command, the SPARK_URL variable holds the address of your Spark cluster's master node.
Next, edit this script so that the spark-submit command runs your own Spark application rather than the wordcount example.
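For illustration only, suppose your application is a file named myapp.py in your analysis directory and reads an input file from your scratch space (both names are placeholders, not files that exist on Greene); the edited run-spark.bash might then look roughly like this:

#!/bin/bash

# the sourced script provides start_all, stop_all, and cluster variables used below
source /scratch/work/public/apps/pyspark/3.1.2/scripts/spark-setup-slurm.bash

# start the standalone Spark cluster on the allocated nodes
start_all

# run your own application instead of the wordcount example;
# myapp.py and the input path are placeholders for your own files
spark-submit \
    --master $SPARK_URL \
    --executor-memory $MEMORY \
    myapp.py /scratch/$USER/myanalysis/input-data.csv

# shut the cluster down once the job finishes
stop_all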
Edit the copy of run-spark-singularity.sbatch within your directory so that the #SBATCH options reflect the resources that you wish to request. Note that currently only whole node allocations are supported, so the line starting with #SBATCH --nodes= must be present.
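The resource-related directives will vary with your job; the example file that ships with pyspark/3.1.2 is the authoritative starting point, but a whole-node request for two nodes might use a header along these lines (all values are illustrative, not the actual contents of the example file):

#!/bin/bash
#SBATCH --job-name=spark-job     # any name you like
#SBATCH --nodes=2                # must be present; only whole-node allocations are supported
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=48       # assumption: request all cores on each node; match Greene's node specs
#SBATCH --mem=0                  # request all memory on each node
#SBATCH --time=04:00:00          # wall-clock limit for the whole Spark cluster
# ...followed by the command that launches run-spark.bash inside the Singularity
# container; keep that part as it appears in the example file.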
Type sbatch run-spark-singularity.sbatch from within your analysis directory to start your Spark cluster and run your job.
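Submission and basic monitoring from the analysis directory might look like the following (the job ID shown is made up; your output file name depends on your #SBATCH settings):

cd ~/myanalysis
sbatch run-spark-singularity.sbatch   # SLURM replies with something like: Submitted batch job 1234567
squeue -u $USER                       # confirm the job is pending (PD) or running (R)
tail -f slurm-1234567.out             # follow the job output; slurm-<jobid>.out is SLURM's default name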