Running Spark in an Interactive Jupyter Notebook

The Spark Standalone Cluster application allows you to bring up a Spark instance on Greene and interact with it live.

Connect to the Open OnDemand Web Dashboard.  In the dropdown menu in the header, select Interactive Apps > Spark Standalone Cluster.  Set the cluster's parameters using the provided form, and then click Launch.

You will be automatically redirected to an info card for the session, or you can find it again under My Interactive Sessions. Once the cluster has finished starting up, the session card will display the following links:

The Host link opens a shell on the remote machine running your cluster.

The Standalone Cluster Web UI link provides monitoring and usage information, as well as the cluster's SPARK_URL for directing batch jobs as outlined in Running Multi-Node Spark with Singularity.

The Jupyter Notebook Environment link takes you to your home or scratch file directory on Greene, where you will be able to create or open Jupyter notebooks that run against your standalone cluster.

For code examples, see Big Data Tutorial: Spark.
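As a quick orientation, here is a minimal sketch of a notebook cell that attaches a SparkSession to the standalone cluster. The spark://<host>:7077 master URL is a placeholder for the SPARK_URL shown on the Standalone Cluster Web UI, and the application name is arbitrary; if your notebook session already provides a preconfigured SparkSession or SparkContext, use that instead.

# Minimal sketch: attach a SparkSession to the standalone cluster.
# Replace the master URL with the SPARK_URL from the Standalone Cluster Web UI.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://<host>:7077")  # placeholder SPARK_URL
    .appName("notebook-example")    # arbitrary application name
    .getOrCreate()
)

# Quick check that the cluster executes work.
print(spark.range(10).count())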

Adding custom Python libraries to the Spark Standalone Cluster

Build an overlay with your Python libraries

Adding Python libraries to the Spark Standalone Cluster requires a custom overlay that is compatible with the existing Spark Standalone images.

If you have built a Miniconda environment before, e.g. by following the PyTorch Example tutorial, this process should be familiar. However, instead of building an installation from scratch, we will start by cloning an existing pyspark environment and installing additional packages on top of it. Finally, before saving, we will clean up files and packages in the overlay that might overwrite data in the Spark Standalone images.

Create a working directory.

mkdir /scratch/<NetID>/pyspark
cd /scratch/<NetID>/pyspark

Copy an appropriately sized overlay image. In this example we use overlay-15GB-500K.ext3.gz, which has 15 GB of free space inside and can hold up to 500K files.

cp -rp /scratch/work/public/overlay-fs-ext3/overlay-15GB-500K.ext3.gz .
gunzip overlay-15GB-500K.ext3.gz
mv overlay-15GB-500K.ext3 pyspark.ext3

Launch a Singularity container to open the overlay in read/write mode.

singularity exec --overlay pyspark.ext3:rw /scratch/work/public/singularity/ubuntu-20.04.3.sif /bin/bash

Next, inside the container, download and install Miniconda to /ext3/miniconda3.

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
# rm Miniconda3-latest-Linux-x86_64.sh  # if you don't need this file any longer

Create a wrapper script to activate your base conda environment.  Open /ext3/env.sh and write the following:

#!/bin/bash
source /ext3/miniconda3/etc/profile.d/conda.sh
export PATH=/ext3/miniconda3/bin:$PATH
export PYTHONPATH=/ext3/miniconda3/bin:$PATH

Activate the conda environment, update, and clean. Finally, exit singularity to continue the installation on a compute node.

source /ext3/env.sh
conda update -n base conda -y
conda clean --all --yes
# Exit Singularity
exit

Clone a pyspark environment and install additional packages

First, start an interactive job with adequate compute and memory resources to install packages. This allows you to install large packages that might otherwise fail on the login node, which has a 2 GB memory limit.

srun --cpus-per-task=2 --mem=10GB --time=04:00:00 --pty /bin/bash

Once a node is assigned, open your overlay in read/write mode and activate the base environment.

singularity exec --overlay pyspark.ext3:rw /scratch/work/public/singularity/ubuntu-20.04.3.sif /bin/bash
source /ext3/env.sh

Now, clone an existing pyspark environment to use as a base. This command uses the file flag -f to create a new conda environment from an existing config file. The prefix flag -p creates the environment in its own top-level directory, /ext3/pyspark. Make sure you use the same pyspark version here that you intend to use when starting up the Spark Standalone Cluster.

cp /scratch/work/public/apps/pyspark/<version>/pyspark.yml .
conda env create -f pyspark.yml -p /ext3/pyspark

Activate the new environment and install the libraries you need using conda or pip.

conda activate /ext3/pyspark
# Ex: install pydub to handle audio files
pip install pydub

To double-check your installation, you can open an interactive Python session within the container and confirm that your custom libraries are installed with the correct versions.

python3
>>> import pyspark
>>> print(pyspark.__version__)
3.3.0
>>> exit()
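You can check extra packages the same way. As a small sketch, assuming you installed the pydub example above, the standard-library importlib.metadata (Python 3.8+) reports the installed version; the version shown below is only illustrative.

python3
>>> from importlib.metadata import version
>>> version("pydub")
'0.25.1'
>>> exit()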

Once you are happy with your python environment, clean up files in /ext3 that would conflict with the default Spark Standalone images. This is necessary to make your overlay compatible with the Spark Standalone Cluster.

rm /ext3/env.sh
rm -R /ext3/miniconda3/

As a final optional step, compress your overlay into a squashfs filesystem to save space.

# Exit the ubuntu singularity instance
exit
# Compress the overlay
singularity exec --overlay pyspark.ext3:ro /scratch/work/public/singularity/centos-8.2.2004.sif mksquashfs /ext3 pyspark.sqf -keep-as-directory -all-root

Add the overlay when initializing Spark Standalone Cluster

When launching a Spark Standalone Cluster from the Open OnDemand dashboard, include the path to your custom overlay at the bottom of the form. You should then be able to access your custom libraries from any Jupyter notebook run against the cluster. The same overlay can be reused in as many cluster instances as needed.
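As a final sanity check, a notebook cell along these lines (assuming the pydub example package from above and an active SparkSession named spark) verifies that the overlay's libraries can be imported both on the driver and inside tasks running on the cluster:

# Driver-side check: the custom library from the overlay is importable.
import pydub
print(pydub.__name__)

# Executor-side check: import the library inside tasks on each partition.
def worker_import(_):
    import pydub  # fails here if the workers cannot see the overlay
    return pydub.__name__

print(spark.sparkContext.parallelize(range(4), 4).map(worker_import).distinct().collect())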