SLURM Scheduler
SLURM (Simple Linux Utility for Resource Management) is a job scheduler and workload manager.
It manages access to compute servers (nodes), provides a framework for running workloads (usually parallel jobs) on those nodes, and manages a queue of pending jobs contending for resources. For a more detailed introduction to SLURM, refer to the official SLURM documentation.
SLURM is a critical component for a large computational resource such as the Coeus and Gaia HPC clusters. Despite the "Simple" in its name, SLURM is a fairly complicated tool, and may require some familiarization to accomplish your desired workflow.
Getting up to speed...
Initially your goal is just to get your process running.
Make sure your software runs. You can do a quick test on the login nodes, but only a quick test. Remember: don't run compute jobs on the login nodes!
Make sure you've loaded the appropriate environment modules.
Create a working directory on the /scratch volume and make sure your application can write to it.
Do you have all configuration, data, and other files in the appropriate working directories?
Will you be running a batch process or interactively? Both are supported.
Batch processes are "fire-and-forget" and don't require interaction until the process is done. The scheduler also supports multi-step batch processes.
Interactive sessions allow you to manually run commands and can be run either through a shell or a graphics interface.
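As a sketch of a multi-step batch process, an sbatch script can chain several srun invocations, each recorded as a separate job step. The preprocess, solve, and postprocess executables below are hypothetical placeholders for your own programs:

```shell
#!/bin/bash
#SBATCH --job-name multi_step
#SBATCH --nodes 1
# Each srun invocation below becomes a separate job step,
# so accounting tools can report resource usage per stage.
srun ./preprocess    # Step 0: prepare input data (placeholder).
srun ./solve         # Step 1: main computation (placeholder).
srun ./postprocess   # Step 2: summarize results (placeholder).
```

Steps run in order, and the job ends when the last step finishes.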
Then make sure you can run it with the scheduler, using the proper input data and collecting all required output. Lastly, make sure your process runs efficiently, using system resources as effectively as possible. The typical steps for getting a computational process set up on the cluster are:
Edit an sbatch submit script to send your job to the scheduler (refer to examples below)
Test your submit script
Does your sbatch script load the correct modules?
Do you get slurm errors when submitting an sbatch job?
Are application output and error files going to your working directory?
Optimize your job submission, if necessary. Check out this FAQ on how to analyze your job performance
Also, remember to cancel test jobs!
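The test-and-cancel cycle above can be sketched with a few commands on a cluster login node (the job ID and output file name are illustrative, not literal):

```shell
# Submit the script; sbatch prints the assigned job ID.
sbatch sub_simple.sh
# Check the state of your jobs (PD = pending, R = running).
squeue -u $USER
# Inspect the output file once the job has run (name depends on your script).
cat simple_123456.txt
# Cancel a test job you no longer need, by job ID.
scancel 123456
```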
Common SLURM Commands
There are a number of commands users regularly employ when working with the scheduler.
sbatch - Submit a job script to the scheduler for execution. This script typically contains one or more srun commands to launch batch jobs or mpiexec commands to launch parallel tasks on compute nodes. One advantage of the sbatch script is that it's self-documenting.
squeue - Reports the state of jobs or job steps. This is useful for checking what's in the current job queue, especially before submitting a larger job that uses many nodes. To show only the jobs of a single user, use squeue -u <username>.
scancel - Allows you to cancel a pending or running job.
sinfo - This reports the state of partitions and nodes managed by Slurm. There are a number of filtering, sorting, and formatting options.
salloc - Request an allocation. The user can then run any program available on the allocated nodes. Typically this is used for processes not readily run as batch jobs, such as interactive sessions.
mpiexec - This command is used to launch a parallel MPI process. Typically this will be included in an sbatch script.
srun - Initiate commands on compute nodes in real time or in an sbatch script. Typically this is used to run non-MPI processes on compute nodes, but is also used with MPICH parallel jobs (refer to examples below).
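A few common invocations of these commands, using standard SLURM options (the partition name medium is taken from the examples on this page):

```shell
# List nodes one per line with detailed state information.
sinfo -N -l
# Show only the jobs queued in a specific partition.
squeue -p medium
# Request a one-hour interactive allocation of a single node.
salloc --nodes 1 --time 1:00:00
```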
Example SBATCH Submission Scripts
The sbatch command is used to submit a job script to the scheduler for execution. This script typically contains one or more srun commands to launch batch jobs or mpiexec commands to launch parallel tasks. This is a shell script, so you can execute any command you could run in a batch script. Special sbatch directives are flagged with the #SBATCH prefix.
If you're running an MPI binary, launch it with mpiexec (preferred) or mpirun.
When running non-MPI programs, or multiple different programs inside a single batch submission, srun is the appropriate launcher.
For more on --nodes, --ntasks, --ntasks-per-node, and related options, refer to the SLURM Parallelism link at the bottom of this page.
Simple SBATCH example
File: sub_simplest.sh
#!/bin/bash # Required.
#SBATCH --job-name simple # Set the name that shows up in squeue.
#SBATCH --nodes 2 # Use 2 nodes.
srun hostname # hostname will print system name.
# If 'srun' is omitted, this will only run on one node.
# So this script will print the system name of each of the 2 nodes it runs on.
File: sub_simple.sh
#!/bin/bash
#SBATCH --job-name simple
#SBATCH --nodes 2
# %j in the file name expands to the job ID. Useful for distinguishing multiple runs.
#SBATCH --output simple_%j.txt # Send the standard output to simple_<job ID>.txt
#SBATCH --error simple_%j.err # Send the error output to simple_<job ID>.err
srun hostname
File: sub_matlab.sh
#!/bin/bash
## super simple matlab example
#SBATCH --job-name myjob
#SBATCH --nodes 1
#SBATCH --partition medium # Put this job in the medium partition
# (medium is also the default if no partition is specified).
#SBATCH --output myjob_%j.txt
#SBATCH --error myjob_%j.err
module load General/matlab/R2018a # Load the Matlab module.
srun matlab -nodisplay -nojvm -r mymatlab # Run the mymatlab script without a display or the Java VM.
Simple MPICH example
This is a simple "Hello World" MPI example
File: sub_mpi_hello.sh
#!/bin/bash
#SBATCH --job-name mpi_hello
#SBATCH --nodes 2 # Use 2 nodes.
#SBATCH --ntasks-per-node 20 # Launch 20 tasks on each node.
#SBATCH --time 10:00 # Set the maximum time that the job can run.
#SBATCH --output mpi_hello_%j.txt
#SBATCH --error mpi_hello_%j.err
module load mpich/gcc
srun --mpi=pmi2 mpi_hello
MPICH is an exception to the general launcher rules: launch MPICH binaries with srun --mpi=pmi2 (as in the example above) rather than mpiexec or mpirun.
Simple MVAPICH2-2.2 MPI example
This is a simple "Hello World" MPI example
File: sub_mpi_hello.sh
#!/bin/bash
#SBATCH --job-name mpi_hello
#SBATCH --nodes 2
#SBATCH --ntasks-per-node 20
#SBATCH --time 10:00
#SBATCH --output mpi_hello_%j.txt
#SBATCH --error mpi_hello_%j.err
## Load the MPICH for GCC-8.2.0 module
module load mvapich2-2.2-psm/gcc-8.2.0
mpiexec ./mpi_hello
Simple MPI Python example (using mpi4py)
This example uses the mpi4py package. To install this package, you will first need to create a virtual environment to contain the project, and use pip to install the mpi4py package there. For more on package management and virtual environments, refer to the Virtual Environments How-To.
Python code (hello_world_mpi.py):
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
print("hello world from process ", rank)
Submission script (hello_mpi_python.sh):
#!/bin/bash
#SBATCH --job-name hello_world_mpi
#SBATCH --time 00:10:00
#SBATCH --nodes 2
#SBATCH --output hello_world_mpi_py.txt
#SBATCH --ntasks 4 # Allocate 4 tasks in total.
module load Python/gcc/3.7.5/gcc-6.3.0
mpiexec -np 4 python3 hello_world_mpi.py # Launch 4 copies of the script, which
# communicate with each other via MPI.
Program output (hello_world_mpi_py.txt):
hello world from process 0
hello world from process 2
hello world from process 1
hello world from process 3
Generic SBATCH Script Guideline
This is a suggestion to help make writing an SBATCH script easier - anything in <> is to be replaced with custom information.
#!/bin/bash
#SBATCH --job-name <Give your script a name.>
#SBATCH --partition <Select your partition.>
<Place your --nodes, --ntasks, and similar here.>
<For more, refer to below for the page on SLURM Parallelism.>
#SBATCH --output <Standard output file>.txt
#SBATCH --error <Error output file>.err
module load <Whatever module(s) that will be needed, if any.>
<Specify what to run - use mpiexec for MPI jobs or srun for independent jobs.>
A Deeper Examination of Parallelism with SLURM
Visit the page on SLURM Parallelism for information on --nodes, --ntasks, --ntasks-per-node, and more.
Job Arrays
According to the Slurm Job Array Documentation, “job arrays offer a mechanism for submitting and managing collections of similar jobs quickly and easily.” In general, job arrays are useful for applying the same processing routine to a collection of multiple input data files. Job arrays offer a very simple way to submit a large number of independent processing jobs.
A specified number of array tasks will be created by submitting a single job array sbatch script.
For more on job arrays, visit the Job Arrays page.
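As a minimal sketch, a job array is created with the --array directive, and each task reads its own index from the SLURM_ARRAY_TASK_ID environment variable. The process_data program and input file names below are hypothetical:

```shell
#!/bin/bash
#SBATCH --job-name array_demo
#SBATCH --array 0-9                # Create 10 array tasks, indices 0 through 9.
#SBATCH --output array_%A_%a.txt   # %A = array job ID, %a = array task index.
# Each task processes a different input file chosen by its index.
srun ./process_data input_${SLURM_ARRAY_TASK_ID}.dat
```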
Running Interactive processes on compute nodes
There are times that you may want to run an interactive session on a compute node. For example, you may want to use the MATLAB graphical interface or command line interface.
Submitting Interactive Jobs using SLURM's salloc command
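A minimal interactive session looks like the following, assuming the medium partition used elsewhere on this page; adjust node count, partition, and time limit to your needs:

```shell
# Request an interactive allocation: 1 node in the medium partition for 1 hour.
salloc --nodes 1 --partition medium --time 1:00:00
# Once the allocation is granted, start an interactive shell on the compute node.
srun --pty bash
# ...run commands interactively, e.g. matlab -nodisplay...
# Exit the shell to release the compute node, then exit salloc to end the allocation.
exit
exit
```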