A structured reference for daily work on SLURM‑controlled high‑performance computing (HPC) clusters.
SLURM is an open-source workload manager for Linux clusters that handles job scheduling, resource allocation, and workload distribution across compute nodes.
The following terms come up throughout this reference:
Node
A physical or virtual machine providing CPUs, memory, GPUs, and similar resources; usually a single server.
Partition
A logical grouping of nodes with common limits, priority, and/or hardware specs.
Job
A user‑defined workload managed by SLURM. (see https://slurm.schedmd.com/job_launch.html)
Step
A sub‑task inside a job (e.g., MPI rank group). (see https://slurm.schedmd.com/SLUG24/Step-Management.pdf)
Array Job
A mechanism for submitting and managing collections of similar jobs quickly and easily. All jobs in an array share the same script and initial options (e.g., size, time limit) and are usually distinguished by the index in $SLURM_ARRAY_TASK_ID, as sketched directly below. (see https://slurm.schedmd.com/job_array.html)
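As a minimal sketch (process.py and the chunk_*.csv inputs are placeholders), an sbatch script becomes an array job by adding an --array directive and reading the per-task index from $SLURM_ARRAY_TASK_ID:
#SBATCH --array=0-9 # <- ten array tasks, indexed 0..9
#SBATCH --output=array_job_%A_%a.log # <- %A = array job ID, %a = array task index
python process.py --input "chunk_${SLURM_ARRAY_TASK_ID}.csv" # <- each task processes its own chunk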
SLURM schedules jobs onto nodes based on the requested resources and on policies such as fair-share and QoS.
For a full quick-start overview please see https://slurm.schedmd.com/quickstart.html.
Show information about partitions, availability, and their assigned nodes:
sinfo
List all nodes, one per line, with their state and partition:
sinfo -N
List the number of nodes, CPU cores, memory, and node state by partition:
sinfo -o "%P %D nodes, %C cores, %m mem, %T state"
Check the entire queue of running and pending jobs:
squeue
Check only your jobs:
squeue -u $USER
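Filter your jobs by state, e.g., show only running or only pending ones:
squeue -u $USER -t RUNNING
squeue -u $USER -t PENDING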
Display detailed information about a job:
scontrol show job <jobid>
Cancel a specific job:
scancel <jobid>
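Cancel all of your own jobs at once:
scancel -u $USER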
Modify a submitted job, e.g., reset its time limit:
scontrol update jobid=<jobid> TimeLimit=03:00:00
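Put a pending job on hold and release it again later:
scontrol hold <jobid>
scontrol release <jobid>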
Submit a job script (recommended):
sbatch path/to/your_slurm_job_script.sh
Run a program directly from the terminal (srun blocks until the job finishes):
srun --mem=200G --cpus-per-task=2 --ntasks=1 ./quick_but_memory_intensive_task.sh
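Request an interactive shell on a compute node (--pty attaches a pseudo-terminal; adjust the resource requests to your needs):
srun --mem=8G --cpus-per-task=2 --ntasks=1 --pty bash
A complete job script for sbatch could look like the following; user names, paths, the partition, the GPU type, and the conda environment are placeholders to adapt to your cluster and project: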
#!/bin/bash
#SBATCH --ntasks-per-node=1 # <- one task, i.e., parallel process per node
#SBATCH --nodes=1 # <- number of nodes, i.e., number of servers to request
#SBATCH --cpus-per-task=32 # <- how many CPU cores to allocate per task
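#SBATCH --time=24:00:00 # <- (assumed value) wall-clock limit in [DD-]HH:MM:SS; many clusters require or cap this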
#SBATCH --partition=all # <- set a filter for a specific group of nodes; "all" means any node (no filter)
#SBATCH --gres=gpu:L40:1 # <- request one L40 GPU
#SBATCH --output="/local/your-user/logs/jobs/job_%j.log" # <- re-direct stdout logs
#SBATCH --error="/local/your-user/logs/jobs/job_%j.error" # <- re-direct stderr logs
echo "device=$CUDA_VISIBLE_DEVICES" # <- execute arbitrary commands, e.g., print the available GPU(s)
source /home/your-user/miniconda3/bin/activate # <- make conda available (or activate any other Python environment)
conda activate your-conda-env # <- start your project-specific environment
cd /home/your-user/your-repository # <- let the script change into your project's repository
export MODEL="edgenext" # <- export arbitrary environment variables
export EPOCHS=500 # <- another example variable
# run your script -- expose the SLURM job ID for logging purposes (e.g., you could store this in wandb)
python train.py --batch-size 16 --num-workers 32 --slurm-job-id $SLURM_JOB_ID
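Once the job has finished, the accounting database (if enabled on your cluster) reports what it actually consumed; a minimal sketch using sacct:
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State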