A structured reference for daily work on SLURM‑controlled high‑performance computing (HPC) clusters.
SLURM is an open-source workload manager for Linux clusters that handles job scheduling, resource allocation, and workload distribution across compute nodes.
The following terms come up throughout this reference:
Node
A physical or virtual machine providing CPUs, memory, GPUs, and similar resources; usually a single server.
Partition
A logical grouping of nodes with common limits, priority, and/or hardware specs.
Job
A user‑defined workload managed by SLURM. (see https://slurm.schedmd.com/job_launch.html)
Step
A sub‑task inside a job (e.g., MPI rank group). (see https://slurm.schedmd.com/SLUG24/Step-Management.pdf)
Array Job
A mechanism for submitting and managing collections of similar jobs quickly and easily. All jobs in an array share the same script and initial options (e.g., size, time limit) and are usually distinguished by the index in $SLURM_ARRAY_TASK_ID, as sketched directly below. (see https://slurm.schedmd.com/job_array.html)
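As a minimal sketch (process.py and the chunk_*.csv inputs are placeholders), an sbatch script becomes an array job by adding an --array directive and reading the per-task index from $SLURM_ARRAY_TASK_ID:
#SBATCH --array=0-9 # <- ten array tasks, indexed 0..9
#SBATCH --output=array_job_%A_%a.log # <- %A = array job ID, %a = array task index
python process.py --input "chunk_${SLURM_ARRAY_TASK_ID}.csv" # <- each task processes its own chunk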
SLURM schedules jobs onto nodes based on the requested resources and on policies such as fair-share and QoS.
For a full quick-start overview please see https://slurm.schedmd.com/quickstart.html.
Show information about partitions, availability, and their assigned nodes:
sinfo
List all nodes, one per line, with their state and partition:
sinfo -N
List the number of nodes, CPU cores, memory, and node state by partition:
sinfo -o "%P %D nodes, %C cores, %m mem, %T state"
Check the entire queue of running and pending jobs:
squeue
Check only your jobs:
squeue -u $USER
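Filter your jobs by state, e.g., show only running or only pending ones:
squeue -u $USER -t RUNNING
squeue -u $USER -t PENDING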
Display detailed information about a job:
scontrol show job <jobid>
Cancel a specific job:
scancel <jobid>
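Cancel all of your own jobs at once:
scancel -u $USER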
Modify a submitted job, e.g., reset its time limit:
scontrol update jobid=<jobid> TimeLimit=03:00:00
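Put a pending job on hold and release it again later:
scontrol hold <jobid>
scontrol release <jobid>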
Submit a job script (recommended):
sbatch path/to/your_slurm_job_script.sh
Run a program directly from the terminal (srun blocks until the job finishes):
srun --mem=200G --cpus-per-task=2 --ntasks=1 ./quick_but_memory_intensive_task.sh
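Request an interactive shell on a compute node (--pty attaches a pseudo-terminal; adjust the resource requests to your needs):
srun --mem=8G --cpus-per-task=2 --ntasks=1 --pty bash
A complete job script for sbatch could look like the following; user names, paths, the partition, the GPU type, and the conda environment are placeholders to adapt to your cluster and project: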
#!/bin/bash
#SBATCH --ntasks-per-node=1 # <- one task, i.e., parallel process per node
#SBATCH --nodes=1 # <- number of nodes, i.e., number of servers to request
#SBATCH --cpus-per-task=32 # <- how many CPU cores to allocate per task
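#SBATCH --time=24:00:00 # <- (assumed value) wall-clock limit in [DD-]HH:MM:SS; many clusters require or cap this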
#SBATCH --partition=all # <- set a filter for a specific group of nodes; "all" means any node (no filter)
#SBATCH --gres=gpu:L40:1 # <- request one L40 GPU
#SBATCH --output="/local/your-user/logs/jobs/job_%j.log" # <- re-direct stdout logs
#SBATCH --error="/local/your-user/logs/jobs/job_%j.error" # <- re-direct stderr logs
echo "device=$CUDA_VISIBLE_DEVICES" # <- execute arbitrary commands, e.g., print the available GPU(s)
source /home/your-user/miniconda3/bin/activate # <- make conda available (or activate any other Python environment)
conda activate your-conda-env # <- start your project-specific environment
cd /home/your-user/your-repository # <- let the script change into your project's repository
export MODEL="edgenext" # <- export arbitrary environment variables
export EPOCHS=500 # <- another example variable
# run your script -- expose the SLURM job ID for logging purposes (e.g., you could store this in wandb)
python train.py --batch-size 16 --num-workers 32 --slurm-job-id $SLURM_JOB_ID
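Once the job has finished, the accounting database (if enabled on your cluster) reports what it actually consumed; a minimal sketch using sacct:
sacct -j <jobid> --format=JobID,JobName,Elapsed,MaxRSS,State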