Batch jobs are submitted with the sbatch command. As with qsub in Torque, we write a bash script describing the job's requirements: what resources are needed, what software and processing to run, how much memory and how many CPUs to request, where to send the job's standard output and error, and so on. After a job is submitted, Slurm finds suitable resources, schedules and drives the job's execution, and reports the outcome back to the user, who can then return to look at the output files.
Example-1:
In the first example, we create a small bash script, run it locally, then submit it as a job to Slurm using sbatch, and compare the results.
$ mkdir -p /scratch/$USER/mytest1
$ cd /scratch/$USER/mytest1
$ cat > simple1.sh
#!/bin/bash
hostname
date
sleep 20
date

$ chmod +x simple1.sh

# This is just for demo purpose. Real work should be submitted
# to Slurm to run on computing nodes.
$ ./simple1.sh
log-1
Mon Feb 6 15:34:52 EST 2017
Mon Feb 6 15:35:12 EST 2017

$ sbatch simple1.sh
Submitted batch job 22140
$ cat slurm-22140.out
c17-01
Mon Feb 6 15:35:21 EST 2017
Mon Feb 6 15:35:41 EST 2017

Example-2:
Follow the recipe below to submit a job; it can be used later as an example for practicing how to check job status. In my test it ran for about 7 minutes.
$ cd /scratch/$USER/mytest1
$ cp /share/apps/Tutorials/slurm/example/run-matlab.s .
$ cp /share/apps/Tutorials/slurm/example/thtest.m .
$ sbatch run-matlab.s
Submitted batch job 11615

Below is the content of the bash script "run-matlab.s" just used in the job submission:
#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --time=1:00:00
#SBATCH --mem=4GB
#SBATCH --job-name=myMatlabTest
#SBATCH --mail-type=END
##SBATCH --mail-user=bob.smith@nyu.edu
#SBATCH --output=slurm_%j.out

module purge
module load matlab/2020b

cd /scratch/$USER/mytest1
cat thtest.m | srun matlab -nodisplay

The job has been submitted successfully, and as the example box shows, its job ID is 11615. Usually we should let the scheduler decide which nodes to run jobs on. If there is a need to request a specific set of nodes, use the --nodelist directive, e.g. '#SBATCH --nodelist=c09-01,c09-02'.
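The same node constraint can also be given on the sbatch command line instead of inside the script. A minimal sketch, reusing the node names and script from the example above:

$ sbatch --nodelist=c09-01,c09-02 run-matlab.s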
The sinfo command reports the cluster status, by default listing all the partitions. Partitions group computing nodes into logical sets that serve various purposes, such as interactive work, visualization and batch processing.
A partition is a group of nodes. A partition can be made up of nodes with a specific feature or functionality, such as nodes equipped with GPU accelerators (the gpu partition). A partition can also have specific parameters, such as how long its jobs may run. Partitions can therefore be thought of as the "queues" of other batch systems. Partitions may overlap.
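To run in a particular partition, name it in the batch script (or pass 'sbatch --partition=...' on the command line). Below is a minimal sketch targeting the gpu partition; whether the --gres line is needed, and what values it takes, depends on how GPUs are configured on your cluster:

#!/bin/bash
#SBATCH --partition=gpu        # run in the gpu partition instead of the default
#SBATCH --gres=gpu:1           # request one GPU (assumes GPUs are set up as GRES)
#SBATCH --time=00:10:00
#SBATCH --mem=2GB

nvidia-smi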
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
c01_25*      up 1-00:00:00      4    mix c13-[01-04]
c01_25*      up 1-00:00:00     95   idle c01-[01-04],c02-[01-04],c03-[01-04],c04-[01-04],c05-[01-04],c06-[01-04],c07-[01-04],c08-[01-04],c09-[01-04],c10-[01-04],c11-[01-04],c12-[01-04],c14-[01-04],c15-[01-04],c16-[01-04],c17-[01-03],c18-[01-04],c19-[01-04],c20-[01-04],c21-[01-04],c22-[01-04],c23-[01-04],c24-[01-04],c25-[01-04]
c26          up 1-00:00:00     16   idle c26-[01-16]
c27          up 1-00:00:00     16   idle c27-[01-16]
gpu          up 1-00:00:00      2    mix gpu-[01-02]
gpu          up 1-00:00:00      7   idle gpu-[03-09]

sinfo by default prints information aggregated by partition and node state. As shown above, there are four partitions, namely c01_25, c26, c27 and gpu. The partition marked with an asterisk is the default one. Apart from the two lines with node state 'mix', which means some CPU cores are occupied, all nodes are idle.
Here are two useful sinfo examples: the first lists the idle nodes in the gpu partition; the second outputs information in a node-oriented format.
$ sinfo -p gpu -t idle
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 1-00:00:00      5   idle gpu-[05-09]

$ sinfo -lNe
Mon Jan 16 15:05:49 2017
NODELIST   NODES PARTITION  STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
c01-01         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c01-02         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c01-03         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c01-04         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c02-01         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c02-02         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c02-03         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c02-04         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c03-01         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
c03-02         1   c01_25*   idle   28 2:14:1 128826    61889      1   (null) none
......

The squeue command lists jobs that are running, pending, completing, etc. It can also display only the jobs owned by a specific user, or a specific job ID.
Run 'man sinfo' or 'man squeue' to see the explanations for the results.
With the job ID in hand, we can track the job's status through its lifetime. The job first appears in the Slurm queue in the PENDING state. When the required resources become available and the job's priority gives it its turn to run, resources are allocated and the job transitions to the RUNNING state. If the job runs to the end and completes successfully, it goes to the COMPLETED state; otherwise it ends up in the FAILED state. Use 'squeue -j <jobID>' to check a job's status.
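For example, to check on the MATLAB job submitted earlier (job ID 11615), or to list all of your own jobs:

$ squeue -j 11615
$ squeue -u $USER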
Most of the columns in the output of the squeue command are self-explanatory.
The column "ST" in the middle is the job status, which can be :
PD - pending: waiting for resource allocation
S - suspended
R - running
F - failed: non-zero exit code or other failures
CD - completed: all processes terminated with zero exit code
CG - completing: in the process of completing; some processes may still be alive
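These state codes can be used to filter the queue listing, for example to show only your pending jobs or only your running jobs:

$ squeue -u $USER -t PENDING
$ squeue -u $USER -t RUNNING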
The column "NODELIST(REASON)" in the end is job status due to the reason(s), which can be :
JobHeldUser: the job is held by its owner
Priority: higher priority jobs exist
Resources: waiting for resources to become available
BeginTime: start time not reached yet
Dependency: waiting for a job it depends on to finish
QOSMaxCpuPerUserLimit: the per-user CPU core limit has been reached
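For jobs that are still pending, squeue can also report the scheduler's current estimate of when they will start; the estimate depends on the rest of the workload and may change:

$ squeue -u $USER -t PENDING --start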
With the -o option you may select which columns squeue displays and their widths; the width is an integer placed between the %. and the column letter, e.g. %.10M.
$ squeue -j 9874 -o "%.18i %.9P %.8j %.8u %.8T %.10M %.9l %.6D %R %m"
             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON) MIN_MEMORY
              9874    c01_25 model_ev     johd  RUNNING      23:31   1:00:00      4 c13-[01-04] 2000M

Run the command sstat to display various information about a running job/step. Run the command sacct to check accounting information of jobs and job steps in the Slurm log or database. Both commands have a '--helpformat' option that lists the available output columns.
$ sstat -j 23221 -o JobID,NodeList,Pids,MaxRSS,AveRSS,MaxVMSize
       JobID             Nodelist                 Pids     MaxRSS     AveRSS  MaxVMSize
------------ -------------------- -------------------- ---------- ---------- ----------
23221.0      c03-04               158503                 3462088K   2681124K   8357328K

$ sacct -j 7050 --format JobID,jobname,NTasks,nodelist,MaxRSS,MaxVMSize,AveRSS,AveVMSize
       JobID    JobName   NTasks        NodeList     MaxRSS  MaxVMSize     AveRSS  AveVMSize
------------ ---------- -------- --------------- ---------- ---------- ---------- ----------
7050         mpiexec-t+             c17-[01-03]
7050.batch        batch        1          c17-01    149112K    208648K    149112K    113260K
7050.extern      extern        3     c17-[01-03]          0      4316K          0      4316K
7050.0            orted        2     c17-[02-03]    141016K    370880K    140024K    370868K

Type "man <command>" to look up detailed usage on the manual pages of squeue, sstat and sacct.
Things can go wrong, or turn out in an unexpected way. Should you decide to terminate a job before it finishes, scancel is the tool to use.
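A few common forms (11615 is the example job ID from above):

$ scancel 11615
$ scancel -u $USER               # cancel all of your jobs
$ scancel -u $USER -t PENDING    # cancel only your pending jobs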
Job results include the job execution logs (standard output and error) and, of course, any output data files defined when submitting the job. Log files are created in the working directory, and output data files in the directory you specified. Examine the log files with a text viewer or editor to gain a rough idea of how the execution went. Open the output data files to see exactly what results were generated. Run the sacct command to see resource usage statistics. Should you decide that the job needs to be rerun, submit it again with sbatch, with a modified batch script and/or updated execution configuration. Iteration is one characteristic of a typical data analysis!
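A quick way to review a finished job, reusing the example job ID 11615 (replace it with your own); the listed fields are standard sacct format fields:

$ sacct -j 11615 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,ReqMem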
To get the list of SLURM_* variables, you may run a job to check, e.g. srun sh -c 'env | grep SLURM | sort'. The command 'man sbatch' explains what these variables stand for. Below are a few frequently used ones (a small example script follows the list):
SLURM_JOB_ID - the job ID
SLURM_SUBMIT_DIR - the job submission directory
SLURM_SUBMIT_HOST - name of the host from which the job was submitted
SLURM_JOB_NODELIST - names of nodes allocated to the job
SLURM_ARRAY_TASK_ID - the job array task index
SLURM_JOB_CPUS_PER_NODE - CPU cores on this node allocated to the job
SLURM_NNODES - number of nodes allocated to the job
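A minimal sketch of a batch script that just prints the variables listed above into its log file:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:05:00
#SBATCH --job-name=envDemo

echo "Job ID:            $SLURM_JOB_ID"
echo "Submit directory:  $SLURM_SUBMIT_DIR"
echo "Submit host:       $SLURM_SUBMIT_HOST"
echo "Allocated nodes:   $SLURM_JOB_NODELIST"
echo "CPUs on this node: $SLURM_JOB_CPUS_PER_NODE"
echo "Number of nodes:   $SLURM_NNODES"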
The current Slurm implementation uses Linux Control Groups (cgroups) for resource containment. If needed, see the cgroups documentation at kernel.org for a detailed description.
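To see where a job step lands in the cgroup hierarchy, you can inspect /proc/self/cgroup from inside a job. This is only a rough check; the output format differs between cgroup v1 and v2 and depends on how the cluster configures cgroup confinement:

$ srun --mem=1GB --time=00:02:00 sh -c 'cat /proc/self/cgroup'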
If you get the correct outputs, you can safely ignore the warning message "slurmstepd: error: Exceeded job memory limit at some point". You can also check the job's exit state (e.g. with sacct) to confirm. For reference, there is some explanation in the related Slurm bug report.