Job Control

Introduction

This page describes some of the Slurm commands related to job control.

List of Commands

SLURM offers a number of helpful commands for tasks ranging from job submission and monitoring to modifying resource requests for jobs that have already been submitted to the queue. Below is a list of SLURM commands.

sbatch

The sbatch command is used for submitting jobs to the cluster. It accepts a number of options either from the command line or (more typically) from a batch script. An example of a SLURM batch script (called simple.slurm) is shown below:

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-00:15:00     # 15 minutes
#SBATCH --output=my.stdout
#SBATCH --mail-user=abac123@case.edu
#SBATCH --mail-type=ALL
#SBATCH --job-name="just_a_test"

# Put commands for executing job below this line
# This example loads the Python module (2.7.8) and then prints the Python version
module load python
python --version

To submit this batch script, a user would type:

sbatch simple.slurm

This job (called just_a_test) requests 1 compute node, 1 task (by default, SLURM will assign 1 CPU core per task), 1 GB of RAM per CPU core, and 15 minutes of wall time (the maximum time the job is allowed to run). Note that these are the defaults for any job, but it is good practice to include these lines in a SLURM script so they are easy to adjust when you need to request additional resources.

Optionally, any #SBATCH line may be replaced with an equivalent command-line option. For instance, an #SBATCH --ntasks=1 line could be removed from the script and the option specified on the command line instead:

sbatch --ntasks=1 simple.slurm

The commands needed to execute a program must be included beneath all #SBATCH directives. Lines beginning with the # symbol (other than #!/bin/bash and #SBATCH) are comment lines that are not executed by the shell. The example above simply prints the version of Python loaded in a user's path. It is good practice to include any module load commands in your SLURM script. A real job would likely do something more complex than the example above, such as reading in a Python file for processing by the Python interpreter.

For more information about sbatch see: http://slurm.schedmd.com/sbatch.html

squeue

squeue is used for viewing the status of jobs. By default, squeue will output the following information about currently running jobs and jobs waiting in the queue: Job ID, Partition, Job Name, User Name, Job Status, Run Time, Node Count, and Node List. There are a large number of command-line options available for customizing the information provided by squeue. Below is a list of examples:


squeue is similar to the "showq" or "qstat" commands of other schedulers:

 squeue -u sxg125

output:        

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
661587     batch     bash   sxg125  R      22:21      1 comp150t

Note the job ID (661587), the status of the job (R -> running), and the compute node (comp150t) on which the job is running.

To see details such as why your job is in the PD (pending) state, or on which node your job is running, use the sq alias:

sq

output:

730814     batch slurm.sl    bga11 PD       0:00      1   4   6950 (AssocMaxWallDurationPerJobLimi
730815     batch slurm.sl    bga11 PD       0:00      1   4   6950 (AssocMaxWallDurationPerJobLimi
..
989833     batch 3DClasse   txh310  R   21:59:56     16 240   3044 comp145t,comp146t,comp147t,comp149t,comp151t,comp154t,comp156t,comp157t,comp158t,comp159t,comp179t,comp185t,comp186t,comp187t,comp191t,comp192t
992383     batch job_chec  sxl1036  R    1:43:13      2  16   3007 comp122t,comp123t

To also show the start and end times of jobs:

squeue -u <CaseID> -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S %e"

output: 

JOBID PARTITION     NAME     USER ST       TIME  NODES START_TIME END_TIME
676101     batch      JOB   sxg125 PD       0:00      1 2016-04-09T15:25:21
606057     batch      JOB   sxg125  R 8-01:08:45      1 2016-03-31T14:17:02 2016-04-31T14:17:02
606056     batch      JOB   sxg125  R 8-01:10:16      1 2016-03-31T14:15:31 2016-03-31T14:15:31

Job 676101 is estimated to start on April 9 at 15:25, and job 606057 is scheduled to end on April 31 at 14:17 (the END_TIME column).

Filtering squeue output through awk can be useful, for example, to isolate entries that share a group name:

squeue -o "%A %C %e %E %g %l %m %N %T %u" | awk 'NR==1 || /eecs600/'

output:

JOBID CPUS END_TIME DEPENDENCY GROUP TIME_LIMIT MIN_MEMORY NODELIST STATE USER
148137 1 2016-01-26T16:54:22 eecs600 2:00:00 1900 comp145t RUNNING aar93
148146 1 2016-01-27T01:14:27 eecs600 10:00:00 1900 comp148t RUNNING hxs356

Note the job status for the users in the group eecs600.

For more information about squeue see: http://slurm.schedmd.com/squeue.html

sacct

This command is used for viewing information for completed jobs. This can be useful for monitoring job progress or diagnosing problems that occurred during job execution. By default, sacct will report Job ID, Job Name, Partition, Account, Allocated CPU Cores, Job State, and Exit Code for all of the current user’s jobs that completed since midnight of the current day. Many options are available for modifying the information output by sacct:

The --format option is particularly useful, as it allows a user to customize the output of job usage statistics. We suggest creating an alias for running a customized version of sacct. For instance, the Elapsed and Timelimit fields allow for a comparison of allocated vs. actual wall time. MaxRSS and MaxVMSize show the maximum RAM and virtual memory usage for a job, respectively, while ReqMem reports the amount of RAM requested.
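For example, an alias like the one below (the name "sacct-usage" is arbitrary, not a site-provided command) bundles these fields together; add it to your ~/.bashrc to make it permanent:

alias sacct-usage='sacct --format=JobID,JobName,Elapsed,Timelimit,MaxRSS,MaxVMSize,ReqMem,State'

It can then be used like sacct itself:

sacct-usage -j <jobID>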

To see the status of your job, use the command below. Note that your executable should be preceded by the srun command in your batch script, for both serial and MPI executables, so that per-step statistics are recorded.

sacct -o JobID,JobName,AveCPU,AvePages,AveRSS,MaxRSSNode,AveVMSize,NTasks,State,ExitCode -j <jobID>

output:   

       JobID    JobName     AveCPU   AvePages     AveRSS MaxRSSNode  AveVMSize   NTasks      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ---------- --------
1013605        v2o5band                                                                  COMPLETED      0:0
1013605.bat+      batch   00:00:00          0      6244K   comp162t    308544K        1  COMPLETED      0:0

For more information about sacct see: http://slurm.schedmd.com/sacct.html

scancel 

scancel cancels (kills) a job. With the -i option it asks for confirmation first.

Example:

scancel -i 681457

prompt:

Cancel job_id=681457 name=bash partition=batch [y/n]? y

srun: Force Terminated job 681457

To cancel all of the jobs belonging to a CaseID:

scancel -u <caseID>

scontrol

scontrol is used for monitoring and modifying queued jobs. One of its most powerful options is scontrol show job. scontrol is also used for holding and releasing jobs. Below are some useful scontrol commands:
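For instance, to hold a queued job so it will not start, and later release it (the job ID is a placeholder):

scontrol hold <jobID>

scontrol release <jobID>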

Example of scontrol show job:

scontrol show job 136355

output:

JobId=136355 JobName=xxxxx
   UserId=xxxx(yyyy) GroupId=xxx(yyy)
   Priority=3007 Nice=0 Account=gray QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=20:07:27 TimeLimit=13-07:00:00 TimeMin=N/A
   SubmitTime=2016-01-18T15:37:55 EligibleTime=2016-01-18T15:37:55
   StartTime=2016-01-18T15:37:56 EndTime=2016-01-31T22:37:56
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=batch AllocNode:Sid=hpctest:39249
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=comp148t
   BatchHost=comp148t
   NumNodes=1 NumCPUs=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=48G MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/xxxx/AAA.sh
   WorkDir=/home/xxxx/BBB
   StdErr=/home/xxxx/OOO.o
   StdIn=/dev/null
   StdOut=/home/xxx/OOOO.o
   Power= SICP=0

If the job is pending, it will show the reason for pending as well:

...

JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:gpu017t,gpu018t,gpu019t,gpu020t,gpu021t,gpu022t,gpu023t,gpu024t) Dependency=(null)

Here, the job is waiting for resources; the GPU nodes are listed because they are currently offline.

SLURM command for showing information about a node:

scontrol show node comp009t

output:

NodeName=comp009t Arch=x86_64 CoresPerSocket=1
   CPUAlloc=1 CPUErr=0 CPUTot=12 CPULoad=0.96 Features=hex24gb
   Gres=(null)
   NodeAddr=comp009t NodeHostName=comp009t Version=15.08
   OS=Linux RealMemory=23000 AllocMem=16384 Sockets=12 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A
   BootTime=2016-03-02T13:58:01 SlurmdStartTime=2016-03-17T08:26:18
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Here, the total number of CPUs (CPUTot) is 12, and the real memory (RealMemory) is 23000 MB (~23 GB).

For more information about scontrol see: http://slurm.schedmd.com/scontrol.html

srun

srun can be used to run interactive jobs, with or without graphics:

srun --x11 -N 1 -c 2 --time=1:00:00 --pty /bin/bash

This will allocate one task with two CPU cores on a single node for 1 hour, with X11 forwarding enabled so graphical windows can be opened.

This command can also be used to launch a parallel job step. Typically, srun is invoked from a SLURM job script to launch an MPI job (much in the same way that mpirun or mpiexec are used). More details about running MPI jobs within SLURM are provided below. Please note that your application must include MPI code in order to run in parallel across multiple CPU cores using srun. Invoking srun on a non-MPI command or executable will result in the program being run independently on each CPU core in the allocation (i.e., N identical copies rather than one parallel job).
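A minimal sketch of a batch script that uses srun to launch an MPI program (the module name and executable are placeholders; substitute whatever your site and code actually provide):

#!/bin/bash
#SBATCH -N 2
#SBATCH --ntasks-per-node=4
module load openmpi       # placeholder: load your site's MPI module
srun ./my_mpi_program     # launches 8 MPI ranks (2 nodes x 4 tasks per node)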

Alternatively, srun can be run directly from the command line on a gateway, in which case srun will first create a resource allocation for running the parallel job. The -n [NUM_TASKS] option specifies the number of tasks (by default, one CPU core each) for the parallel job step. For example, running the following command from the command line will obtain an allocation consisting of 16 CPU cores and then run the command hostname across these cores:

srun -n 16 hostname

For more information about srun see: http://www.schedmd.com/slurmdocs/srun.html

sinfo

sinfo allows users to view information about SLURM nodes and partitions. A partition is a set of nodes defined by the cluster administrator. Below are a few example uses of sinfo:

Note: If you want detailed output equivalent to "showq" and "mdiag -n", use the following.

si

(si is an alias for: sinfo -a -o "%P %a %l %D %N %C")

output:

PARTITION AVAIL TIMELIMIT NODES NODELIST CPUS(A/I/O/T)
smp up 13-08:00:00 2 smp04t,smp05t 1/71/0/72

Here, (A/I/O/T) represents "allocated/idle/other (offline/down)/total". For the total allocation across the cluster, use the command "sc":

sc

output:

CPUS(A/I/O/T) 318/1166/20/1504 Utilization: 21.1436%

Equivalent to "mdiag -n":

sinfo -p batch -Nle -o '%n %C %t'

or use the siall alias:

siall

output:

NODELIST   AVA  TIMELIMIT NODE  CPUS(A/I/O/T) CPU_LOAD  MEMORY FEATURES REASON
comp001t    up 13-08:00:0    1       3/9/0/12     1.87   23000 hex24gb none
comp002t    up 13-08:00:0    1       9/3/0/12     3.00   23000 hex24gb none
...

To see the reasons why nodes are down, drained, or failing:

sinfo -R

For more information about sinfo see: http://slurm.schedmd.com/sinfo.html

If you want to check your group's allocation and the resources used by other members of the group, use the information (i) command:

i

output:

****Your SLURM's CPU Quota****

                 xxx      256

****Your Current Jobs****

   JOBID PRIOR   ST     ACCOUNT  PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST
 1931308  1012    R         xxx      batch     3  36        72K 5-00:00:00 comp208t,comp209t,comp210t
 1935896  1004    R         xxx      batch     1  12        24K 2-12:00:00 comp186t
 1935867  1003    R         xxx      batch     1   6        12K 2-12:00:00 comp050t
 1934798  1003    R         xxx      batch     1   6        12K 2-12:00:00 comp049t

****Group's Jobs****

Account:yxk

   JOBID       USER PRIOR   ST  PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST

Here, the group can run on up to 256 processors. The members of the group are already using 60 of those processors (36 + 12 + 6 + 6).

sreport

sreport is used for generating reports of job usage and cluster utilization. It queries the SLURM database to obtain this information. By default information will be shown for jobs run since midnight of the current day. Some examples:
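For instance, to report overall cluster utilization for a date range (the dates below are placeholders):

sreport cluster utilization start=2016-03-01 end=2016-04-01

or to list the top five users by usage over the same period:

sreport user topusage start=2016-03-01 end=2016-04-01 TopCount=5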

For more information about sreport see: http://slurm.schedmd.com/sreport.html

sstat

Display various status information of a running job/step (Refer to SLURM man page).

sstat -j <jobID>

Very important: if you are submitting the job using sbatch, please include srun before your executable in your SLURM batch script, as shown:

srun ./<executable>

To select only the fields of interest:

sstat -p --format=AveCPU,AvePages,AveRSS,MaxRSSNode,AveVMSize,NTasks,JobID -j 661587

output:

AveCPU|AvePages|AveRSS|MaxRSSNode|AveVMSize|NTasks|JobID|
00:00.000|0|2264K|comp150t|119472K|1|661587.0|

To estimate how much memory your job is consuming, run the top command on the node where the job is running. First, find the node:

 sq | grep <caseID>

output:

     1958082 batch     Tumor-PIPE-a   <caseID>  R    19:40:29     1   4   1002 comp153t

Then run top on that node:

ssh -t comp153t top

output: 

  PID   USER   PR NI  VIRT RES  SHR S  %CPU %MEM     TIME+ COMMAND
21348 jxw773   20  0 14.9g 14g 1072 S 400.0 22.6 678:30.15 bwa

Note that 22.6% of the node's 64 GB comes out to about 15 GB of memory, matching the RES column.

Job Dependency

sbatch has a "--dependency" switch that will defer running a job until a list of other jobs has completed:

https://slurm.schedmd.com/sbatch.html#OPT_dependency
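For example (a minimal sketch; the script names are placeholders), to run a post-processing job only after a first job finishes successfully:

jobid=$(sbatch --parsable first_step.slurm)

sbatch --dependency=afterok:$jobid post_process.slurm

The --parsable option makes sbatch print just the job ID, and afterok means the second job starts only if the first completes with exit code zero.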

Slurm Efficiency (seff)

seff reports how efficiently a completed job used its allocated CPUs and memory:

seff <jobID>

output:

Job ID: <jobID>
Cluster: smaster2
User/Group: <userID>/<groupID>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 24
CPU Utilized: 00:00:28
CPU Efficiency: 0.01% of 5-05:34:24 core-walltime
Memory Utilized: 50.90 GB (estimated maximum)
Memory Efficiency: 79.52% of 64.00 GB (64.00 GB/node)