scontrol

scontrol is used for monitoring and modifying queued jobs, and for holding and releasing them. One of its most useful subcommands is scontrol show job. Some useful scontrol commands are shown below:

Example: Show a job

scontrol show job <jobid>    # get the jobID using squeue -u <caseID>

output:

JobId=136355 JobName=xxxxx

   UserId=xxxx(yyyy) GroupId=xxx(yyy)

   Priority=3007 Nice=0 Account=gray QOS=normal

   JobState=RUNNING Reason=None Dependency=(null)

   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0

   RunTime=20:07:27 TimeLimit=13-07:00:00 TimeMin=N/A

   SubmitTime=2016-01-18T15:37:55 EligibleTime=2016-01-18T15:37:55

   StartTime=2016-01-18T15:37:56 EndTime=2016-01-31T22:37:56

   PreemptTime=None SuspendTime=None SecsPreSuspend=0

   Partition=batch AllocNode:Sid=hpctest:39249

   ReqNodeList=(null) ExcNodeList=(null)

   NodeList=comp148t

   BatchHost=comp148t

   NumNodes=1 NumCPUs=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*

   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*

   MinCPUsNode=1 MinMemoryNode=48G MinTmpDiskNode=0

   Features=(null) Gres=(null) Reservation=(null)

   Shared=OK Contiguous=0 Licenses=(null) Network=(null)

   Command=/home/xxxx/AAA.sh

   WorkDir=/home/xxxx/BBB

   StdErr=/home/xxxx/OOO.o

   StdIn=/dev/null

   StdOut=/home/xxx/OOOO.o

   Power= SICP=0

If the job is pending, the output also shows the reason it is pending:

...

JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:gpu017t,gpu018t,gpu019t,gpu020t,gpu021t,gpu022t,gpu023t,gpu024t) Dependency=(null)

Here, the Reason field shows that the job is waiting for resources. The GPU nodes are listed because they are currently offline.
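
To check only the state and the pending reason without reading the full listing, the output can be filtered. The following is a minimal sketch, not a Slurm feature: the helper name job_field is hypothetical, and it assumes the value contains no whitespace, which holds for JobState and Reason in the output above.

```shell
# job_field KEY: read `scontrol show job` output on stdin and print the value of KEY.
# Hypothetical helper, not part of Slurm. Assumes the value contains no whitespace.
job_field() {
    # Put every Key=Value pair on its own line, then print the first one that starts with KEY=.
    tr -s '[:space:]' '\n' |
        awk -v key="$1" '!found && index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
            found = 1
        }'
}

# Usage (requires Slurm; replace <jobid> with a real job ID):
#   scontrol show job <jobid> | job_field JobState
#   scontrol show job <jobid> | job_field Reason
```

For the pending job above, the first call would print PENDING and the second would print the ReqNodeNotAvail(...) string.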

Example: Show a node

scontrol show node comp009t

output:

NodeName=comp009t Arch=x86_64 CoresPerSocket=1

   CPUAlloc=1 CPUErr=0 CPUTot=12 CPULoad=0.96 Features=hex24gb

   Gres=(null)

   NodeAddr=comp009t NodeHostName=comp009t Version=15.08

   OS=Linux RealMemory=23000 AllocMem=16384 Sockets=12 Boards=1

   State=MIXED ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A

   BootTime=2016-03-02T13:58:01 SlurmdStartTime=2016-03-17T08:26:18

   CapWatts=n/a

   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

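Here, the number of processors (ncpus) is given by CPUTot=12, and the node's memory (availmem) by RealMemory=23000, which is in MB (~23 GB). RealMemory is the memory configured for the node, not the amount currently free; AllocMem shows how much of it is allocated to jobs.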
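
The unallocated CPUs and memory can be worked out from these fields as CPUTot - CPUAlloc and RealMemory - AllocMem. The sketch below does that arithmetic; the helper name node_free is hypothetical (it is not part of Slurm), and "unallocated" here means not assigned to jobs by the scheduler, which may differ from what is actually free on the node.

```shell
# node_free: read `scontrol show node` output on stdin and print the
# CPUs and memory (in MB) not yet allocated to jobs.
# Hypothetical helper, not part of Slurm.
node_free() {
    # Put every Key=Value pair on its own line, then pick out the four fields needed.
    tr -s '[:space:]' '\n' | awk -F= '
        $1 == "CPUTot"     { cpu_tot = $2 }
        $1 == "CPUAlloc"   { cpu_alloc = $2 }
        $1 == "RealMemory" { mem_tot = $2 }
        $1 == "AllocMem"   { mem_alloc = $2 }
        END {
            printf "free_cpus=%d free_mem_mb=%d\n", cpu_tot - cpu_alloc, mem_tot - mem_alloc
        }
    '
}

# Usage (requires Slurm):
#   scontrol show node comp009t | node_free
```

For the comp009t output above, this gives free_cpus=11 free_mem_mb=6616 (12 - 1 CPUs and 23000 - 16384 MB).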

As a shortcut, use the chk command with a node name:

chk compt320

output:

GRES (Generic Resource) is printed after each jobid

Hostname       Partition     Node Num_CPU  CPUload  Memsize  Freemem  GRES/node    Joblist

compt320          batch*     comp* 16  40   80.08*   191000   108909  (null)       18476787 jgs121 N/A  


NodeName=compt320 Arch=x86_64 CoresPerSocket=20 

   CPUAlloc=16 CPUEfctv=40 CPUTot=40 CPULoad=80.08

   AvailableFeatures=icosa192gb,rds

   ActiveFeatures=icosa192gb,rds

   Gres=(null)

   NodeAddr=compt320 NodeHostName=compt320 Version=22.05.2

   OS=Linux 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020 

   RealMemory=191000 AllocMem=32768 FreeMem=108909 Sockets=2 Boards=1

   State=MIXED+COMPLETING ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A MCS_label=N/A

   Partitions=batch 

   BootTime=2023-01-11T17:05:08 SlurmdStartTime=2023-01-23T10:21:03

   LastBusyTime=2023-01-24T21:41:56

   CfgTRES=cpu=40,mem=191000M,billing=40

   AllocTRES=cpu=16,mem=32G

   CapWatts=n/a

   CurrentWatts=0 AveWatts=0

   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
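
Example: Hold and release a job

scontrol hold <jobid> keeps a pending job from starting, and scontrol release <jobid> makes it eligible to run again. The sketch below wraps the hold in a check of the job state, on the assumption that a hold is only wanted while the job is still pending; the function name hold_if_pending is hypothetical and not part of Slurm.

```shell
# hold_if_pending JOBID: hold the job only if its JobState is PENDING.
# Hypothetical helper, not part of Slurm.
hold_if_pending() {
    # Read JobState from the `scontrol show job` listing.
    _state=$(scontrol show job "$1" | tr -s '[:space:]' '\n' |
        awk -F= '$1 == "JobState" { print $2 }')
    if [ "$_state" = "PENDING" ]; then
        scontrol hold "$1"
    else
        echo "job $1 is in state '$_state'; not holding" >&2
        return 1
    fi
}

# Usage (requires Slurm; replace <jobid> with a real job ID):
#   hold_if_pending <jobid>
#   scontrol release <jobid>    # let the held job run again
```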

For more information about scontrol, see: http://slurm.schedmd.com/scontrol.html