scontrol
scontrol is used to monitor and modify queued jobs, and to hold and release them. One of its most useful subcommands is scontrol show job, which prints the full record of a job. Below are some useful scontrol commands:
Example: Show a job
scontrol show job <jobid> # get the jobID using squeue -u <caseID>
output:
JobId=136355 JobName=xxxxx
UserId=xxxx(yyyy) GroupId=xxx(yyy)
Priority=3007 Nice=0 Account=gray QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=20:07:27 TimeLimit=13-07:00:00 TimeMin=N/A
SubmitTime=2016-01-18T15:37:55 EligibleTime=2016-01-18T15:37:55
StartTime=2016-01-18T15:37:56 EndTime=2016-01-31T22:37:56
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=batch AllocNode:Sid=hpctest:39249
ReqNodeList=(null) ExcNodeList=(null)
NodeList=comp148t
BatchHost=comp148t
NumNodes=1 NumCPUs=8 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=48G MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
Shared=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/xxxx/AAA.sh
WorkDir=/home/xxxx/BBB
StdErr=/home/xxxx/OOO.o
StdIn=/dev/null
StdOut=/home/xxx/OOOO.o
Power= SICP=0
If the job is pending, the output also shows the reason:
...
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:gpu017t,gpu018t,gpu019t,gpu020t,gpu021t,gpu022t,gpu023t,gpu024t) Dependency=(null)
Here, Reason=ReqNodeNotAvail shows that the job is waiting for resources. The GPU nodes are listed because they are currently offline.
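When only one or two fields are needed, the output of scontrol show job can be filtered instead of read in full. The following is a minimal sketch, not a site-provided tool: job_field is a hypothetical helper name, and it assumes the KEY=VALUE layout shown above, where fields are separated by whitespace and the values themselves contain no spaces.

```shell
# job_field KEY: read "scontrol show job" output on stdin and print the
# value of the first KEY=VALUE field found. Hypothetical helper; assumes
# values contain no whitespace, as in the sample output above.
job_field() {
    # Put every KEY=VALUE pair on its own line, then match the key exactly.
    tr -s ' \t' '\n' | awk -v key="$1" '
        !found && index($0, key "=") == 1 {
            print substr($0, length(key) + 2)
            found = 1
        }
    '
}

# Usage (requires a Slurm cluster):
#   scontrol show job <jobid> | job_field JobState
#   scontrol show job <jobid> | job_field Reason
```

Fields whose values can contain spaces (for example, a Command path with spaces) would need a different approach.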
Example: Show a node
scontrol show node comp009t
output:
NodeName=comp009t Arch=x86_64 CoresPerSocket=1
CPUAlloc=1 CPUErr=0 CPUTot=12 CPULoad=0.96 Features=hex24gb
Gres=(null)
NodeAddr=comp009t NodeHostName=comp009t Version=15.08
OS=Linux RealMemory=23000 AllocMem=16384 Sockets=12 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A
BootTime=2016-03-02T13:58:01 SlurmdStartTime=2016-03-17T08:26:18
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
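Here, CPUTot shows that the number of processors (ncpus) is 12, and RealMemory shows that the node's memory (availmem) is 23000 MB (about 23 GB), of which 16384 MB (AllocMem) is currently allocated to jobs.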
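The same fields give a quick estimate of what is still unallocated on a node: CPUTot minus CPUAlloc, and RealMemory minus AllocMem. The sketch below is illustrative only; node_free is a hypothetical helper name, and it assumes the KEY=VALUE layout shown above. Note that this is memory not yet allocated by Slurm, which is not the same as the memory the operating system reports as free.

```shell
# node_free: read "scontrol show node" output on stdin and print the CPUs
# and memory (in MB) not yet allocated to jobs. Hypothetical helper.
node_free() {
    # Put every KEY=VALUE pair on its own line, then split on "=".
    tr -s ' \t' '\n' | awk -F= '
        $1 == "CPUTot"     { cpu_tot = $2 }
        $1 == "CPUAlloc"   { cpu_alloc = $2 }
        $1 == "RealMemory" { mem_tot = $2 }
        $1 == "AllocMem"   { mem_alloc = $2 }
        END {
            printf "free_cpus=%d free_mem_mb=%d\n", cpu_tot - cpu_alloc, mem_tot - mem_alloc
        }
    '
}

# Usage (requires a Slurm cluster):
#   scontrol show node comp009t | node_free
```

For the comp009t output above, this works out to 12 - 1 = 11 CPUs and 23000 - 16384 = 6616 MB.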
As a shortcut, use chk:
chk compt320
output:
GRES (Generic Resource) is printed after each jobid
Hostname  Partition  Node   Num_CPU  CPUload  Memsize  Freemem  GRES/node  Joblist
compt320  batch*     comp*  16 40    80.08*   191000   108909   (null)     18476787 jgs121 N/A
NodeName=compt320 Arch=x86_64 CoresPerSocket=20
CPUAlloc=16 CPUEfctv=40 CPUTot=40 CPULoad=80.08
AvailableFeatures=icosa192gb,rds
ActiveFeatures=icosa192gb,rds
Gres=(null)
NodeAddr=compt320 NodeHostName=compt320 Version=22.05.2
OS=Linux 3.10.0-1160.el7.x86_64 #1 SMP Tue Aug 18 14:50:17 EDT 2020
RealMemory=191000 AllocMem=32768 FreeMem=108909 Sockets=2 Boards=1
State=MIXED+COMPLETING ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A MCS_label=N/A
Partitions=batch
BootTime=2023-01-11T17:05:08 SlurmdStartTime=2023-01-23T10:21:03
LastBusyTime=2023-01-24T21:41:56
CfgTRES=cpu=40,mem=191000M,billing=40
AllocTRES=cpu=16,mem=32G
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
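Example: Hold and release a job
scontrol hold <jobid> keeps a pending job from starting, and scontrol release <jobid> makes it eligible to run again. The wrappers below are a minimal sketch: the names hold_job and release_job and the SCONTROL variable are hypothetical, introduced only so that the commands can be tried as a dry run without touching the scheduler.

```shell
# hold_job / release_job: thin wrappers around "scontrol hold" and
# "scontrol release". The wrapper names and the SCONTROL variable are
# hypothetical; SCONTROL lets the command be swapped out for a dry run.
SCONTROL="${SCONTROL:-scontrol}"

hold_job()    { $SCONTROL hold "$1"; }
release_job() { $SCONTROL release "$1"; }

# Usage on a Slurm cluster:
#   hold_job <jobid>      # a held pending job will not start
#   release_job <jobid>   # make it eligible to run again
#
# Dry run (prints the command instead of running it):
#   SCONTROL="echo scontrol"; hold_job <jobid>
```
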
For more information about scontrol see: http://slurm.schedmd.com/scontrol.html