Monitoring & Deleting Jobs
SLURM Commands
On the login node, to get a status report on jobs that have been submitted via SLURM but have not yet completed, use any of these commands (use --help to see the available options):
squeue --help
scontrol --help
sstat --help
Job Status
For brief status of your jobs, use the command:
squeue -u <caseID>
output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
661587 batch bash sxg125 R 22:21 1 comp150t
Note the job ID (661587), the status of the job (R -> Running), and the compute node (comp150t) on which the job is running.
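The squeue fields can be pulled apart with standard tools. As a sketch, here is the sample output above replayed through awk to extract just the node for one job ID (on the cluster you would pipe squeue -u <caseID> into awk instead of printf):

```shell
# Replay the sample squeue output and print the node (8th field) for job 661587
printf '%s\n' \
  'JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)' \
  '661587 batch bash sxg125 R 22:21 1 comp150t' |
awk '$1 == "661587" { print $8 }'   # prints comp150t
```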
If you want to check your group allocation and the resources used by other members in the group, use the information (i) command:
i
output:
****Your SLURM's CPU Quota****
xxx 256
****Your Current Jobs****
JOBID PRIOR ST ACCOUNT PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST
1931308 1012 R xxx batch 3 36 72K 5-00:00:00 comp208t,comp209t,comp210t
1935896 1004 R xxx batch 1 12 24K 2-12:00:00 comp186t
1935867 1003 R xxx batch 1 6 12K 2-12:00:00 comp050t
1934798 1003 R xxx batch 1 6 12K 2-12:00:00 comp049t
****Group's Jobs****
Account:yxk
JOBID USER PRIOR ST PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST
Here, the group can run up to 256 processors. The members of the group have already used 60 processors (36 + 12 + 6 + 6) of that allocation.
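The 60-processor figure can be reproduced by summing the CPU column (the 7th field) of the job lines. A small sketch using the sample lines above:

```shell
# Sum the CPU column (7th field) of the running jobs from the sample `i` output
printf '%s\n' \
  '1931308 1012 R xxx batch 3 36 72K 5-00:00:00 comp208t,comp209t,comp210t' \
  '1935896 1004 R xxx batch 1 12 24K 2-12:00:00 comp186t' \
  '1935867 1003 R xxx batch 1 6 12K 2-12:00:00 comp050t' \
  '1934798 1003 R xxx batch 1 6 12K 2-12:00:00 comp049t' |
awk '{ used += $7 } END { print used }'   # prints 60
```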
If you would like complete details about your job, such as which node it is running on, how much physical memory it is consuming, and so on, use the command below. You may also want to use the "top" command described in the "Top Command" section below:
sstat -p --format=AveCPU,AvePages,AveRSS,MaxRSSNode,AveVMSize,NTasks,JobID -j <jobID>
output:
AveCPU|AvePages|AveRSS|MaxRSSNode|AveVMSize|NTasks|JobID|
00:00.000|0|2264K|comp150t|119472K|1|661587.0|
RSS is the portion of memory occupied by a process that is held in main memory (RAM). The job has currently used 2264K of RAM (physical memory) and it is running on compute node comp150t.
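The sstat memory figures are reported in kilobytes; a quick conversion of the AveRSS value above to megabytes:

```shell
# Convert the AveRSS figure (2264K) from kilobytes to megabytes
awk 'BEGIN { printf "%.2f\n", 2264 / 1024 }'   # prints 2.21
```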
Very Important: If you are submitting the job using sbatch, please include srun before your executable in your SLURM batch script, as shown:
srun ./<executable>
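A minimal batch script illustrating this placement of srun (the job name, resource values, and executable name below are illustrative placeholders, not site defaults):

```shell
# Write a minimal SLURM batch script; all values are example placeholders
cat > myjob.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# Launch the executable under srun so SLURM can track it with sstat
srun ./my_program
EOF
```

Submit it with: sbatch myjob.slurm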
Also, the SLURM command "srun" does not seem to work properly when used in an MPI context. So, don't use srun for MPI parallel jobs.
Top Command
Use squeue command to know where your job is running:
squeue -u <CaseID>
output:
217xxxx smp ixxx <caseID> R 21:23:52 1 smp05t
So, the job is running on smp05t.
Now, let's check what percentage of CPU and memory the job is currently using. Note that you can only use this command if your job is running on that node.
ssh -t smp05t top
output:
top - 10:40:15 up 21:30, 1 user, load average: 1.13, 1.18, 1.20
Tasks: 873 total, 2 running, 871 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.5%us, 0.0%sy, 0.0%ni, 97.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1058718920k total, 612202016k used, 446516904k free, 139584k buffers
Swap: 8388604k total, 0k used, 8388604k free, 64442148k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7183 <caseID> 20 0 514g 514g 1280 R 99.8 51.0 1288:02 impute2
....
Here, the job impute2 is using nearly 100% of one CPU (a serial job) and 51% of the total memory of 1058718920k.
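The %MEM figure can be checked by hand: RES is 514g and total memory is 1058718920k, and top uses binary units (1g = 1024*1024k), so:

```shell
# RES (514g) as a percentage of total memory (1058718920k); 1g = 1024*1024k in top
awk 'BEGIN { printf "%.1f\n", 514 * 1024 * 1024 / 1058718920 * 100 }'   # prints 50.9
```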
Press Ctrl + C to exit from the top terminal.
Pending/Blocked Job Status
Sometimes you may wonder why your job is sitting in the queue or in a batch hold status. You may have exceeded the available resources. Check your job using:
scontrol show job <Job ID>
output:
...
JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:gpu017t,gpu018t,gpu019t,gpu020t,gpu021t,gpu022t,gpu023t,gpu024t) Dependency=(null)
Here, it shows that the job is waiting for resources. The GPU nodes are listed because they are currently offline. For more information, refer to the access policies.
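The scontrol output is verbose; you can filter it down to just the state and reason fields. As a sketch, here is that filter replayed on a saved line (the node list from the sample is omitted for brevity). On the cluster you would pipe scontrol show job <JobID> into the same grep:

```shell
# Extract the JobState and Reason fields from a saved scontrol line
line='JobState=PENDING Reason=ReqNodeNotAvail Dependency=(null)'
echo "$line" | grep -oE '(JobState|Reason)=[^ (]*'
# prints:
#   JobState=PENDING
#   Reason=ReqNodeNotAvail
```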
See the start time and end time of the job
squeue -u <CaseID> -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S %e"
output:
JOBID PARTITION NAME USER ST TIME NODES START_TIME END_TIME
676101 batch JOB sxg125 PD 0:00 1 2016-04-09T15:25:21
606057 batch JOB sxg125 R 8-01:08:45 1 2016-03-31T14:17:02 2016-04-31T14:17:02
606056 batch JOB sxg125 R 8-01:10:16 1 2016-03-31T14:15:31 2016-03-31T14:15:31
The job 676101 is estimated to start on April 09 at 15:25 and the end time of job 606057 is April 31 at 14:17.
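With GNU date, those ISO timestamps can be turned into elapsed time. A sketch computing the days between job 606057's start time and a later timestamp (the second timestamp is an arbitrary example, not from the output above):

```shell
# Days between two ISO timestamps using GNU date; values are illustrative
start=$(date -u -d '2016-03-31T14:17:02' +%s)
later=$(date -u -d '2016-04-05T14:17:02' +%s)
echo $(( (later - start) / 86400 ))   # prints 5
```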
Email Notification
(Important note: If you have many small jobs, please refrain from using email notification; the flood of messages congests the mail server.)
Rather than checking on your job interactively, you may want to receive the notification via email.
It is also possible to request email notification of job status from within the SLURM script. For example:
#SBATCH --mail-user=<email-address>
#SBATCH --mail-type=end
The first directive sets the address the notifications are sent to; the second requests a notification when the job ends. Other options for --mail-type are: begin, fail, all.
Node Status
Check the time left before nodes become idle. Note that a job's time left does not necessarily mean the job will run that long, but it is an indicator.
squeue -O timeleft,nodelist | grep aisc
output:
7-05:29:16 aisct02
10-21:58:59 aisct03
4-21:03:01 aisct01
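To see which node should free up soonest, you can sort the output numerically by the leading day count. A sketch on the sample lines above (this simple sort assumes every time-left value has a D-HH:MM:SS days prefix; jobs with under a day remaining would need a smarter key):

```shell
# Sort sample timeleft lines by leading day count and take the smallest
printf '%s\n' \
  '7-05:29:16 aisct02' \
  '10-21:58:59 aisct03' \
  '4-21:03:01 aisct01' |
sort -n | head -1   # prints: 4-21:03:01 aisct01
```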
Sometimes you may need to check the available processors on a particular node. You can issue this command:
chk <node>
output:
NodeName=comp001t Arch=x86_64 CoresPerSocket=1
CPUAlloc=4 CPUErr=0 CPUTot=12 CPULoad=1.81 Features=hex24gb
Gres=(null)
NodeAddr=comp001t NodeHostName=comp001t Version=15.08
OS=Linux RealMemory=23000 AllocMem=20184 Sockets=12 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A
BootTime=2015-11-24T15:51:21 SlurmdStartTime=2016-03-16T16:55:24
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
731986 batch IODis dch69 R 19:15:06 1 comp001t
732802 batch bash jrf16 R 27:22 1 comp001t
732867 batch bash crp68 R 11:24 1 comp001t
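The number of free CPUs on the node is CPUTot minus CPUAlloc. A sketch parsing those fields from the sample line above (on the cluster you would feed the chk output into the same awk):

```shell
# Parse CPUAlloc and CPUTot from a saved chk line and print the free CPU count
line='CPUAlloc=4 CPUErr=0 CPUTot=12 CPULoad=1.81 Features=hex24gb'
echo "$line" | awk -F'[= ]' '{
    for (i = 1; i < NF; i++) {
        if ($i == "CPUAlloc") alloc = $(i + 1)
        if ($i == "CPUTot")  total = $(i + 1)
    }
    print total - alloc   # prints 8
}'
```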
Partial Output & Temporary Job Files
An output file (<jobname>.o<JobID>) will be generated in the working directory as soon as the job starts running. To view the partial output in your job file, issue the following command. You may need to wait for output to appear, since file copying may take place before execution begins.
cat <jobname>.o<JobID>
The following example is for MATLAB job:
Your job is running in:
gpu022
< M A T L A B (R) >
Copyright 1984-2012 The MathWorks, Inc.
R2012b (8.0.0.783) 64-bit (glnxa64)
August 22, 2012
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
z =
Columns 1 through 4
47.7333 + 0.7464i 13.7401 + 0.9445i 38.8252 - 2.3564i 41.8609 + 3.0742i
If you want to follow the output as it is updated, issue the following command. Press Ctrl + C to exit:
tail -f <jobname>.o<JobID>
Deleting Jobs
To delete a job from the queue, or to kill a job that is already running, use the following command on the login node (the -i option asks for confirmation before canceling):
scancel -i <JobID>
To kill all jobs associated with your CaseID, use the command:
scancel -u <caseID>