Monitoring & Deleting Jobs

SLURM Commands

On the login node, to get a status report on all jobs that have been submitted to SLURM but have not yet completed, use any of the commands below. Use --help to see the available options for each:

 squeue --help

 scontrol --help

 sstat --help

Job Status

For brief status of your jobs, use the command:

squeue -u <caseID>

output:        

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

  661587     batch     bash   sxg125  R      22:21      1 comp150t

Note the job ID (661587), the status of the job (R -> Running), and the compute node (comp150t) on which the job is running.
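If you only want to check on a single job, you can also pass its job ID directly to squeue; using the job ID from the example above:

squeue -j 661587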

If you want to check your group's allocation and the resources used by other members of the group, use the information (i) command:

i

output:

****Your SLURM's CPU Quota****

                 xxx      256 

****Your Current Jobs****

   JOBID PRIOR   ST     ACCOUNT  PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST

 1931308  1012    R         xxx     batch     3  36        72K 5-00:00:00 comp208t,comp209t,comp210t

 1935896  1004    R         xxx      batch     1  12        24K 2-12:00:00 comp186t

 1935867  1003    R         xxx      batch     1   6        12K 2-12:00:00 comp050t

 1934798  1003    R         xxx      batch     1   6        12K 2-12:00:00 comp049t

****Group's Jobs****

Account:yxk

   JOBID       USER PRIOR   ST  PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST

Here, the group can use up to 256 processors. The members of the group have already used 60 processors (36 + 12 + 6 + 6) out of that allocation.

If you would like complete details about your job, such as which node it is running on, how much physical memory it is consuming, and so on, use the command below. You may also want to use the top command described in the "Top Command" section below:

sstat -p --format=AveCPU,AvePages,AveRSS,MaxRSSNode,AveVMSize,NTasks,JobID -j <jobID>

output:

AveCPU|AvePages|AveRSS|MaxRSSNode|AveVMSize|NTasks|JobID|

00:00.000|0|2264K|comp150t|119472K|1|661587.0|

RSS (resident set size) is the portion of memory occupied by a process that is held in main memory (RAM). The job is currently using 2264K of RAM (physical memory) and is running on compute node comp150t.
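Note that sstat only reports on running jobs. For a job that has already completed, you can query the accounting records instead with sacct, for example:

sacct -j <jobID> --format=JobID,JobName,MaxRSS,Elapsed,State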

Very Important: If you are submitting the job using sbatch, please include srun before your executable in your SLURM batch script, as shown:

srun ./<executable>
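For context, a minimal batch script might look like the following sketch; the job name, partition, resource requests, and executable name are placeholders and should be adjusted for your own job:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# launching the executable through srun lets SLURM track the job step,
# so that sstat can report its statistics
srun ./<executable>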

Also, the SLURM command srun does not always work properly when used in an MPI context and can produce errors. In that case, do not use srun to launch parallel (MPI) jobs.

Top Command

Use squeue command to know where your job is running:

squeue -u <CaseID>

output:

217xxxx       smp ixxx    <caseID>  R   21:23:52      1 smp05t

So, the job is running on smp05t.

Now, let's check what percentage of the CPU and memory the job is currently using. Note that you can only use this command on a node where your job is running.

ssh -t smp05t top

output:

top - 10:40:15 up 21:30,  1 user,  load average: 1.13, 1.18, 1.20

Tasks: 873 total,   2 running, 871 sleeping,   0 stopped,   0 zombie

Cpu(s):  2.5%us,  0.0%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Mem:  1058718920k total, 612202016k used, 446516904k free,   139584k buffers

Swap:  8388604k total,        0k used,  8388604k free, 64442148k cached

   PID  USER         PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                          

  7183  <caseID>     20   0  514g 514g 1280 R 99.8 51.0   1288:02 impute2    

....

Here, the job impute2 is using nearly 100% of one CPU core (a serial job) and 51% of the node's total memory of 1058718920k.
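If the node is busy with other users' processes, you can limit the display to your own with top's -u option (substitute your own CaseID and node):

ssh -t smp05t top -u <caseID>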

Press Ctrl+C (or q) to exit top.

        

Pending/Blocked Job Status

Sometimes you may wonder why your job is still in the queue or in a hold state. You may have requested more resources than are currently available. Check your job using:

scontrol show job <Job ID>

output:

...

JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:gpu017t,gpu018t,gpu019t,gpu020t,gpu021t,gpu022t,gpu023t,gpu024t) Dependency=(null)

Here, it shows that the job is waiting for resources. The GPU nodes are listed because they are currently offline. For more information, refer to the access policies.
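To list all of your pending jobs along with the reason each one is waiting, you can use squeue's state filter and format options, for example:

squeue -u <caseID> -t PENDING -o "%.9i %.9P %.8j %.2t %R"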

To see the estimated start time and end time of your jobs, use:

squeue -u <CaseID> -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S %e"

output:

   

JOBID PARTITION     NAME     USER ST       TIME  NODES START_TIME END_TIME

   676101     batch      JOB   sxg125 PD       0:00      1 2016-04-09T15:25:21 

   606057     batch      JOB   sxg125  R 8-01:08:45      1 2016-03-31T14:17:02 2016-04-31T14:17:02

   606056     batch      JOB   sxg125  R 8-01:10:16      1 2016-03-31T14:15:31 2016-03-31T14:15:31

The job 676101 is estimated to start on April 09 at 15:25 and the end time of job 606057 is April 31 at 14:17.
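For pending jobs, squeue can also report SLURM's estimated start times directly with the --start option:

squeue -u <caseID> --start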

Email Notification

(Important note: If you have many small jobs, please refrain from using email notification; the volume of messages can congest the mail server.)

Rather than checking on your job interactively, you may want to receive notifications via email.

You can request email notification of job status from within the SLURM script. For example, to set the destination address:

#SBATCH --mail-user=<email-address>

and to request notification when the job ends:

#SBATCH --mail-type=end

Note that these notifications are sent to the email address specified above. Other options for --mail-type are begin, fail, and all.
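For example, to be notified both when the job ends and if it fails, the two directives can be combined in a single script (the address below is a placeholder):

#SBATCH --mail-user=<email-address>

#SBATCH --mail-type=end,fail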

Node Status

Check the time left on the jobs running on the nodes to estimate when those nodes will become idle. Note that the time left on a job does not necessarily mean the job will run that long, but it is an indicator.

squeue -O timeleft,nodelist | grep aisc

output:

7-05:29:16          aisct02             

10-21:58:59         aisct03             

4-21:03:01          aisct01    
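To see which nodes in a partition are currently idle, the standard sinfo command can also be used (the partition name here is an example; substitute your own):

sinfo -p batch -t idle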

Sometimes you may need to check the available processors on a particular node. You can issue this command:

chk <node>

output:

NodeName=comp001t Arch=x86_64 CoresPerSocket=1

   CPUAlloc=4 CPUErr=0 CPUTot=12 CPULoad=1.81 Features=hex24gb

   Gres=(null)

   NodeAddr=comp001t NodeHostName=comp001t Version=15.08

   OS=Linux RealMemory=23000 AllocMem=20184 Sockets=12 Boards=1

   State=MIXED ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A

   BootTime=2015-11-24T15:51:21 SlurmdStartTime=2016-03-16T16:55:24

   CapWatts=n/a

   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

   

            731986     batch    IODis    dch69  R   19:15:06      1 comp001t

            732802     batch     bash    jrf16  R      27:22      1 comp001t

            732867     batch     bash    crp68  R      11:24      1 comp001t
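If the chk wrapper is not available, similar information can be obtained with standard SLURM commands; scontrol reports the node's configuration and state, and squeue lists the jobs running on it (substitute the node name, e.g. comp001t):

scontrol show node <node>

squeue -w <node>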

Partial Output & Temporary Job Files

The output file (<jobname>.o<JobID>) is generated in the working directory as soon as the job starts running. To view the partial output in your job file, issue the command below. You may need to wait for output to appear, since file copying may take place before execution begins.

cat <jobname>.o<JobID>

The following example is from a MATLAB job:

Your job is running in:

gpu022

                            < M A T L A B (R) >

                  Copyright 1984-2012 The MathWorks, Inc.

                    R2012b (8.0.0.783) 64-bit (glnxa64)

                              August 22, 2012

To get started, type one of these: helpwin, helpdesk, or demo.

For product information, visit www.mathworks.com.

z =

  Columns 1 through 4

  47.7333 + 0.7464i  13.7401 + 0.9445i  38.8252 - 2.3564i  41.8609 + 3.0742i

If you want to follow the output as it is updated, issue the following command. Press Ctrl+C to exit:

tail -f <jobname>.o<JobID>

Deleting Jobs

To delete a job from the queue, or to kill a job that is already running, use the following command on the login node (the -i option asks for confirmation before cancelling):

 scancel -i <JobID>

To kill multiple jobs, i.e., all jobs associated with your CaseID, use the command:

scancel -u <caseID>
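If you only want to cancel a subset of your jobs, scancel also supports filtering, for example by job state or by job name:

scancel -u <caseID> --state=PENDING

scancel -u <caseID> --name=<jobname>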