Monitoring & Deleting Jobs

SLURM Commands

On the login node, to get a status report on all jobs that have been submitted to SLURM but have not yet completed, use any of the commands below. Use --help to see the available options for each:

 squeue --help

 scontrol --help

 sstat --help

Job Status

For brief status of your jobs, use the command:

squeue -u <caseID>

output:        

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

  661587     batch     bash   sxg125  R      22:21      1 comp150t

Note the job ID (661587), the status of the job (R -> Running), and the compute node (comp150t) on which the job is running.
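If you only want to check on a single job, you can also pass its job ID directly to squeue; using the job ID from the example above:

squeue -j 661587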

If you want to check your group's allocation and the resources used by other members of the group, use the information (i) command:

i

output:

****Your SLURM's CPU Quota****

                 xxx      256 

****Your Current Jobs****

   JOBID PRIOR   ST     ACCOUNT  PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST

 1931308  1012    R         xxx     batch     3  36        72K 5-00:00:00 comp208t,comp209t,comp210t

 1935896  1004    R         xxx      batch     1  12        24K 2-12:00:00 comp186t

 1935867  1003    R         xxx      batch     1   6        12K 2-12:00:00 comp050t

 1934798  1003    R         xxx      batch     1   6        12K 2-12:00:00 comp049t

****Group's Jobs****

Account:yxk

   JOBID       USER PRIOR   ST  PARTITION NODES CPU MIN_MEMORY TIME_LIMIT NODELIST

Here, the group can use up to 256 processors. The members of the group have already used 60 processors (36 + 12 + 6 + 6) out of that allocation.

If you would like complete details about your job, such as which node it is running on, how much physical memory it is consuming, and so on, use the command below. You may also want to use the top command described in the "Top Command" section below:

sstat -p --format=AveCPU,AvePages,AveRSS,MaxRSSNode,AveVMSize,NTasks,JobID -j <jobID>

output:

AveCPU|AvePages|AveRSS|MaxRSSNode|AveVMSize|NTasks|JobID|

00:00.000|0|2264K|comp150t|119472K|1|661587.0|

RSS (resident set size) is the portion of memory occupied by a process that is held in main memory (RAM). The job is currently using 2264K of RAM (physical memory) and is running on compute node comp150t.
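Note that sstat only reports on running jobs. For a job that has already completed, you can query the accounting records instead with sacct, for example:

sacct -j <jobID> --format=JobID,JobName,MaxRSS,Elapsed,State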

Very Important: If you are submitting the job using sbatch, please include srun before your executable in your SLURM batch script, as shown:

srun ./<executable>
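For context, a minimal batch script might look like the following sketch; the job name, partition, resource requests, and executable name are placeholders and should be adjusted for your own job:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=batch
#SBATCH --ntasks=1
#SBATCH --time=01:00:00
#SBATCH --mem=4G

# launching the executable through srun lets SLURM track the job step,
# so that sstat can report its statistics
srun ./<executable>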

Also, the SLURM command srun does not always work properly when used in an MPI context and can produce errors. In that case, do not use srun to launch parallel (MPI) jobs.

Top Command

Use squeue command to know where your job is running:

squeue -u <CaseID>

output:

217xxxx       smp ixxx    <caseID>  R   21:23:52      1 smp05t

So, the job is running on smp05t.

Now, let's check what percentage of the CPU and memory the job is currently using. Note that you can only use this command on a node where your job is running.

ssh -t smp05t top

output:

top - 10:40:15 up 21:30,  1 user,  load average: 1.13, 1.18, 1.20

Tasks: 873 total,   2 running, 871 sleeping,   0 stopped,   0 zombie

Cpu(s):  2.5%us,  0.0%sy,  0.0%ni, 97.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st

Mem:  1058718920k total, 612202016k used, 446516904k free,   139584k buffers

Swap:  8388604k total,        0k used,  8388604k free, 64442148k cached

   PID  USER         PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                          

  7183  <caseID>     20   0  514g 514g 1280 R 99.8 51.0   1288:02 impute2    

....

Here, the job impute2 is using nearly 100% of one CPU core (a serial job) and 51% of the node's total memory of 1058718920k.
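If the node is busy with other users' processes, you can limit the display to your own with top's -u option (substitute your own CaseID and node):

ssh -t smp05t top -u <caseID>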

Press Ctrl+C (or q) to exit top.

        

Pending/Blocked Job Status

Sometimes you may wonder why your job is still in the queue or in a hold state. You may have requested more resources than are currently available. Check your job using:

scontrol show job <Job ID>

output:

...

JobState=PENDING Reason=ReqNodeNotAvail(Unavailable:gpu017t,gpu018t,gpu019t,gpu020t,gpu021t,gpu022t,gpu023t,gpu024t) Dependency=(null)

Here, it shows that the job is waiting for resources. The GPU nodes are listed because they are currently offline. For more information, refer to the access policies.
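To list all of your pending jobs along with the reason each one is waiting, you can use squeue's state filter and format options, for example:

squeue -u <caseID> -t PENDING -o "%.9i %.9P %.8j %.2t %R"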

To see the estimated start time and end time of your jobs, use:

squeue -u <CaseID> -o "%.9i %.9P %.8j %.8u %.2t %.10M %.6D %S %e"

output:

   

JOBID PARTITION     NAME     USER ST       TIME  NODES START_TIME END_TIME

   676101     batch      JOB   sxg125 PD       0:00      1 2016-04-09T15:25:21 

   606057     batch      JOB   sxg125  R 8-01:08:45      1 2016-03-31T14:17:02 2016-04-31T14:17:02

   606056     batch      JOB   sxg125  R 8-01:10:16      1 2016-03-31T14:15:31 2016-03-31T14:15:31

The job 676101 is estimated to start on April 09 at 15:25 and the end time of job 606057 is April 31 at 14:17.
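For pending jobs, squeue can also report SLURM's estimated start times directly with the --start option:

squeue -u <caseID> --start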

Email Notification

(Important note: If you have many small jobs, please refrain from using email notification; the volume of messages can congest the mail server.)

Rather than checking on your job interactively, you may want to receive notifications via email.

You can request email notification of job status from within the SLURM script. For example, to set the destination address:

#SBATCH --mail-user=<email-address>

and to request notification when the job ends:

#SBATCH --mail-type=end

Note that these notifications are sent to the email address specified above. Other options for --mail-type are begin, fail, and all.
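For example, to be notified both when the job ends and if it fails, the two directives can be combined in a single script (the address below is a placeholder):

#SBATCH --mail-user=<email-address>

#SBATCH --mail-type=end,fail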

Node Status

Check the time left on the jobs running on the nodes to estimate when those nodes will become idle. Note that the time left on a job does not necessarily mean the job will run that long, but it is an indicator.

squeue -O timeleft,nodelist | grep aisc

output:

7-05:29:16          aisct02             

10-21:58:59         aisct03             

4-21:03:01          aisct01    
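To see which nodes in a partition are currently idle, the standard sinfo command can also be used (the partition name here is an example; substitute your own):

sinfo -p batch -t idle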

Sometimes you may need to check the available processors on a particular node. You can issue this command:

chk <node>

output:

NodeName=comp001t Arch=x86_64 CoresPerSocket=1

   CPUAlloc=4 CPUErr=0 CPUTot=12 CPULoad=1.81 Features=hex24gb

   Gres=(null)

   NodeAddr=comp001t NodeHostName=comp001t Version=15.08

   OS=Linux RealMemory=23000 AllocMem=20184 Sockets=12 Boards=1

   State=MIXED ThreadsPerCore=1 TmpDisk=100000 Weight=1 Owner=N/A

   BootTime=2015-11-24T15:51:21 SlurmdStartTime=2016-03-16T16:55:24

   CapWatts=n/a

   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

   

            731986     batch    IODis    dch69  R   19:15:06      1 comp001t

            732802     batch     bash    jrf16  R      27:22      1 comp001t

            732867     batch     bash    crp68  R      11:24      1 comp001t
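If the chk wrapper is not available, similar information can be obtained with standard SLURM commands; scontrol reports the node's configuration and state, and squeue lists the jobs running on it (substitute the node name, e.g. comp001t):

scontrol show node <node>

squeue -w <node>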

Partial Output & Temporary Job Files

The output file (<jobname>.o<JobID>) is generated in the working directory as soon as the job starts running. To view the partial output in your job file, issue the command below. You may need to wait for output to appear, since file copying may take place before execution begins.

cat <jobname>.o<JobID>

The following example is from a MATLAB job:

Your job is running in:

gpu022

                            < M A T L A B (R) >

                  Copyright 1984-2012 The MathWorks, Inc.

                    R2012b (8.0.0.783) 64-bit (glnxa64)

                              August 22, 2012

To get started, type one of these: helpwin, helpdesk, or demo.

For product information, visit www.mathworks.com.

z =

  Columns 1 through 4

  47.7333 + 0.7464i  13.7401 + 0.9445i  38.8252 - 2.3564i  41.8609 + 3.0742i

If you want to follow the output as it is updated, issue the following command. Press Ctrl+C to exit:

tail -f <jobname>.o<JobID>

Deleting Jobs

To delete a job from the queue, or to kill a job that is already running, use the following command on the login node (the -i option asks for confirmation before cancelling):

 scancel -i <JobID>

To kill multiple jobs, i.e., all jobs associated with your CaseID, use the command:

scancel -u <caseID>
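If you only want to cancel a subset of your jobs, scancel also supports filtering, for example by job state or by job name:

scancel -u <caseID> --state=PENDING

scancel -u <caseID> --name=<jobname>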