
Sun Grid Engine

Submitting jobs to SCG3

On SCG3: By default, jobs have a memory limit of 3.7GB (per slot) and jobs in the standard queue have a runtime limit of 6 hours (wallclock, not CPU time).

Typical qsub command

NOTE: qsub does not accept commands directly (except from STDIN), so you should always submit a shell script. Also, $PATH is not exported to the node running the job, so use ABSOLUTE PATHS to any non-standard utilities you are using, such as R, Rscript, or MATLAB. Make no assumptions about what will be in the path.

qsub -q [queue] -w e -V -N [job_name] -l h_vmem=[memory] -l h_rt=[time] -l s_rt=[time] -pe shm [n_processors] -o [outputlogfile] -e [errorlogfile] [pathtoScript] [arg1] [arg2]
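For example, a submission might look like this (the queue, job name, resource values, and paths are placeholders to adapt to your own job):

qsub -q standard -w e -V -N myjob -l h_vmem=4G -l h_rt=04:00:00 -l s_rt=03:55:00 -pe shm 2 -o myjob.out -e myjob.err /home/myuser/run_analysis.sh input.txt output.txt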

In a shell script, you can set these options on lines that begin with #$, or pass them along with the qsub command (see the example submit script below):

-q <queue> --- set the queue
-V --- pass all environment variables to the job
-v var[=value] --- pass the specific environment variable 'var' to the job
-b y --- allow the command to be a binary file instead of a script
-w e --- verify options and abort if there is an error
-N <jobname> --- name of the job
-l h_vmem=size --- specify the amount of memory required (e.g. 3G or 3500M) (NOTE: this is memory per processor slot, so if you ask for 2 processors the total memory will be 2 x the h_vmem value)
-l h_rt=hh:mm:ss --- specify the maximum (hard) run time (hours, minutes and seconds)
-l s_rt=hh:mm:ss --- specify the soft run time limit (hours, minutes and seconds) - remember to set both s_rt and h_rt
-pe shm <n_processors> --- run a parallel job using pthreads or another shared-memory API
-cwd --- run the job in the current working directory
-wd <dir> --- set the working directory for this job
-j [y/n] --- whether to merge the output and error log files
-o <output_logfile> --- name of the output log file
-e <error_logfile> --- name of the error log file
-m ea --- send email when the job ends or aborts
-P <projectName> --- set the job's project
-M <emailaddress> --- email address to send email to
-t <start>-<end>:<incr> --- submit a job array with start index <start>, stop index <end>, and step size <incr>
-hold_jid <comma-separated list of job IDs; can also be a job ID pattern such as 2722*> --- start the current job/job array only after all jobs in the list have completed
-hold_jid_ad <job array ID, pattern or name> --- start each task in the current job array only after the corresponding task in the given job array has completed

The index numbers will be exported to the job tasks via the environment variable $SGE_TASK_ID. The option arguments <start>, <end> and <incr> will be available through the environment variables $SGE_TASK_FIRST, $SGE_TASK_LAST and $SGE_TASK_STEPSIZE.
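For example, a submit script that sets its options on #$ lines might look like the following sketch (the job name, resource values, and script/log paths are placeholders):

#!/bin/bash
#$ -N myjob
#$ -w e
#$ -V
#$ -l h_vmem=4G
#$ -l h_rt=04:00:00
#$ -l s_rt=03:55:00
#$ -pe shm 2
#$ -o myjob.out
#$ -e myjob.err
#$ -t 1-10

# Each array task processes one input file, selected by its $SGE_TASK_ID (1..10 here)
/usr/bin/Rscript /absolute/path/to/analyze.R input.${SGE_TASK_ID}.txt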

Other Options
  [-a date_time]                           request a start time
  [-ac context_list]                       add context variable(s)
  [-ar ar_id]                              bind job to advance reservation
  [-A account_string]                      account string in accounting record
  [-binding [env|pe|set] exp|lin|str]      binds job to processor cores
  [-c n s m x]                             define type of checkpointing for job
             n           no checkpoint is performed.
             s           checkpoint when batch server is shut down.
             m           checkpoint at minimum CPU interval.
             x           checkpoint when job gets suspended.
             <interval>  checkpoint in the specified time interval.
  [-ckpt ckpt-name]                        request checkpoint method
  [-clear]                                 skip previous definitions for job
  [-C directive_prefix]                    define command prefix for job script
  [-dc simple_context_list]                delete context variable(s)
  [-dl date_time]                          request a deadline initiation time
  [-h]                                     place user hold on job
  [-hard]                                  consider following requests "hard"
  [-help]                                  print this help
  [-i file_list]                           specify standard input stream file(s)
  [-js job_share]                          share tree or functional job share
  [-jsv jsv_url]                           job submission verification script to be used
  [-masterq wc_queue_list]                 bind master task to queue(s)
  [-notify]                                notify job before killing/suspending it
  [-now y[es]|n[o]]                        start job immediately or not at all
  [-p priority]                            define job's relative priority
  [-R y[es]|n[o]]                          reservation desired
  [-r y[es]|n[o]]                          define job as (not) restartable
  [-sc context_list]                       set job context (replaces old context)
  [-shell y[es]|n[o]]                      start command with or without wrapping <loginshell> -c
  [-soft]                                  consider following requests as soft
  [-sync y[es]|n[o]]                       wait for job to end and return exit code
  [-S path_list]                           command interpreter to be used
  [-verify]                                do not submit just verify
  [-w e|w|n|v|p]                           verify mode (error|warning|none|just verify|poke) for jobs
  [-@ file]                                read commandline input from file

Job queues

There are a number of job scheduling queues, each configured with different resource restrictions. In many cases the job scheduler will automatically select the appropriate queue based on the resources required by your job, but you can also specifically request a queue using qsub's "-q" option.
  • The test.q queue has a runtime limit of one hour, and you can only run one job (one slot) at a time. However, there is a dedicated node for these jobs, so generally they will be dispatched quickly.
    For your job to go into the test queue, you must specify "-l testq=1".
    e.g. qsub -l testq=1 test_script.sh
  • The standard queue has a runtime limit of six hours.
  • The extended queue has a runtime limit of seven days. Jobs in the extended queue may have to wait longer to be scheduled.
  • The large queue is a special queue for large-memory jobs (see Large-Memory Jobs). 
    You need to specify '-P large_mem' and also h_vmem > 50G or so.
    E.g. qsub -l h_vmem=10G -pe shm 16 -P large_mem test_script.sh 
    That will request 10G of h_vmem per slot, so 160G total. The large queue does not currently have a time limit, and the users of greenie tend to coordinate with each other about larger/longer jobs.
  • The seq_pipeline queue is a special queue for jobs related to the Center's sequencing pipeline.
You can force a job to run on a particular node by:
  • specifying both a queue name and a node name: qsub -q standard@scg1-2-10 myscript
  • specifying a node name via the "-l hostname=" option: qsub -l hostname=scg1-2-10 myscript

For interactive jobs

If you want your job to continue running even after you log out, use the screen command to create a new screen:
screen -S <screen_name>

To temporarily detach from the screen and return to the main terminal, use Ctrl-a followed by d.

To obtain the list of open screen IDs, use
screen -list

To return to the screen, use
screen -r <screen_id>
(Note that the <screen_id> is different from the <screen_name>.)

To get into an interactive shell, type qlogin with any resource options you need:

qlogin <resource_options>
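For example (the resource value is illustrative), you can request memory for the interactive session just as with qsub:

qlogin -l h_vmem=8G

Running qlogin inside a screen session, as described above, lets the interactive session survive a logout.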

Temporary files

On SCG3, the local nodes often do not have a large amount of temporary space, so you should make sure your code uses a temporary directory with sufficient disk space. SCG3 has 100TB of scratch space at /srv/gsfs0/scratch that you can use for temporary files.

You can usually set the TMP environment variable in your .bashrc or in your submit script:
mkdir -p /srv/gsfs0/scratch/<yourusername>
export TMP=/srv/gsfs0/scratch/<yourusername>
If your job creates temporary files, store them on the local disk of the cluster node. You can read the $TMPDIR environment variable to get the proper location for the temporary files (although many programs do this automatically). If a program accesses its input data in a non-sequential order, it will sometimes run faster if you copy the input data to $TMPDIR, run the program with its results stored in $TMPDIR, and then copy the results back to the shared storage when it is done. However, be careful not to fill up the local disk; currently most nodes have about 200GB of local space. SGE automatically removes the contents of $TMPDIR when your job is done.

You can create a randomly named directory in $TMPDIR using the following lines in your submit script:
TMP_DIR=$(mktemp -d -p "${TMPDIR}")
<run everything>
rm -rf "${TMP_DIR}"
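Putting this together, here is a sketch of the copy-in/copy-out pattern described above (the program name and data paths are placeholders):

#!/bin/bash
# Stage the input onto the node's fast local disk
cp /srv/gsfs0/projects/mydata/input.dat ${TMPDIR}/
# Run the program against the local copy, writing results locally
/absolute/path/to/myprogram ${TMPDIR}/input.dat -o ${TMPDIR}/output.dat
# Copy the results back to shared storage; SGE removes the contents of $TMPDIR afterwards
cp ${TMPDIR}/output.dat /srv/gsfs0/projects/mydata/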

Configuring array task dependencies

(Taken from http://wikis.sun.com/display/gridengine62u2/How+to+Configure+Array+Task+Dependencies+From+the+Command+Line)
Examples – Using Job Dependencies Versus Array Task Dependencies to Complete Array Jobs

The following example illustrates the difference between the job dependency facility and the task array dependency facility:

  • In the following example, array task B is dependent on array task A:
    $ qsub -t 1-3 A
    $ qsub -hold_jid A -t 1-3 B

    All the sub-tasks in job B will wait for sub-tasks 1, 2 and 3 of A to finish before starting. The tasks will be executed in the following approximate order: A.1, A.2, A.3, B.1, B.2, B.3, as shown below:

    | A.1 |     | B.1 |
    | A.2 | --> | B.2 |
    | A.3 |     | B.3 |
  • In the following example, each sub-task in array job B is dependent on each corresponding sub-task in job A in a one-to-one mapping:
    $ qsub -t 1-3 A
    $ qsub -hold_jid_ad A -t 1-3 B

    Sub-task B.1 will only start when A.1 completes, B.2 will only start once A.2 completes, and so on. On a single-machine render farm, the tasks could thus be executed in the following approximate order: A.1, B.1, A.2, B.2, A.3, B.3, as shown below:

    | A.1 | --> | B.1 |
    | A.2 | --> | B.2 |
    | A.3 | --> | B.3 |

    This option can only be specified when submitting an array job that depends on another array job with the same number of sub-tasks.

Examples – Using Array Task Dependencies to Chunk Tasks

When using 3D rendering applications, it is often more efficient to render several frames at once on the same CPU instead of distributing the frames across several machines. We refer to the generation of several frames at once as chunking.

When using the task dependency facility, the array task must have the same range of sub-tasks as its dependent array task, otherwise the job will be rejected at submit time.

The following examples illustrate chunking:

  • Array task B is dependent on array task A, which has a step size of 2:
    $ qsub -t 1-6:2 A
    $ qsub -hold_jid_ad A -t 1-6 B

    In the results shown below, it is assumed that array task A is chunking, which means that B.1 and B.2 are dependent on A.1, B.3 and B.4 are dependent on A.3, and so on. If job A.1 didn't render frame 2, then job B.2 would fail:

    | A.1 | --> | B.1 |
    |     | --> | B.2 |
    | A.3 | --> | B.3 |
    |     | --> | B.4 |
    | A.5 | --> | B.5 |
    |     | --> | B.6 |
  • Array task B is dependent on array task A, which has a step size of 1:
    $ qsub -t 1-6 A
    $ qsub -hold_jid_ad A -t 1-6:2 B

    In this example shown below, array task B is chunking, which means that job B.1 is dependent on job A.1 and job A.2, job B.3 is dependent on job A.3 and job A.4, and so on. It is reasonable to always assume that array task B is chunking because otherwise A.2, A.4, and A.6 would be needlessly run and the result would never be used:

    | A.1 | --> | B.1 |
    | A.2 | --> |     |
    | A.3 | --> | B.3 |
    | A.4 | --> |     |
    | A.5 | --> | B.5 |
    | A.6 | --> |     |
  • Array task A has a step size of 3 and array task B has a step size of 2. The tasks are dependent on each other:
    $ qsub -t 1-6:3 A
    $ qsub -hold_jid_ad A -t 1-6:2 B

    In this example shown below, both array task A and array task B are chunking. So, job B.1 is dependent on job A.1, job B.3 is dependent on job A.1 and job A.4, and job B.5 is dependent on job A.4. When the hold array dependency option -hold_jid_ad is specified and the step sizes of the array job and the dependent array job are different, we always assume that both are chunking:

    | A.1 | --> | B.1 |
    |     | --> |     |
    |     | --> | B.3 |
    | A.4 | --> |     |
    |     | --> | B.5 |
    |     | --> |     |

Passing arguments to a shell script submitted through qsub

qsub <script_name> <arg1> <arg2>

The script <script_name> can access these arguments via $1, $2 and so on. The number of arguments is given by $#.
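As a minimal sketch (the script name and arguments are placeholders):

#!/bin/bash
# print_args.sh - echoes the arguments it was submitted with
echo "Number of arguments: $#"
echo "First argument: $1"
echo "Second argument: $2"

which you would submit as: qsub print_args.sh input.txt output.txt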

Monitoring Jobs and the Cluster

Killing a job

qdel <job-id>
to kill or cancel a running or pending job.

qdel -u <username>
to kill all jobs for a particular user (usually yourself, unless you are root).

Check Status of job(s)

To see a list of all your pending and currently-running jobs, use the qstat command:
qstat

To see details about a particular job, use qstat with the -j option. The jobid is the ID number of the job reported by qsub and qstat:
qstat -j jobid
To see why a pending job has not been scheduled yet, look at the output from "qstat -j jobid".

By default qstat only shows your own jobs. To see all the jobs on the cluster:
qstat -u \*

To check the memory used by a job:
qstat -f -j <jobid> | grep vmem # -f means qstat in full mode; gives details

Getting Information About Old Jobs

On SCG3, you must run qacct on scg3-hn01 rather than on the login node (scg3), because the qmaster and the accounting file are now on a separate machine from the login machine.

To see a list of your recently-completed jobs:
qstat -s z

To see detailed information about a job after it has finished or aborted you must use qacct. "qstat -j jobid" will say the job does not exist. The qacct command also takes the -j option:
qacct -j jobid

To see your (or another user's) resource usage history over the past 2 days:
qacct -o user -d 2

To see your group (or another group's) resource usage history over the past 2 days:
qacct -P project -d 2

To check the status of the compute nodes:

To check the number of slots and amount of memory available:
qstat -F slots,h_vmem

To check the number of "tickets" assigned to each job by the scheduler:
qstat -ext -u \*