
Sun Grid Engine

Submitting jobs to SCG3

On SCG3: By default, jobs have a memory limit of 3.7GB (per slot) and jobs in the standard queue have a runtime limit of 6 hours (wallclock, not CPU time).

Typical qsub command

NOTE: qsub does not accept commands directly (except from STDIN), so you should always submit a shell script. Also, $PATH is not exported to the node running the job, so use ABSOLUTE PATHS to any non-standard utilities you are using, such as R, Rscript, or MATLAB. Make no assumptions about what will be in the path.

qsub -q [queue] -w e -V -N [job_name] -l h_vmem=[memory] -l h_rt=[time] -l s_rt=[time] -pe shm [n_processors] -o [outputlogfile] -e [errorlogfile] [pathtoScript] [arg1] [arg2]
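For example, a submission might look like this (the queue, job name, resource values, and paths are placeholders to adapt to your own job):

qsub -q standard -w e -V -N myjob -l h_vmem=4G -l h_rt=04:00:00 -l s_rt=03:55:00 -pe shm 2 -o myjob.out -e myjob.err /home/myuser/run_analysis.sh input.txt output.txt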

In a shell script, you can set these options on lines that begin with #$, or pass them along with the qsub command (see the example submit script below):

-q <queue> --- set the queue
-V --- pass all environment variables to the job
-v var[=value] --- pass the specific environment variable 'var' to the job
-b y --- allow the command to be a binary file instead of a script
-w e --- verify options and abort if there is an error
-N <jobname> --- name of the job
-l h_vmem=size --- specify the amount of memory required (e.g. 3G or 3500M) (NOTE: this is memory per processor slot, so if you ask for 2 processors the total memory will be 2 x the h_vmem value)
-l h_rt=hh:mm:ss --- specify the maximum (hard) run time (hours, minutes and seconds)
-l s_rt=hh:mm:ss --- specify the soft run time limit (hours, minutes and seconds) - remember to set both s_rt and h_rt
-pe shm <n_processors> --- run a parallel job using pthreads or another shared-memory API
-cwd --- run the job in the current working directory
-wd <dir> --- set the working directory for this job
-j [y/n] --- whether to merge the output and error log files
-o <output_logfile> --- name of the output log file
-e <error_logfile> --- name of the error log file
-m ea --- send email when the job ends or aborts
-P <projectName> --- set the job's project
-M <emailaddress> --- email address to send email to
-t <start>-<end>:<incr> --- submit a job array with start index <start>, stop index <end>, and step size <incr>
-hold_jid <comma-separated list of job IDs; can also be a job ID pattern such as 2722*> --- start the current job/job array only after all jobs in the list have completed
-hold_jid_ad <job array ID, pattern or name> --- start each task in the current job array only after the corresponding task in the given job array has completed

The index numbers will be exported to the job tasks via the environment variable $SGE_TASK_ID. The option arguments <start>, <end> and <incr> will be available through the environment variables $SGE_TASK_FIRST, $SGE_TASK_LAST and $SGE_TASK_STEPSIZE.
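For example, a submit script that sets its options on #$ lines might look like the following sketch (the job name, resource values, and script/log paths are placeholders):

#!/bin/bash
#$ -N myjob
#$ -w e
#$ -V
#$ -l h_vmem=4G
#$ -l h_rt=04:00:00
#$ -l s_rt=03:55:00
#$ -pe shm 2
#$ -o myjob.out
#$ -e myjob.err
#$ -t 1-10

# Each array task processes one input file, selected by its $SGE_TASK_ID (1..10 here)
/usr/bin/Rscript /absolute/path/to/analyze.R input.${SGE_TASK_ID}.txt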

Other Options
  [-a date_time]                           request a start time
  [-ac context_list]                       add context variable(s)
  [-ar ar_id]                              bind job to advance reservation
  [-A account_string]                      account string in accounting record
  [-binding [env|pe|set] exp|lin|str]      binds job to processor cores
  [-c n s m x]                             define type of checkpointing for job
             n           no checkpoint is performed.
             s           checkpoint when batch server is shut down.
             m           checkpoint at minimum CPU interval.
             x           checkpoint when job gets suspended.
             <interval>  checkpoint in the specified time interval.
  [-ckpt ckpt-name]                        request checkpoint method
  [-clear]                                 skip previous definitions for job
  [-C directive_prefix]                    define command prefix for job script
  [-dc simple_context_list]                delete context variable(s)
  [-dl date_time]                          request a deadline initiation time
  [-h]                                     place user hold on job
  [-hard]                                  consider following requests "hard"
  [-help]                                  print this help
  [-i file_list]                           specify standard input stream file(s)
  [-js job_share]                          share tree or functional job share
  [-jsv jsv_url]                           job submission verification script to be used
  [-masterq wc_queue_list]                 bind master task to queue(s)
  [-notify]                                notify job before killing/suspending it
  [-now y[es]|n[o]]                        start job immediately or not at all
  [-p priority]                            define job's relative priority
  [-R y[es]|n[o]]                          reservation desired
  [-r y[es]|n[o]]                          define job as (not) restartable
  [-sc context_list]                       set job context (replaces old context)
  [-shell y[es]|n[o]]                      start command with or without wrapping <loginshell> -c
  [-soft]                                  consider following requests as soft
  [-sync y[es]|n[o]]                       wait for job to end and return exit code
  [-S path_list]                           command interpreter to be used
  [-verify]                                do not submit just verify
  [-w e|w|n|v|p]                           verify mode (error|warning|none|just verify|poke) for jobs
  [-@ file]                                read commandline input from file

Job queues

There are a number of job scheduling queues, each configured with different resource restrictions. In many cases the job scheduler will automatically select the appropriate queue based on the resources required by your job, but you can also specifically request a queue using qsub's "-q" option.
  • The test.q queue has a runtime limit of one hour, and you can only run one job (one slot) at a time. However, there is a dedicated node for these jobs, so generally they will be dispatched quickly.
    For your job to go into the test queue, you must specify "-l testq=1".
    e.g. qsub -l testq=1 test_script.sh
  • The standard queue has a runtime limit of six hours.
  • The extended queue has a runtime limit of seven days. Jobs in the extended queue may have to wait longer to be scheduled.
  • The large queue is a special queue for large-memory jobs (see Large-Memory Jobs). 
    You need to specify '-P large_mem' and also h_vmem > 50G or so.
    E.g. qsub -l h_vmem=10G -pe shm 16 -P large_mem test_script.sh 
    That will request 10G of h_vmem per slot, so 160G total. The large queue does not currently have a time limit, and the users of greenie tend to coordinate with each other about larger/longer jobs.
  • The seq_pipeline queue is a special queue for jobs related to the Center's sequencing pipeline.
You can force a job to run on a particular node by:
  • specifying both a queue name and a node name: qsub -q standard@scg1-2-10 myscript
  • specifying a node name via the "-l hostname=" option: qsub -l hostname=scg1-2-10 myscript

For interactive jobs

If you want your job to continue running even after you log out, use the screen command to create a new screen:
screen -S <screen_name>

To temporarily detach from the screen and return to the main terminal, use Ctrl-a followed by d.

To obtain the list of open screen IDs, use
screen -list

To return to the screen, use
screen -r <screen_id>
(Note that the <screen_id> is different from the <screen_name>.)

To get into an interactive shell, type qlogin with any resource options you need:

qlogin <resource_options>
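For example (the resource value is illustrative), you can request memory for the interactive session just as with qsub:

qlogin -l h_vmem=8G

Running qlogin inside a screen session, as described above, lets the interactive session survive a logout.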

Temporary files

On SCG3, the local nodes often do not have a large amount of temporary space, so you should make sure your code uses a temporary directory with sufficient disk space. SCG3 has 100TB of scratch space at /srv/gsfs0/scratch that you can use for temporary files.

You can usually set the TMP environment variable in your .bashrc or in your submit script:
mkdir -p /srv/gsfs0/scratch/<yourusername>
export TMP=/srv/gsfs0/scratch/<yourusername>
If your job creates temporary files, store them on the local disk of the cluster node. You can read the $TMPDIR environment variable to get the proper location for the temporary files (although many programs do this automatically). If a program accesses its input data in a non-sequential order, it will sometimes run faster if you copy the input data to $TMPDIR, run the program with its results stored in $TMPDIR, and then copy the results back to the shared storage when it is done. However, be careful not to fill up the local disk; currently most nodes have about 200GB of local space. SGE automatically removes the contents of $TMPDIR when your job is done.

You can create a randomly named directory in $TMPDIR using the following lines in your submit script:
TMP_DIR=$(mktemp -d -p "${TMPDIR}")
<run everything>
rm -rf "${TMP_DIR}"
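Putting this together, here is a sketch of the copy-in/copy-out pattern described above (the program name and data paths are placeholders):

#!/bin/bash
# Stage the input onto the node's fast local disk
cp /srv/gsfs0/projects/mydata/input.dat ${TMPDIR}/
# Run the program against the local copy, writing results locally
/absolute/path/to/myprogram ${TMPDIR}/input.dat -o ${TMPDIR}/output.dat
# Copy the results back to shared storage; SGE removes the contents of $TMPDIR afterwards
cp ${TMPDIR}/output.dat /srv/gsfs0/projects/mydata/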

Configuring array task dependencies

(Taken from http://wikis.sun.com/display/gridengine62u2/How+to+Configure+Array+Task+Dependencies+From+the+Command+Line)
Examples – Using Job Dependencies Versus Array Task Dependencies to Complete Array Jobs

The following example illustrates the difference between the job dependency facility and the task array dependency facility:

  • In the following example, array task B is dependent on array task A:
    $ qsub -t 1-3 A
    $ qsub -hold_jid A -t 1-3 B

    All the sub-tasks in job B will wait for sub-tasks 1, 2 and 3 of A to finish before starting. The tasks will be executed in the following approximate order: A.1, A.2, A.3, B.1, B.2, B.3, as shown below:

    | A.1 |     | B.1 |
    | A.2 | --> | B.2 |
    | A.3 |     | B.3 |
  • In the following example, each sub-task in array job B is dependent on each corresponding sub-task in job A in a one-to-one mapping:
    $ qsub -t 1-3 A
    $ qsub -hold_jid_ad A -t 1-3 B

    Sub-task B.1 will only start when A.1 completes, B.2 will only start once A.2 completes, and so on. On a single-machine render farm, the tasks could thus be executed in the following approximate order: A.1, B.1, A.2, B.2, A.3, B.3, as shown below:

    | A.1 | --> | B.1 |
    | A.2 | --> | B.2 |
    | A.3 | --> | B.3 |

    This option can only be specified when submitting an array job that depends on another array job with the same number of sub-tasks.

Examples – Using Array Task Dependencies to Chunk Tasks

When using 3D rendering applications, it is often more efficient to render several frames at once on the same CPU instead of distributing the frames across several machines. We refer to the generation of several frames at once as chunking.

When using the task dependency facility, the array task must have the same range of sub-tasks as its dependent array task, otherwise the job will be rejected at submit time.

The following examples illustrate chunking:

  • Array task B is dependent on array task A, which has a step size of 2:
    $ qsub -t 1-6:2 A
    $ qsub -hold_jid_ad A -t 1-6 B

    In the results shown below, it is assumed that array task A is chunking, which means that B.1 and B.2 are dependent on A.1, B.3 and B.4 are dependent on A.3, and so on. If job A.1 didn't render frame 2, then job B.2 would fail:

    | A.1 | --> | B.1 |
    |     | --> | B.2 |
    | A.3 | --> | B.3 |
    |     | --> | B.4 |
    | A.5 | --> | B.5 |
    |     | --> | B.6 |
  • Array task B is dependent on array task A, which has a step size of 1:
    $ qsub -t 1-6 A
    $ qsub -hold_jid_ad A -t 1-6:2 B

    In this example shown below, array task B is chunking, which means that job B.1 is dependent on job A.1 and job A.2, job B.3 is dependent on job A.3 and job A.4, and so on. It is reasonable to always assume that array task B is chunking because otherwise A.2, A.4, and A.6 would be needlessly run and the result would never be used:

    | A.1 | --> | B.1 |
    | A.2 | --> |     |
    | A.3 | --> | B.3 |
    | A.4 | --> |     |
    | A.5 | --> | B.5 |
    | A.6 | --> |     |
  • Array task A has a step size of 3 and array task B has a step size of 2. The tasks are dependent on each other:
    $ qsub -t 1-6:3 A
    $ qsub -hold_jid_ad A -t 1-6:2 B

    In this example shown below, both array task A and array task B are chunking. So, job B.1 is dependent on job A.1, job B.3 is dependent on job A.1 and job A.4, and job B.5 is dependent on job A.4. When the hold array dependency option -hold_jid_ad is specified and the step sizes of the array job and the dependent array job are different, we always assume that both are chunking:

    | A.1 | --> | B.1 |
    |     | --> |     |
    |     | --> | B.3 |
    | A.4 | --> |     |
    |     | --> | B.5 |
    |     | --> |     |

Passing arguments to a shell script submitted through qsub

qsub <script_name> <arg1> <arg2>

The script <script_name> can access these arguments via $1, $2 and so on. The number of arguments is given by $#.
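As a minimal sketch (the script name and arguments are placeholders):

#!/bin/bash
# print_args.sh - echoes the arguments it was submitted with
echo "Number of arguments: $#"
echo "First argument: $1"
echo "Second argument: $2"

which you would submit as: qsub print_args.sh input.txt output.txt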

Monitoring Jobs and the Cluster

Killing a job

qdel <job-id>
to kill or cancel a running or pending job.

qdel -u <username>
to kill all jobs for a particular user (usually yourself, unless you are root).

Check Status of job(s)

To see a list of all your pending and currently-running jobs, use the qstat command:
qstat

To see details about a particular job, use qstat with the -j option. The jobid is the ID number of the job reported by qsub and qstat:
qstat -j jobid
To see why a pending job has not been scheduled yet, look at the output from "qstat -j jobid".

By default qstat only shows your own jobs. To see all the jobs on the cluster:
qstat -u \*

To check the memory used by a job:
qstat -f -j <jobid> | grep vmem # -f means qstat in full mode; gives details

Getting Information About Old Jobs

On SCG3, you must run qacct on scg3-hn01 rather than on the login node (scg3), because the qmaster and the accounting file are now on a separate machine from the login machine.

To see a list of your recently-completed jobs:
qstat -s z

To see detailed information about a job after it has finished or aborted you must use qacct. "qstat -j jobid" will say the job does not exist. The qacct command also takes the -j option:
qacct -j jobid

To see your (or another user's) resource usage history over the past 2 days:
qacct -o user -d 2

To see your group (or another group's) resource usage history over the past 2 days:
qacct -P project -d 2

To check the status of the compute nodes:

To check the number of slots and amount of memory available:
qstat -F slots,h_vmem

To check the number of "tickets" assigned to each job by the scheduler:
qstat -ext -u \*