This site helps answer queries about running your jobs on the HPC system.
Simply stated, a module is a packaging of definitions of environment variables into a script. In general, there is a module defined for each application, and that module defines the environment appropriately for that application. To use the application, you need to load the module first. Refer to HPC Module.
Refer to Lmod Commands.
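For a quick sketch, the most common Lmod commands look like this (the module name Python is only an example; run "module avail" to see what is actually installed):
module avail            # list all modules available on the cluster
module spider python    # search for modules matching a name
module load Python      # load a module into your environment (example name)
module list             # show the modules currently loaded
module unload Python    # remove a module from your environment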
We disallow users from running jobs directly on the login node. Refer to Job Scheduling for more information.
You can do that in interactive mode. Request a compute node with an interactive job command and run your executable from the prompt after the node is assigned to you.
Refer to Interactive Session.
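As a minimal sketch (the resource values and executable name below are placeholders; use the exact options documented on the Interactive Session page):
srun -N 1 -n 2 --time=1:00:00 --pty /bin/bash   # request 1 node with 2 cores for 1 hour and open a shell there
./my_program                                    # placeholder executable, run from the compute-node prompt
exit                                            # release the allocation when done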
In order to run a program on processors in one or more of the compute nodes, and enjoy exclusive use of those processors, you need to submit your program as a job to the batch queue using the "sbatch" command. For more information, visit Job Scheduling.
In general, a SLURM script is similar to bash script that contains SLURM directives (#SBATCH) to request resources for the job, file manipulations commands as a part of the job, and execution parts for running one or more programs that constitute the job. Please refer to HPC Batch & Interactive Job.
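A minimal sketch of such a script (the job name, resource values, module, and program are placeholders):
#!/bin/bash
#SBATCH --job-name=my_job     # SLURM directive: name of the job (placeholder)
#SBATCH -n 4                  # request 4 processing cores
#SBATCH --time=02:00:00       # wall clock limit of 2 hours
#SBATCH --mem=4G              # request 4 GB of memory

module load Python            # set up the environment for the programs in the job (example module)
python my_script.py           # the program(s) that constitute the job (placeholder)
You would then submit it with "sbatch myscript.slurm" (the file name is a placeholder).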
The cluster is not designed for graphics-intensive jobs; however, you can still run graphical jobs. The ITS HPC cluster uses XForwarding over ssh to provide support for software with a graphical user interface (GUI). To use software with a GUI, the machine connecting to the cluster must provide an X server (or allow tunneling capability via ssh), and the user will have to configure an ssh client to allow XForwarding. This step depends on the OS and ssh client in use. Refer to HPC Visual Access for more information.
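For example, from a local machine that already runs an X server, the connection typically looks like:
ssh -X <CaseID>@rider.case.edu   # -X enables X11 forwarding; some clients need -Y (trusted forwarding) instead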
There are several methods to transfer your files from the cluster to your local machine or vice-versa. Please refer to Transferring Files @HPC.
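As one common example (file names and paths are placeholders; the Transferring Files page covers the other options):
scp myfile.txt <CaseID>@rider.case.edu:~/     # local machine -> your home directory on the cluster
scp <CaseID>@rider.case.edu:~/results.txt .   # cluster -> current directory on your local machine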
If your job exits prematurely or you get the error "slurmstepd: Exceeded job memory limit", you need to request enough memory. Each job is assigned 1GB by default, so for memory-intensive jobs you need to request more using the "--mem" flag. You can also estimate the memory requirement using Memory Estimation using Valgrind Utility @HPC.
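For example, to request 8 GB for a job (the amount is only illustrative):
#SBATCH --mem=8G    # request 8 GB of memory instead of the 1GB default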
The default Linux shell is Bash, but you can request a change by contacting us at hpc-supportATcase.edu.
You can do that, but we strongly suggest copying the input files to the scratch directory first. That way, the job runs faster because it reads from the cluster's native storage rather than a mounted location or even your home directory.
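A rough sketch of that workflow inside a job script, assuming the scratch location is exposed as $PFSDIR (as referenced elsewhere on this page) and using placeholder file names:
cp my_input.dat $PFSDIR              # stage the input files to scratch first
cd $PFSDIR                           # run from the cluster-native storage
./my_program my_input.dat            # placeholder executable
cp my_output.dat $SLURM_SUBMIT_DIR   # copy the results back to the submission directory when done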
Almost every SLURM directive is optional. If you do not specify the name of the job, it will use the name of your submission file. If you do not specify the wall clock time, the default is 10 hours. If you do not request specific nodes, the default is any 1 processor within the cluster. The output/error files will be written to your working directory automatically as slurm-<JobID>.out. However, we encourage users to include some of these options to have more control over running jobs.
From the login node, get a status report on all jobs that have been submitted by referring to Monitoring Jobs.
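For instance, a few common monitoring commands (your CaseID and the job ID are placeholders):
squeue -u <CaseID>          # list all of your jobs and their states
squeue -j <JobID>           # check one specific job
scontrol show job <JobID>   # detailed information about a job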
If you use $TMPDIR or $PFSDIR and want to see where your program is running and view the output files in the middle of execution, refer to Partial Output & Temporary Job Files.
Refer to Deleting Jobs.
It is possible to request nodes based on their processor speeds and other characteristics. Note that the "-n" specification really refers to processing cores. For details, please refer to HPC Hardware and Batch and Interactive Job Submission.
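An illustrative sketch (the feature name "hex24" is purely hypothetical; the actual node features are listed on the HPC Hardware page):
#SBATCH -n 12       # 12 processing cores (remember that -n counts cores, not nodes)
#SBATCH -C hex24    # -C / --constraint selects nodes with a given feature (hypothetical name)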
Refer to Interactive Job Submission.
As a member, 320 hours is the limit. As a guest, 36 hours is the limit. If you have a valid reason to exceed this limitation, you can contact us.
It is not possible to extend the walltime at run time by yourself; you need to request the necessary changes by contacting us with your job ID.
Note that when the cluster is busy, i.e. utilization is high, it is normal for your jobs to wait in the queue until resources become available. Still, you may want to try changing the SLURM resource parameters to match the resources available in the cluster. There are also cases where the users in your group have used most of the resources allocated to your group. If your group is interested in becoming a member or increasing its shares, you can contact us. Note that members get priority over guests, and members with more shares have shorter queue times. Also, note that the walltime starts only when your job is in the "running" state.
You have exceeded the maximum resources (wall clock time, number of processors, etc.) available to you as a guest or member. If you want to get more resources, go to the next section "How can I change my cluster status from guest to member?"; otherwise, refer to Account Types. You may also want to refer to Monitoring & Deleting Jobs.
The faculty sponsor of the research group needs to directly contact us for possible options. The difference in the resource availability for guests and members along with membership options are highlighted in this section: Account Types.
We have reserved GPU nodes in a separate queue, so refer to Batch and Interactive Job for instructions. You can also find GPU-capable software (NAMD, Amber, CUDA, PGI, etc.) under Software Guide.
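A rough sketch of a GPU request (the partition name "gpu" and the GPU count are assumptions; follow the Batch and Interactive Job page for the exact options on this cluster):
#SBATCH -p gpu          # submit to the GPU queue (partition name is an assumption)
#SBATCH --gres=gpu:1    # request one GPU on the node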
You may sometimes realize that you have not requested proper resources (especially walltime). Please contact us with your job ID.
Your job is placed in that state when the resource manager is repeatedly unable to satisfy its resource requests. Check your SLURM script to ensure that it does not request resources that are impossible to obtain. Please check Access Policy and monitor your job for violation information at Batch and Interactive Job. You can then delete your job and resubmit it.
If you're unable to determine the reason for the pending state, please contact us for assistance.
If the delete commands at Batch and Interactive Job do not work, email us.
If you are affiliated with more than one group and want to run jobs using an account group that is not your default (primary) group, use the -A option in your SLURM script as shown:
#SBATCH -A <groupCaseID>
For details, visit Batch & Interactive Job submission.
If you are affiliated with more than one group and want to switch to a new group, use the newgrp command:
newgrp <new-group-name>
Note: Check your current group (before and after the change) using the command:
id -g -n
Please check the exit code status. You need to request sufficient resources for your job. In this particular case, it is most often due to insufficient memory. Request the required amount of physical memory following Batch and Interactive Job. See HPC Resource View to identify the nodes that satisfy the memory requirement.
For memory, you may sometimes get messages like "insufficient memory" or "slurmstepd: error: task/cgroup: unable to add task[pid=10805] to memory cg '(null)'", etc. Please request enough memory using the --mem flag.
Not all packages that are available on the login nodes are available on the compute nodes. You can ssh to rider and run the command from there. If you need a package to be installed on the compute nodes, please refer to the HPC Software Installation Guide.
Please check the exit code status. If you run into Segmentation Fault, please refer to "Debugging Segmentation Faults @ HPC".
You can check your group allocation and the resources used by other members in your group by running the information command "i".
Job restarting usually means the job was assigned to a node, but the node could not run the job, so the job was restarted (reassigned) on a different node. This usually happens when the node the job was assigned to becomes unresponsive.
srun: error: <node>: task 0: Exited with exit code 2
Make sure that your SLURM options are formatted correctly; common issues include missing hyphens or an extra space between the hyphen and the flag (e.g. "- n" instead of "-n").
We do not promote cron jobs, as the risk from a runaway script can be substantial. However, you can use cron features to schedule recurring jobs on the HPC. You can also use the cron job template at /usr/local/doc/CRON/cron.sbatch.
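If you do use cron, a sketch of a crontab entry that submits a job script every night at 2 AM (both paths are placeholders; cron generally needs the full path to sbatch):
0 2 * * * /usr/bin/sbatch /home/<CaseID>/jobs/nightly.sbatch   # minute hour day-of-month month day-of-week command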
Yes, this is possible through ssh port forwarding. For example, the SSH port-forwarding command "ssh -N -L 9999:<ip-address>:9999 <CaseID>@rider.case.edu" opens a connection to the rider.case.edu jump server and forwards any connection to port 9999 on the local machine to port 9999 on <ip-address>. For instance, you can take a look at our Jupyter page on opening a tunnel for Jupyter web access.
The SLURM command sinfo shows the node state "completing". The reason is explained at https://slurm.schedmd.com/faq.html#comp. Please contact hpc-supportATcaseDOTedu.
This can happen if your ~/.bashrc file has PATH/LIBRARY environment variables defined that conflict with the existing installation. It most often occurs after installing software (e.g. Conda) that modifies your shell startup files.
You can recover your .bashrc and .bash_profile files by copying the defaults from /etc/skel:
cp /etc/skel/.bashrc ~/.bashrc
cp /etc/skel/.bash_profile ~/.bash_profile
The job steps won't exit when the X11 setup fails, leaving the job in the CG (completing) state.
We don't recommend submitting jobs from a compute node after requesting one. If your job workflow requires you to get a compute node and submit an sbatch script to run files on the compute nodes separately, then instead of creating the allocation with srun, start the interactive step inside an salloc allocation as shown in the example below:
salloc --x11 -N 16 -n 40 --time=300:00:00 srun -N 1 -n 1 --pty --overlap /bin/bash