This site helps answer queries about running your jobs on the HPC system.
Simply stated, a module is a packaging of definitions of environment variables into a script. In general, there is a module defined for each application, and that module defines the environment appropriately for that application. To use the application, you need to load the module first. Refer to HPC Module.
Refer to Lmod Commands.
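For a quick sketch, the most common Lmod commands look like this (the module name Python is only an example; run "module avail" to see what is actually installed):
module avail            # list all modules available on the cluster
module spider python    # search for modules matching a name
module load Python      # load a module into your environment (example name)
module list             # show the modules currently loaded
module unload Python    # remove a module from your environment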
We disallow users from running jobs directly on the login node. Refer to Job Scheduling for more information.
You can do that in interactive mode. Request a compute node with an interactive job command and run your executable from the prompt after the node is assigned to you.
Refer to Interactive Session.
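As a minimal sketch (the resource values and executable name below are placeholders; use the exact options documented on the Interactive Session page):
srun -N 1 -n 2 --time=1:00:00 --pty /bin/bash   # request 1 node with 2 cores for 1 hour and open a shell there
./my_program                                    # placeholder executable, run from the compute-node prompt
exit                                            # release the allocation when done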
In order to run a program on processors in one or more of the compute nodes, and enjoy exclusive use of those processors, you need to submit your program as a job to the batch queue using the "sbatch" command. For more information, visit Job Scheduling.
In general, a SLURM script is similar to bash script that contains SLURM directives (#SBATCH) to request resources for the job, file manipulations commands as a part of the job, and execution parts for running one or more programs that constitute the job. Please refer to HPC Batch & Interactive Job.
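A minimal sketch of such a script (the job name, resource values, module, and program are placeholders):
#!/bin/bash
#SBATCH --job-name=my_job     # SLURM directive: name of the job (placeholder)
#SBATCH -n 4                  # request 4 processing cores
#SBATCH --time=02:00:00       # wall clock limit of 2 hours
#SBATCH --mem=4G              # request 4 GB of memory

module load Python            # set up the environment for the programs in the job (example module)
python my_script.py           # the program(s) that constitute the job (placeholder)
You would then submit it with "sbatch myscript.slurm" (the file name is a placeholder).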
The cluster is not designed for graphics-intensive jobs; however, you can still run graphical jobs. The ITS HPC cluster uses XForwarding over ssh to provide support for software with a graphical user interface (GUI). To use software with a GUI, the machine connecting to the cluster must provide an X server (or allow tunneling capability via ssh), and the user will have to configure an ssh client to allow XForwarding. This step depends on the OS and ssh client in use. Refer to HPC Visual Access for more information.
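For example, from a local machine that already runs an X server, the connection typically looks like:
ssh -X <CaseID>@rider.case.edu   # -X enables X11 forwarding; some clients need -Y (trusted forwarding) instead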
There are several methods to transfer your files from the cluster to your local machine or vice-versa. Please refer to Transferring Files @HPC.
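As one common example (file names and paths are placeholders; the Transferring Files page covers the other options):
scp myfile.txt <CaseID>@rider.case.edu:~/     # local machine -> your home directory on the cluster
scp <CaseID>@rider.case.edu:~/results.txt .   # cluster -> current directory on your local machine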
If your job exits prematurely or you get the error "slurmstepd: Exceeded job memory limit", you need to request enough memory. Each job is assigned 1GB by default, so for memory-intensive jobs you need to request more using the "--mem" flag. You can also estimate the memory requirement using Memory Estimation using Valgrind Utility @HPC.
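For example, to request 8 GB for a job (the amount is only illustrative):
#SBATCH --mem=8G    # request 8 GB of memory instead of the 1GB default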
The default Linux shell is Bash, but you can request a change by contacting us at hpc-supportATcase.edu.
You can do that, but we strongly suggest copying the input files to the scratch directory first. That way, the job runs faster because it reads from the cluster's native storage rather than a mounted location or even your home directory.
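A rough sketch of that workflow inside a job script, assuming the scratch location is exposed as $PFSDIR (as referenced elsewhere on this page) and using placeholder file names:
cp my_input.dat $PFSDIR              # stage the input files to scratch first
cd $PFSDIR                           # run from the cluster-native storage
./my_program my_input.dat            # placeholder executable
cp my_output.dat $SLURM_SUBMIT_DIR   # copy the results back to the submission directory when done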
Almost every SLURM directive is optional. If you do not specify the name of the job, it will use the name of your submission file. If you do not specify the wall clock time, the default is 10 hours. If you do not request specific nodes, the default is any 1 processor within the cluster. The output/error files will be written to your working directory automatically as slurm-<JobID>.out. However, we encourage users to include some of these options to have more control over running jobs.
From the login node, get a status report on all jobs that have been submitted by referring to Monitoring Jobs.
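For instance, a few common monitoring commands (your CaseID and the job ID are placeholders):
squeue -u <CaseID>          # list all of your jobs and their states
squeue -j <JobID>           # check one specific job
scontrol show job <JobID>   # detailed information about a job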
If you use $TMPDIR or $PFSDIR and want to see where your program is running and view the output files in the middle of execution, refer to Partial Output & Temporary Job Files.
Refer to Deleting Jobs.
It is possible to request nodes based on their processor speeds and other characteristics. Note that the "-n" specification really refers to processing cores. For details, please refer to HPC Hardware and Batch and Interactive Job Submission.
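An illustrative sketch (the feature name "hex24" is purely hypothetical; the actual node features are listed on the HPC Hardware page):
#SBATCH -n 12       # 12 processing cores (remember that -n counts cores, not nodes)
#SBATCH -C hex24    # -C / --constraint selects nodes with a given feature (hypothetical name)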
Refer to Interactive Job Submission.
As a member, 320 hours is the limit. As a guest, 36 hours is the limit. If you have a valid reason to exceed this limitation, you can contact us.
It is not possible to extend the walltime at run time by yourself; you need to request the necessary changes by contacting us with your job ID.
Note that when the cluster is busy, i.e. utilization is high, it is normal for your jobs to wait in the queue until resources become available. Still, you may want to try changing the SLURM resource parameters to match the resources available in the cluster. There are also cases where the users in your group have used most of the resources allocated to your group. If your group is interested in becoming a member or increasing its shares, you can contact us. Note that members get priority over guests, and members with more shares have shorter queue times. Also, note that the walltime starts only when your job is in the "running" state.
You have exceeded the maximum resources (wall clock time, number of processors, etc.) available to you as a guest or member. If you want to get more resources, go to the next section "How can I change my cluster status from guest to member?"; otherwise, refer to Account Types. You may also want to refer to Monitoring & Deleting Jobs.
The faculty sponsor of the research group needs to directly contact us for possible options. The difference in the resource availability for guests and members along with membership options are highlighted in this section: Account Types.
We have reserved GPU nodes in a separate queue, so refer to Batch and Interactive Job for instructions. You can also find GPU-capable software (NAMD, Amber, CUDA, PGI, etc.) under Software Guide.
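A rough sketch of a GPU request (the partition name "gpu" and the GPU count are assumptions; follow the Batch and Interactive Job page for the exact options on this cluster):
#SBATCH -p gpu          # submit to the GPU queue (partition name is an assumption)
#SBATCH --gres=gpu:1    # request one GPU on the node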
You may sometimes realize that you have not requested proper resources (especially walltime). Please contact us with your job ID.
Your job is placed in that state when the resource manager is repeatedly unable to satisfy its resource requests. Check your SLURM script to ensure that it does not request resources that are impossible to obtain. Please check Access Policy and monitor your job for violation information at Batch and Interactive Job. You can then delete your job and resubmit it.
If you're unable to determine the reason for the pending state, please contact us for assistance.
If the delete commands at Batch and Interactive Job do not work, email us.
If you are affiliated with more than one group and want to run jobs using an account group that is not your default (primary) group, use the -A option in your SLURM script as shown:
#SBATCH -A <groupCaseID>
For details, visit Batch & Interactive Job submission.
If you are affiliated with more than one group and want to switch to a new group, use the newgrp command:
newgrp <new-group-name>
Note: Check your current group (before and after the change) using the command:
id -g -n
Please check the exit code status. You need to request sufficient resources for your job. In this particular case, it is most often due to insufficient memory. Request the required amount of physical memory following Batch and Interactive Job. See HPC Resource View to identify the nodes that satisfy the memory requirement.
For memory, you may sometimes get messages like "insufficient memory" or "slurmstepd: error: task/cgroup: unable to add task[pid=10805] to memory cg '(null)'", etc. Please request enough memory using the --mem flag.
Not all packages that are available on the login nodes are available on the compute nodes. You can ssh to rider and run the command from there. If you need a package to be installed on the compute nodes, please refer to the HPC Software Installation Guide.
Please check the exit code status. If you run into Segmentation Fault, please refer to "Debugging Segmentation Faults @ HPC".
You can check your group allocation and the resources used by other members in your group by running the information command "i".
Job restarting usually means the job was assigned to a node, but the node could not run the job, so the job was restarted (reassigned) on a different node. This usually happens when the node the job was assigned to becomes unresponsive.
srun: error: <node>: task 0: Exited with exit code 2
Make sure that your SLURM options are formatted correctly; common issues include missing hyphens or an extra space between the hyphen and the flag (e.g. "- n" instead of "-n").
We do not promote cron jobs, as the risk from a runaway script can be substantial. However, you can use cron features to schedule recurring jobs on the HPC. You can also use the cron job template at /usr/local/doc/CRON/cron.sbatch.
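If you do use cron, a sketch of a crontab entry that submits a job script every night at 2 AM (both paths are placeholders; cron generally needs the full path to sbatch):
0 2 * * * /usr/bin/sbatch /home/<CaseID>/jobs/nightly.sbatch   # minute hour day-of-month month day-of-week command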
Yes, this is possible through ssh port forwarding. For example, the SSH port-forwarding command "ssh -N -L 9999:<ip-address>:9999 <CaseID>@rider.case.edu" opens a connection to the rider.case.edu jump server and forwards any connection to port 9999 on the local machine to port 9999 on <ip-address>. For instance, you can take a look at our Jupyter page on opening a tunnel for Jupyter web access.
The SLURM command sinfo shows the node state "completing". The reason is explained at https://slurm.schedmd.com/faq.html#comp. Please contact hpc-supportATcaseDOTedu.
This can happen if your ~/.bashrc file has PATH/LIBRARY environment variables defined that conflict with the existing installation. It most often occurs after installing software (e.g. Conda) that modifies your shell startup files.
You can recover your .bashrc and .bash_profile files by copying the defaults from /etc/skel:
cp /etc/skel/.bashrc ~/.bashrc
cp /etc/skel/.bash_profile ~/.bash_profile
The job steps won't exit when the X11 setup fails, leaving the job in the CG (completing) state.
We don't recommend submitting jobs from a compute node after requesting one. If your job workflow requires you to get a compute node and submit an sbatch script to run files on the compute nodes separately, then instead of creating the allocation with srun, start the interactive step inside an salloc allocation as shown in the example below:
salloc --x11 -N 16 -n 40 --time=300:00:00 srun -N 1 -n 1 --pty --overlap /bin/bash