HPC Resource View
The resources are summarized at the link below, with the latest usage statistics refreshed on a 20-minute interval.
Important Notes:
Please take the floor value when converting memory from MB to GB. The CPUs and memory in the table are per node in that node range.
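For example (the numbers here are only an illustration), a node listed with 24150 MB corresponds to 24150 / 1024 ≈ 23.58 GB, so request at most --mem=23gb on that node type rather than rounding up to 24gb.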
These resource views indicate the node ranges that have specific resources. Use the 'sinfo -O nodehost,features | grep <feature>' command in a shell to determine which specific nodes on the cluster provide a given feature. Use 'si' to list all nodes in the cluster with their features and current status.
The maximum wall time is 320 hours.
The gpufermi feature is no longer available; please use gpuk40 or gpup100 instead.
The GPU2080 nodes gput[045-052], GPU2V100 nodes gput[057-062], and GPU4V100 nodes gput[053-056] have SSD drives. Please use /tmp ($TMPDIR) as scratch space to take advantage of the SSDs.
RDS is mounted only on a few selected nodes, compt[317-326]. Use "-C rds" to access the mounted files. To make use of all compute nodes, follow the steps below (you can include them in your job script; a sketch of a complete batch script follows these steps):
# Create temporary scratch space
mkdir /scratch/users/<CaseID>
# Copy data from RDS to /scratch
ssh dtn2 "cp -r /mnt/rds/<rds name>/<folder1> /scratch/users/<CaseID>"
# Copy data from /scratch back to RDS
ssh dtn2 "cp -r /scratch/users/<CaseID>/<folder1> /mnt/rds/<rds name>/."
To get the Feature or constraint information, use
scontrol show node | grep Features
output:
...
CPUAlloc=2 CPUErr=0 CPUTot=12 CPULoad=2.04 Features=hex24gb
...
CPUAlloc=2 CPUErr=0 CPUTot=12 CPULoad=2.00 Features=(null)
CPUAlloc=10 CPUErr=0 CPUTot=12 CPULoad=5.51 Features=hex48gb
...
CPUAlloc=3 CPUErr=0 CPUTot=16 CPULoad=3.00 Features=octa64gb
...
CPUAlloc=24 CPUErr=0 CPUTot=24 CPULoad=8.20 Features=dodeca96gb
It indicates that there are four sets of compute nodes with features: hex24gb, hex48gb, octa64gb, and dodeca96gb.
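To see how many nodes advertise each feature, the same output can be summarized with standard shell tools (a sketch; the exact field name may differ between Slurm versions):
scontrol show node | grep -o 'Features=[^ ]*' | sort | uniq -c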
Now, if you want to reserve a whole 12-processor compute node with 22gb of memory, you can use the hex24gb feature (hex => 6 cores; 12 processors):
srun -n 12 -C hex24gb --mem=22gb --pty bash
For a 16-processor compute node with 61gb of memory, use octa64gb (octa => 8 cores; 16 processors):
srun -n 16 -C octa64gb --mem=61gb --pty bash
For 20 processors with 92gb of memory on a dodeca96gb node (dodeca => 12 cores; 24 processors), use:
srun -n 20 -C dodeca96gb --mem=92gb --pty bash
You need to specify the queue type (e.g. gpu, smp) to use the resources available on the nodes in those queues. For example, to request 32 processors in a single node, you need to use the smp nodes in the smp queue (-p smp).
srun -p smp -n 32 --mem=64gb --pty bash
Example:
SMP:
Request an smp node with 64gb of memory
srun -p smp -n 8 --mem=64gb --pty bash
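The same request can be made non-interactively; a minimal batch-script sketch (the program name is a placeholder):
#!/bin/bash
#SBATCH -p smp
#SBATCH -n 8
#SBATCH --mem=64gb
# replace with your own program
./my_program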
Request a node exclusively (in the default batch queue); --mem=0 needs to be included to request all the memory available on the node.
srun --exclusive --mem=0 --pty bash
This will reserve all the processors on the node. If you need more memory or GPU resources, you need to request them explicitly.
Memory per CPU:
Memory per CPU is especially useful with MPI jobs. Let's request 4gb per CPU for an MPI job using 80 processors.
srun -n 80 --mem-per-cpu=4gb --pty bash
Now, check the memory:
[abc123@smp05t ~]$ ulimit -a
output:
...
max memory size (kbytes, -m) 67108864
...
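In a batch script, the equivalent per-CPU request for an MPI job might look like the following sketch (the executable name is a placeholder):
#!/bin/bash
#SBATCH -n 80
#SBATCH --mem-per-cpu=4gb
# launch the MPI ranks through Slurm
srun ./my_mpi_program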
For more information, please visit the HPC Guide to Interactive and Batch Job Submission, which contains a section on memory-intensive jobs.
GPU:
Most of the GPU nodes have 2 GPU cards per server. You can get information about the GPU resources by using:
scontrol show node <gpu node> # e.g. gput045
output:
ActiveFeatures=gpu2080
Gres=gpu:2
Here, the GPU node gput045 can be requested using the feature gpu2080 (e.g. -p gpu -C gpu2080). Also, the GPU resource (Gres) field shows 2 GPUs in gput045.
Request a gpuk40 node with 1 GPU
srun -p gpu -N 1 -n 10 -C gpuk40 --gres=gpu:1 --mem=4gb --pty bash
Note that even though you are running a serial job, you need to request 10 processors, which are mapped to 1 GPU. Also, 4gb of memory is requested using the --mem flag.
Reserve a gpup100 node exclusively with 2 GPUs
srun -p gpu -C gpup100 --gres=gpu:2 --exclusive --mem=0 --pty bash
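Once the session starts on the GPU node, you can confirm which GPUs were assigned (Slurm normally exports CUDA_VISIBLE_DEVICES for the allocated cards, and nvidia-smi lists the visible devices):
nvidia-smi
echo $CUDA_VISIBLE_DEVICES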
All the nodes have Intel Xeon processors (x86_64) and run the Red Hat operating system. To find the version, use the command:
cat /etc/redhat-release
For detailed information on HPC servers and storage, please visit Servers & Storage.
Request GPU nodes to use SSD space as /tmp space:
srun -p gpu -C 'gpu2080|gpu2v100|gpu4v100' --gres=gpu:2 --pty bash
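Inside such a job, a common pattern is to stage data onto the SSD-backed /tmp and copy the results back before the job ends; a sketch with placeholder paths:
# copy input data to the node-local SSD scratch space
cp -r /scratch/users/<CaseID>/<input> $TMPDIR
cd $TMPDIR
# ... run the GPU application here ...
# copy results back; node-local scratch is typically cleared after the job
cp -r $TMPDIR/<results> /scratch/users/<CaseID>/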
Request a specific node or a list of nodes
Get the nodes in the node list (e.g. compt162-compt166)
srun --nodelist=compt[162-166] --pty bash
This will assign you one processor on each node in the list, as shown below.
<jobID> batch bash <User> R 0:09 9:59:51 5 5 1922 compt[162-166]
You can use the -n option to select the number of processors on each node. For requesting nodes not in a range, use a comma-separated list (e.g. --nodelist=compt162,compt166). You can also do it via an input file (e.g. --nodelist ./node-file), as illustrated below.
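As an illustration of the file form (the file name and its contents are hypothetical), srun treats a --nodelist argument containing a '/' character as a file of node names:
# one node name per line
printf "compt162\ncompt166\n" > ./node-file
srun --nodelist=./node-file --pty bash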
Exclude a specific node or a list of nodes
Exclude a few GPU nodes that have an older driver version
srun -p gpu -C "gpuk40|gpup100" --gres=gpu:1 --exclude=gput[026-028] --pty bash
Extended Instruction Sets and the Corresponding Nodes
Submitting a batch job using 16-avx or 16-avx2 requires excluding the nodes that do not support the instruction set. Using the following will exclude the sse-enabled nodes, allowing the avx-enabled nodes to accept the job:
#SBATCH --exclude ./exclude-sse.list
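One way to build such an exclude list, assuming the sse-only nodes carry an sse-related tag in their feature list (this is a sketch, not a site-provided script):
sinfo -N -h -o "%N %f" | awk '/sse/ {print $1}' | sort -u | paste -sd, - > ./exclude-sse.list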
The k40 and p100 GPU nodes also support avx instructions.
To request a feature, use the flag "-C <feature>" (e.g. -C gpuk40), either as part of a Slurm script or as a flag to the srun command when establishing an interactive session on a compute node. A minimal batch-script sketch follows.
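For example, a batch-script header mirroring the interactive gpuk40 request above (the program name is a placeholder):
#!/bin/bash
#SBATCH -p gpu
#SBATCH -C gpuk40
#SBATCH --gres=gpu:1
#SBATCH -N 1
#SBATCH -n 10
#SBATCH --mem=4gb
./my_gpu_program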
To confirm the extension support, please check the flags field by executing the following command from that node:
cat /proc/cpuinfo
output:
...
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc