Cluster partitions

Partitions

There are many ways of dividing up and managing a cluster. Partitions are a means of dividing hardware and nodes into useful groupings, and each grouping can have very different parameters assigned to it. Currently Coeus is divided into four general CPU node partitions, one aggregate CPU partition (allcpu), an Intel Phi processor partition, a large memory partition, and a GPU partition. Note that these partitions and parameters may change in the future as demand requires.

  • short - jobs are limited to 60 minutes.

  • medium - this is the default partition. If you don’t specify a partition, your job will run here. Jobs are limited to 4 days.

  • long - allows long-running jobs of up to 20 days.

  • interactive - allows interactive jobs. This can be useful for remote visualization tasks and interactive applications. Jobs can be up to 2 days.

  • himem - large memory and Tesla V100 GPU nodes. Jobs can be up to 20 days.

  • phi - Intel Phi processor nodes. Jobs can be up to 20 days.

  • gpu - RTX A5000 and A40 GPU nodes. Jobs can be up to 20 days.
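As an illustration of how a partition from the list above might be requested, here is a minimal sketch of a batch script. It assumes the standard Slurm sbatch options --partition and --time; the job name, the resource counts, and the placeholder workload are hypothetical and not taken from the Coeus configuration.

```shell
#!/bin/bash
# Hypothetical job name, used only for this example.
#SBATCH --job-name=example
# Request the short partition; omit this line to use the default (medium).
#SBATCH --partition=short
# Requested run time; it must fit within the limit of the chosen partition.
#SBATCH --time=00:30:00
#SBATCH --nodes=1
#SBATCH --ntasks=1

# Slurm sets SLURM_JOB_PARTITION inside a running job. The fallback value
# is only there so the script can also be run outside Slurm.
partition="${SLURM_JOB_PARTITION:-unknown}"
echo "Running in partition: ${partition}"
```

Saved under a name of your choosing (for example job.sh, a hypothetical filename), the script would typically be submitted with sbatch job.sh.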

The sinfo command will display a listing similar to the following. In this view, the 10 "gpu" partition nodes are in use and the "phi" partition nodes are drained (down) for maintenance.

[~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up 20-00:00:0     10    mix gpu[01-10]
medium*      up 4-00:00:00      3    mix compute[033-035]
medium*      up 4-00:00:00     93   idle compute[001-032,036-096]
long         up 20-00:00:0     30   idle compute[097-126]
short        up    4:00:00      2   idle compute[127-128]
allcpu       up    4:00:00      3    mix compute[033-035]
allcpu       up    4:00:00    125   idle compute[001-032,036-128]
himem        up 20-00:00:0      2  alloc himem[01-02]
phi          up 20-00:00:0      1 drain* phi01
phi          up 20-00:00:0     11  drain phi[02-12]
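To see at a glance where free nodes are, a listing like the one above can be summarized with a small filter. The following is only a sketch: the function name idle_per_partition is hypothetical, the sample rows are copied from the listing above, and the column order (PARTITION AVAIL TIMELIMIT NODES STATE NODELIST) is assumed to match the default sinfo layout.

```shell
#!/bin/bash
# Sum the idle nodes per partition from sinfo-style output on stdin.
# Assumed columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
idle_per_partition() {
    awk '
        $1 == "PARTITION" { next }       # skip the header row
        NF >= 5 && $5 == "idle" {
            sub(/\*$/, "", $1)           # strip the default-partition marker
            idle[$1] += $4               # column 4 is the node count
        }
        END { for (p in idle) print p, idle[p] }
    ' | sort
}

# Sample rows copied from the listing above.
sample='PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up 20-00:00:0 10 mix gpu[01-10]
medium* up 4-00:00:00 3 mix compute[033-035]
medium* up 4-00:00:00 93 idle compute[001-032,036-096]
long up 20-00:00:0 30 idle compute[097-126]
short up 4:00:00 2 idle compute[127-128]
allcpu up 4:00:00 3 mix compute[033-035]
allcpu up 4:00:00 125 idle compute[001-032,036-128]
himem up 20-00:00:0 2 alloc himem[01-02]
phi up 20-00:00:0 1 drain* phi01
phi up 20-00:00:0 11 drain phi[02-12]'

printf '%s\n' "$sample" | idle_per_partition
```

For the sample rows this reports 125 idle nodes for allcpu, 30 for long, 93 for medium, and 2 for short. On a login node, the same function could be fed live data with sinfo | idle_per_partition.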