Troubleshooting

Offline nodes

The Amarel and Caliburn systems each comprise many compute nodes, and individual nodes become unavailable (usually only temporarily) for many different reasons. To see a list of all offline nodes, along with the reason each was taken offline, run: sinfo --list-reasons
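If you want to check on a particular node, SLURM's standard query tools show its state and the recorded reason. A minimal sketch follows; the node name amarel-node001 is just a placeholder, so substitute a real node name from the sinfo output.

  # List all offline/drained nodes with the recorded reason, who set it, and when
  sinfo --list-reasons

  # Show the full record for one node and pull out its state and offline reason
  scontrol show node amarel-node001 | grep -i -E 'State=|Reason='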

Below are some common reasons why nodes may be offline:

When our infrastructure team removes a node from service, they enter a brief explanation. Examples include:

  • Needs service

  • Epilog error

  • Hardware failure

  • Bad hard drive

  • IB problem (indicates a problem with the InfiniBand network)

  • Testing

  • Dead (the node is not repairable and is permanently offline)

  • Reserved: Maintenance (the node is reserved for maintenance)

If SLURM automatically takes a node offline, you may see reasons such as:

  • not responding (SLURM can't communicate with the node)

  • low RealMemory (the node isn't reporting the correct amount of RAM)

  • gres/gpu count too low (a GPU node is not reporting the correct number of GPUs)

We run a "Node Health Check" (NHC) script that verifies a node's readiness to run jobs. If a node fails any of its tests, the NHC script sets the node offline with reasons that include the following (see the example after this list for a quick way to spot NHC-flagged nodes):

  • NHC: check_ps_cpu (a runaway process on a node)

  • NHC: check_fs (one or more of the network filesystems are not mounted correctly or a directory is near capacity)

  • NHC: nv_health (a problem with an NVIDIA GPU)

  • NHC: check_hw_ib (an InfiniBand problem)

  • NHC: check_ps_daemon (a problem with the authentication service)

  • NHC: check_hw_physmem (a problem with the amount of RAM being reported)

  • NHC: check_cmd_output (a problem with output from a test command)
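
Because these reason strings start with the "NHC:" prefix, you can filter the offline-node list to show only nodes flagged by the health checks. This is a minimal sketch assuming the prefix appears exactly as listed above:

  # Show only nodes taken offline by the Node Health Check script
  sinfo --list-reasons | grep NHC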