SLURM Parallelism
What is Parallelism?
Parallelism on computers refers to doing several calculations simultaneously, or in parallel. HPC clusters enable parallelism by providing access to many nodes and many cores within each node (for the specifics of each of OIT-RC's systems, visit this page). OIT-RC uses SLURM to manage the jobs on each cluster, and SLURM has a handful of ways to specify how to run each job.
Parallelism with SLURM
Here are some relevant options SLURM provides to let users control how their jobs are run. These are best specified as #SBATCH directives in a submission script. To get the best use out of the systems, check the systems page for the hardware specifics of each cluster.
Be aware that SLURM considers each Phi hyper-thread as a core.
SLURM Flags
--ntasks <N_TASKS>
This flag will set the number of tasks to allocate (make available to use) to N_TASKS. The default is one task per node; note that --cpus-per-task will change this default (since each task defaults to a single core).
Set the X in mpiexec -n X to N_TASKS as well to tell it to run that many copies of the program. Alternatively, up to N_TASKS processes can be sent to the background (with &) in the sbatch script.
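As a sketch, a submission script can read the allocated task count back from SLURM's SLURM_NTASKS environment variable instead of hard-coding the same number twice (the program name ./helloMpi.py is a placeholder, as in the examples below):

```shell
#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --output out.txt

# SLURM exports the number of allocated tasks as SLURM_NTASKS,
# so the mpiexec count always matches the allocation above.
mpiexec -n "$SLURM_NTASKS" python3 ./helloMpi.py
```

This way, changing the #SBATCH --ntasks line is enough; the mpiexec count follows automatically.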
--nodes <MIN[-MAX]>
This flag controls the minimum (MIN) and maximum (MAX) number of nodes for a job. If MAX is not supplied, MIN is treated as both the minimum and maximum. If there are not enough nodes in the selected partition (specified by --partition), then the job will be left in the pending (PD) state until the resources become available.
--ntasks-per-node <TASKS>
This flag will set the number of tasks to use on each node to TASKS. If this flag is used with --ntasks, then it instead indicates the maximum tasks per node.
--cpus-per-task <N_CPUS>
This flag will set the number of cores to use per task to N_CPUS. If this is used without --ntasks, then every selected node will have as many tasks allocated to it as possible while obeying N_CPUS.
Vocabulary
What is a process?
A process is a running program. Firefox is an application, and when it is started, it creates a process of the running Firefox instance.
What is a task?
A task (here meaning a SLURM task, not a job array's task) is the space allocated for a single process; on SLURM, this is a single core by default.
What's the difference between a core and a CPU?
A core is a single processing unit, whereas a CPU contains several cores. OIT-RC's systems treat each core as a CPU, and each is independently allocatable.
In layman's terms, what is a server, a cluster, and a node?
A server is someone else's computer that a user connects to over the web, also known as tunnelling into the server.
An example of this would be how in order to connect to Agamede, a user will have to tunnel into (or connect to) it with the below.
ssh <odin username>@agamede.rc.pdx.edu
A cluster is a server containing many sheltered servers that are not connected to the web, are highly interconnected, and receive all web interactions through one or two designated login servers.
An example of this would be how Coeus' login1 and login2 nodes can be connected to directly, but to reach one of the compute nodes, a user must tunnel from a login node to the compute node, since the compute node is not connected to the outside world. This can be verified with the below two steps, where the user tunnels into login1 on Coeus and then from there tunnels into compute001.
ssh <odin username>@login1.coeus.rc.pdx.edu
ssh compute001
The different servers in a cluster are called nodes.
Examples
A lot of these options sound extremely similar, but they have different effects. Here are some examples. The Python examples use a virtual environment to have access to mpi4py (more information can be found here), but that is not the focus of this document.
It is recommended to use mpiexec instead of mpirun; they work very similarly, but mpirun is the older form of mpiexec.
Allocating and using four tasks between two nodes with mpiexec
#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --nodes 2
#SBATCH --output out.txt
module purge
module load Python/gcc/3.7.5/gcc-6.3.0
source ./mpiVirtualEnv/env/bin/activate
mpiexec -n 4 python3 ./helloMpi.py
This script will first allocate four tasks across two nodes and then fill all four available tasks with a process of helloMpi.py; in summary, helloMpi.py runs four times, using a minimum and maximum of two nodes.
Allocating and using four tasks without mpiexec
#!/bin/bash
#SBATCH --ntasks 4
#SBATCH --output out.txt
module purge
module load Python/gcc/3.7.5/gcc-6.3.0
source ./mpiVirtualEnv/env/bin/activate
python3 ./calculationsExecutable 50 &
python3 ./calculationsExecutable 75 &
python3 ./calculationsExecutable 80 &
python3 ./calculationsExecutable 100 &
wait
This sbatch script will first allocate four tasks and then fill each of the four allocated tasks with its own calculationsExecutable process, with an input of 50, 75, 80, or 100, by sending each to the background (with the &). Since each process is sent to the background to fill its own task, the next process can be started immediately, assuming there is another task to fill. The final wait keeps the script (and therefore the job) alive until all four background processes finish. If someone finds themselves using something like this, where they do the same calculations on different data, it is highly recommended to use a job array.
Since this sbatch script does not specify the number of nodes, the nodes the four tasks can land on are determined by what is available in the medium partition (the default partition when none is specified).
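Following the recommendation above, the same four runs could be written as a job array instead. This is a sketch: calculationsExecutable is carried over from the example, and SLURM sets SLURM_ARRAY_TASK_ID for each array task.

```shell
#!/bin/bash
#SBATCH --array 0-3
#SBATCH --output out_%a.txt

# One array task runs per input value; SLURM sets
# SLURM_ARRAY_TASK_ID to 0, 1, 2, or 3 for each task.
INPUTS=(50 75 80 100)
python3 ./calculationsExecutable "${INPUTS[$SLURM_ARRAY_TASK_ID]}"
```

Each array task is scheduled independently, and the %a in the output filename keeps the four tasks from overwriting each other's output.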
Specifying tasks and tasks per node
#!/bin/bash
#SBATCH --ntasks 2
#SBATCH --ntasks-per-node 5
#SBATCH --output out.txt
module purge
module load Python/gcc/3.7.5/gcc-6.3.0
source ./mpiVirtualEnv/env/bin/activate
mpiexec -n 2 python3 helloMpi.py
This sbatch script will allocate two tasks, limit the maximum tasks per node to five, and then create two processes of helloMpi.py, one in each task. Since --ntasks defaults to one task per node, each task will be on a different node.
Specifying nodes and tasks per node
#!/bin/bash
#SBATCH --ntasks-per-node 5
#SBATCH --nodes 2
#SBATCH --output out.txt
module purge
module load Python/gcc/3.7.5/gcc-6.3.0
source ./mpiVirtualEnv/env/bin/activate
mpiexec -n 10 python3 helloMpi.py
This sbatch script will allocate five tasks per node and select two nodes to run on. This allocates a total of ten tasks, each of which will then be filled by a process of helloMpi.py. There is no way to know what the ratio of tasks on the first node to tasks on the second node will be.
Specifying nodes, tasks per node, and tasks
#!/bin/bash
#SBATCH --ntasks-per-node 5
#SBATCH --nodes 2
#SBATCH --ntasks 10
#SBATCH --output out.txt
module purge
module load Python/gcc/3.7.5/gcc-6.3.0
source ./mpiVirtualEnv/env/bin/activate
mpiexec -n 10 python3 helloMpi.py
This sbatch script will set the maximum tasks per node to five, and select two nodes to run on. This will allocate a total of ten tasks, and since there cannot be more than five tasks per node, both nodes must get five tasks each. Each task is then filled by a process of running helloMpi.py.
Multi-threading
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 16
This script will allocate a single task with 16 cores, enabling multi-threading. Generally, this is only done if the user specifically wants multi-threading.
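As a sketch of how a threaded program might pick up that allocation, SLURM exports the per-task core count as SLURM_CPUS_PER_TASK, which OpenMP-based programs read through OMP_NUM_THREADS (the program name ./myThreadedProgram is a placeholder):

```shell
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 16
#SBATCH --output out.txt

# Tell an OpenMP program to start one thread per allocated core.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./myThreadedProgram
```

Without this export, a threaded program may guess the thread count from the node's total core count rather than from what the job was actually allocated.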
How to verify that all of the selected cores are being used:
While the job is running, use squeue to see which compute node it is on, then ssh into that node.
squeue
ssh compute124
Run htop on that node to visually check the per-core usage and confirm that all allocated cores are being used.