SLURM (Simple Linux Utility for Resource Management) is a powerful workload manager used to efficiently manage and schedule computing resources in a cluster environment. It's like an organized queue at a busy restaurant, but instead of people waiting for a table, it manages tasks waiting for computational resources.
How does it work?
Imagine you have a bunch of tasks (jobs) that need to be done on a computer cluster. Some tasks may need a lot of resources, while others may require fewer. SLURM helps manage these tasks by prioritizing and allocating resources efficiently.
Key Concepts
Job submission: You submit your tasks (jobs) to SLURM, specifying what resources each job needs and how long it is expected to run.
Queue: SLURM maintains a queue of pending jobs. Jobs wait in the queue until the required resources become available.
Resource allocation: When resources become available, SLURM assigns them to jobs based on factors like priority, resource requirements, and fairness.
Job execution: Jobs are executed on the allocated resources. SLURM monitors their progress and manages any issues that arise.
In summary, SLURM simplifies the process of managing and scheduling jobs in a cluster environment. By efficiently allocating resources and prioritizing tasks, it helps maximize the utilization of computing resources and ensures fair access for all users.
Submitting Jobs
To submit a job to SLURM, you typically create a script specifying job details such as resource requirements, commands to execute, and output locations. You then submit this script to SLURM using the sbatch command.
sbatch myscript.sh
Example SLURM Script
#!/bin/bash
#SBATCH --job-name=myjob # Job name
#SBATCH --partition=compute # Queue (partition) to submit to
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (CPU cores) per node
#SBATCH --time=01:00:00 # Walltime (expected job duration, here HH:MM:SS)
# Commands to run:
echo "Hello, SLURM!"
Here's an overview of some commonly used SBATCH options (a fuller example script using several of them follows the list):
--job-name: Specifies the name of the job. This option is used to identify the job in SLURM's output and logs.
--partition: Specifies the partition (queue) to which the job should be submitted. Different partitions may have different policies, resources, or access restrictions.
--nodes: Specifies the number of nodes required for the job. A node typically represents a single machine with one or more CPUs.
--ntasks: Specifies the total number of tasks (or CPU cores) required for the job across all nodes (combined with --nodes=1, all tasks are placed on the same node).
--ntasks-per-node: Specifies the number of tasks (or CPU cores) to be launched on each node. This option is useful for parallel jobs.
--cpus-per-task: Specifies the number of CPU cores assigned to each task. This is useful for multithreaded programs where a single task uses several cores.
--mem: Specifies the amount of memory required for the job. Units can be given with a suffix such as M or G (e.g. --mem=4G).
--time: Specifies the maximum time the job is allowed to run. This option is also known as walltime and is specified in days-hours:minutes:seconds format.
--output: Specifies the file where standard output from the job will be written.
--error: Specifies the file where standard error from the job will be written.
--mail-type: Specifies the type of email notifications to be sent. Options include BEGIN, END, FAIL, REQUEUE, and ALL.
--mail-user: Specifies the email address to which notifications should be sent.
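As a sketch of how several of these options combine, here is a hypothetical batch script; the partition name, resource amounts, file names, and email address are placeholders to adapt to your own cluster:
#!/bin/bash
#SBATCH --job-name=analysis          # Job name shown in the queue
#SBATCH --partition=compute          # Partition (queue) to submit to
#SBATCH --nodes=1                    # Run on a single node
#SBATCH --ntasks=1                   # One task...
#SBATCH --cpus-per-task=4            # ...using 4 CPU cores
#SBATCH --mem=8G                     # 8 GB of memory for the job
#SBATCH --time=0-02:00:00            # Walltime: 2 hours (days-HH:MM:SS)
#SBATCH --output=analysis_%j.out     # File for standard output
#SBATCH --error=analysis_%j.err      # File for standard error
#SBATCH --mail-type=END,FAIL         # Email when the job ends or fails
#SBATCH --mail-user=you@example.com  # Where to send notifications
# Commands to run:
echo "Running on $(hostname) with $SLURM_CPUS_PER_TASK cores"
The %j in the output and error file names is expanded by SLURM to the job ID, which keeps logs from different runs separate.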
Managing resources and time
Being mindful and efficient in asking for resources like time and CPUs is essential for maximizing productivity and resource utilization in a computational environment. Here are some tips to achieve this balance:
Plan and Estimate: Carefully plan your job's resource requirements in advance. Estimate the necessary time and CPUs needed for efficient completion.
Monitor and Optimize: Monitor resource usage during job execution and adjust resource requests based on actual usage patterns (see the sacct example after this section). Optimize code and utilize parallelism effectively to minimize resource consumption.
Communicate and Be Considerate: Communicate with cluster administrators when needed and be considerate of other users' resource needs. Avoid excessive resource requests and promptly release allocated resources after job completion.
By following these tips, you can strike a balance between being mindful of resource usage and efficiently utilizing computational resources to accomplish your tasks. This approach promotes productivity, minimizes resource wastage, and fosters a collaborative computing environment.
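As an example of checking actual usage afterwards, sacct (covered in more detail below) can report what a finished job consumed; the job ID and the choice of fields here are just an illustration:
sacct -j 12345 --format=JobID,JobName,Elapsed,TotalCPU,MaxRSS,ReqMem,State
If MaxRSS is well below ReqMem, or TotalCPU is much smaller than Elapsed multiplied by the number of allocated cores, the next submission can likely request less memory or fewer CPUs.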
Checking the state of a submitted job
You can check the state of the general queue with:
squeue
And the current state of only your own submitted jobs (replace username with your actual username):
squeue -u username
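If the default output is too terse, squeue can also be asked for a long listing, which on most installations additionally shows each job's time limit:
squeue -l -u username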
Cancelling a submitted/running job
Sometimes we realize after submitting that we did something wrong, or that the script is not running as intended. You can then cancel the job using its job_id, which you can always look up with the squeue command.
scancel job_id
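If you want to cancel all of your own jobs at once, scancel also accepts a username:
scancel -u username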
Stats of the job
To get more detailed information about a submitted job, find its job_id and run:
scontrol show job -dd job_id
After running the command, SLURM will display detailed information about the specified job, including its state, resource allocation, time limits, dependencies, and much more. This can be helpful for debugging, monitoring job progress, or gathering detailed information for analysis.
Alternatively, we can view accounting details of jobs that have already finished (by default, only jobs from midnight of the current day onwards are listed) with:
sacct
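To look further back than the current day, you can pass a start date and choose which fields to display; the date here is just an example:
sacct -S 2024-01-01 --format=JobID,JobName,Partition,State,Elapsed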