Check Job Performance
Analyzing compute job performance
Initially, your goal is just to get your process running, but it's also important to know how well your job is running. Is your job using system resources effectively? How do you check which system resources your application is using? Are you making full use of the resources available to you? In an ideal world, your process will use most of the available CPU and RAM on a compute node or server. There are some basic steps you can take to get an idea of how your job is running.
First, read the documentation (all of the pages for the OIT-RC hardware specifications can be found here). It should identify system requirements, and it will often give you important clues as to whether your software can use multiple threads, run in parallel (i.e. use MPI), and so on. If you're writing your own application, you can also use the following steps to track how well your software is using system resources.
Determining your Hardware Usage
Not sure about some of the terminology used in this FAQ? You can find answers in the Glossary and commonly used terms page.
Observing jobs in progress
If you want to know exactly how much RAM and CPU your job is using, there are several ways to do so.
Use the Grafana web interface to get an overview of system resource usage. Grafana shows a node as green if it's lightly used, then yellow, orange, and red as the load gets heavier. Clicking on a specific node gives fairly detailed aggregate information about system usage. Keep in mind that the color can be somewhat deceiving, since it reflects aggregate information.
For more detailed information, you can log in to the compute node where your process is running and check system usage with tools such as htop and free (refer to the How-To pages on using these tools).
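For example, assuming your job landed on a node named compute124 (an illustrative name; use the node your job is actually running on), the check might look like this:
ssh compute124   # log in to the compute node running your job
htop             # interactive per-core CPU and memory view
free -h          # total, used, and available RAM in human-readable units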
Adapting your SBATCH Script
Here is a collection of possible limitations and how to specify each one as an SBATCH directive; a complete example script follows this list. For more on these, refer to SLURM Parallelism.
Include a memory setting. For example, if your application uses approximately 7-8GB RAM, specify 10GB (that amount plus some extra):
#SBATCH --mem 10GB
Is it a single-threaded application, meaning it can only use a single core of a multi-core processor?
Add #SBATCH --ntasks 1
Add #SBATCH --cpus-per-task 1
Yes, this directive says "cpus", but it means cores.
Can it use many cores?
Add #SBATCH --cpus-per-task <The number of cores that can be used (cannot exceed the maximum on a node)>
Can your application use MPI to scale across a number of servers (aka nodes)? Or do you need to run multiple copies of your job?
Add #SBATCH --nodes <The number of nodes to use>
Does your application use more than 128GB RAM?
If you are using one of the standalone Linux servers (Agamede, Circe, or Hecate), use Hecate, which has the most RAM.
If you are using Coeus, use the himem partition: #SBATCH --partition himem.
Does your application need to read/write a lot of files to/from disk?
Be sure to place the files being read and written in /scratch (on Coeus).
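Putting these directives together, here is a minimal example script. It is only a sketch: the job name, module, program name, and the core and memory numbers are placeholders to adapt to your own application.
#!/bin/bash
#SBATCH --job-name example_job     # placeholder job name
#SBATCH --partition short          # small test partition; change for production runs
#SBATCH --nodes 1                  # a single server
#SBATCH --ntasks 1                 # one task (one copy of the program)
#SBATCH --cpus-per-task 4          # four cores for a multi-threaded application
#SBATCH --mem 10GB                 # roughly what the job needs, plus some extra

module load gcc                    # hypothetical module; load whatever yours needs
cd /scratch/$USER                  # I/O-heavy jobs on Coeus should work in /scratch
./my_application                   # placeholder for your program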
Scaling Up
As you scale your job on more nodes, it's a good idea to verify system usage at each step.
Before running a process with the scheduler, you can try a test run on a login node to verify that the application runs, that the path is correct, and that you're loading the necessary modules.
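For example (my_application and the module name are placeholders for your own program and whatever it depends on):
module load gcc             # load the modules your application needs
ls -l my_application        # confirm the path to the executable is correct
./my_application --help     # a quick, tiny run; keep login-node tests small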
Next, try submitting a small job to the "short" queue using one or both of the two nodes available there. Are you able to submit without error? Does it appear that your application is running as expected? If so, continue.
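For example, assuming you saved the script above as my_job.sbatch:
sbatch --partition short my_job.sbatch
squeue -u $USER             # confirm the job is queued or running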
If your application will only run on a single server, you can now submit the job to run on the proper partition.
If you're going to run multiple iterations of the same application, you'll want to learn about submitting array jobs with SLURM (more on that here).
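As a minimal sketch, the following directive runs ten copies of a job, and SLURM sets SLURM_ARRAY_TASK_ID to a different index in each copy (the input file naming is hypothetical):
#SBATCH --array 0-9
./my_application input_${SLURM_ARRAY_TASK_ID}.dat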
If you're running a parallel application, once you are able to submit your jobs correctly, scale your processes up, starting with four nodes. Are you maximizing those nodes? (They will appear orange or red in the Grafana interface.)
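A four-node scale-up might look like this (the tasks-per-node count and program name are placeholders; match the count to the cores on your nodes):
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 20
srun ./my_mpi_application   # srun launches one MPI task per allocated slot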
How to verify that all of the selected cores are being used
While the job is running, use squeue to find out which compute node it's running on, then ssh into that node.
squeue -u $USER
ssh compute124
Run htop to watch per-core usage and visually verify that all of the cores you requested are busy.
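If job accounting is enabled on the cluster (an assumption; check the OIT-RC documentation), you can also query a running job's usage from a login node with sstat, replacing <jobid> with your job's ID:
sstat --format=AveCPU,MaxRSS -j <jobid>.batch   # average CPU time and peak memory of the batch step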