Cluster Job Issues
General Job Issues
When writing code and sbatch scripts, add the smallest piece of code that moves toward your goal, run it to make sure it works, then add another small piece, and repeat. This builds up your code while confirming each part works, and makes it easy to pinpoint where problems appear.
If you are having issues using one of the clusters, the same process will help you narrow down where the issue stems from. Be sure to work through this list in order.
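The incremental approach above can be sketched as a minimal sbatch script that does nothing but print the node's hostname; the job name, partition, and time limit below are placeholders to adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --job-name=tiny-test       # placeholder job name
#SBATCH --partition=medium         # replace with a partition listed by sinfo
#SBATCH --nodes=1
#SBATCH --time=00:05:00

# Smallest useful job: confirm the scheduler will run anything at all.
hostname
```

Once this runs successfully, add the next small piece (for example, a module load) and resubmit.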
1. Make sure the software you are attempting to use exists and loads correctly:
   - Exit the node you are connected to, then sign back in.
   - Load the software in the same manner that your sbatch script does.
   - Start the software.
   Example: Exit, log back in, load the module Python/3.6.4/gcc-6.3.0, and then type "python3" to start the Python interpreter.
2. Use the sinfo command to verify that the partition has enough idle nodes to run your job.
3. Submit an sbatch script that runs only srun hostname on all of the desired nodes.
4. Download and run our MPI Hello World program on the nodes you are trying to use; it is recommended to start with the C version of Hello World and then move to your target language. These can be found under Getting Started's SLURM Scheduler in the top right menu of this site.
5. If all of the above works, use salloc to allocate the nodes in question and ssh into them. Run a few basic commands to verify that each node is truly functional.
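Taken together, the command-line checks above might look like the following sketch; the partition and node names are examples, not fixed values for any particular cluster:

```shell
# Verify the partition has idle nodes available.
sinfo --partition=medium

# Create and submit a minimal script that only runs `srun hostname`.
cat > hostname-test.sh <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --partition=medium
srun hostname
EOF
sbatch hostname-test.sh

# If that works, allocate the nodes interactively and log in to one.
salloc --nodes=2 --partition=medium
ssh compute125    # replace with a node name from your allocation
```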
If you have an error of "User not found on host," go to go.pdx.edu/help, click Common Requests, and click "Get IT Help." Fill out that form, and in the Summary of Request line, put "Research Computer Help: 'User not found on host.'" In the description, please include the entire "User not found on host" block of error messages.
Performance Issues
If your job is not running as fast as anticipated, several checks can help determine the cause. Try these after submitting a job and ssh-ing into the compute node(s) it is running on. For example, if squeue shows your job on compute[125-126] or compute126, you would connect to compute126 with ssh compute126.
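Finding and connecting to a running job's node might look like this; the node names are illustrative:

```shell
# See where your jobs are running; the NODELIST column shows the nodes,
# e.g. compute[125-126].
squeue -u $USER

# Connect to one of those nodes to inspect the running job.
ssh compute126
```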
Be sure you are launching the job with an MPI launcher, such as mpiexec (recommended) or mpirun.
Use the htop command to verify that your processes are actually using all of the available/desired processors rather than sitting idle.
Also check that the partition has enough cores for what you have requested.
Use the free command to verify that your processes are actually using the available RAM. If more RAM is needed, use sbatch's --mem or --mem-per-cpu options.
Make sure you have not requested more cores than are physically available in that partition. This can be checked on a running job with htop (see above), or by going to Systems under the top right menu bar of this site and selecting the system to inspect.
Make sure you read and write your data in /scratch, not your home directory or anywhere else.
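A sketch of these performance checks, run on the compute node; the memory value and scratch path are examples to adapt:

```shell
# Watch per-core CPU use; all requested cores should be busy (q to quit).
htop

# Check how much RAM the node is actually using.
free -h

# If the job needs more memory, request it at submission time,
# e.g. 4 GB per allocated CPU:
sbatch --mem-per-cpu=4G myjob.sh

# Keep job input/output under /scratch rather than your home directory.
cd /scratch/$USER/myjob
```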
For more on the Coeus systems, see the Systems page under the top right menu of this site; for more on performance, see the related documentation there.