This guide explains how to get set up on the High Performance Computing (HPC) systems we use. A few notes about terminology used in this guide:
The ‘local system’ refers to the internal CFD servers, for which oddjob is the firewall.
The ‘HPC’ refers to the high performance computing system being used. This can change over time, but is currently Niagara on SciNet.
>> is the command prompt.
Replace any instances of <username> or similar in the instructions below with your Niagara username.
Follow the steps below to get a Compute Canada account:
Go to the Compute Canada Database (CCDB) at https://ccdb.computecanada.ca
Click on `Register'; agree to the Acceptable Usage Policy; then fill out and submit a Compute Canada account application. Indicate on the application form that you are a `Sponsored User'. Enter the PI’s Compute Canada Role Identifier (CCRI); Prof. Zingg's CCRI is syu-780-01.
The PI will receive a request to approve the account application.
After a few days, you will receive a Compute Canada account confirmation email. Follow the link it provides to confirm your application.
The main HPC cluster currently in use is Niagara at SciNet. This is a large cluster for parallel computations on which our group has a resource allocation. Information, including useful setup guides, can be found at their wiki, and there is a good introductory presentation here.
The system is divided into three partitions, /home, /scratch, and /project.
/home: Your home partition, where you will keep your code and other small, long-term files. Storage is relatively limited, but it is backed up. Located at /home/z/zingg/<username>
/scratch: This is where you will run your jobs, and it is the only partition to which the compute nodes can write. It is quite large, but files older than 60 days are deleted on the 15th of each month, so once runs are complete, move any results you want to keep to /project. Located at /scratch/z/zingg/<username>
/project: This is for long-term storage. It is relatively large, although space is shared and we tend to fill it up, so store only final results here. It is backed up. Located at /project/z/zingg/<username>
IMPORTANT: As noted above, files older than 60 days are purged from /scratch. So, make sure to move any case results you want to keep to /project! You should be regularly moving final results to /project, and if you have a shorter project (e.g. summer student, MEng) make sure to move any of your final results at the end of your project before you leave. Group space on /project is limited, so only keep case results that are final - e.g. not any temporary 'test' cases that didn't work out. At the start of each month, when you log on to Niagara, it will provide you with the path to a file with a list of files scheduled for deletion. This is a good time to move any of these directories to /project if they need to be kept.
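You can also scan your scratch space yourself for files at risk of being purged. A minimal sketch (SCAN_DIR is a stand-in variable introduced for this example; on Niagara, set it to your scratch path):

```shell
# List regular files not modified in the last 60 days (purge candidates).
# SCAN_DIR is a stand-in; on Niagara set it to your scratch path,
# e.g. /scratch/z/zingg/<username>.
SCAN_DIR=${SCAN_DIR:-.}
find "$SCAN_DIR" -type f -mtime +60
```

Anything this prints that you want to keep should be moved to /project.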
Niagara consists of a number of `nodes', each of which contains 40 processors. There are two types of nodes - login nodes and compute nodes. Login nodes are accessible from the outside world. Compute nodes are only accessible through the scheduler.
Once you have a Compute Canada account, you will need to set up an account on Niagara. To do this, simply select "Join" on this page. It will take a day or two for your account to be set up.
Once you have your account, you will need to set up SSH keys to allow access. Niagara uses SSH keys to manage access for improved security. Setting up and using SSH keys is a two-step process:
Generate a set of keys on your machine (either your own computer, the lab computer, or both)
Put a copy of your public key on the Compute Canada database
For step 1, you can follow the instructions here under "Creating a key pair" and then for step 2 follow the instructions here under "Installing your key".
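As a sketch, step 1 from a terminal looks like the following (the demo_key file name and the empty passphrase are for illustration only; in practice, accept the default path ~/.ssh/id_ed25519 and protect the key with a passphrase):

```shell
# Generate an Ed25519 key pair non-interactively into ./demo_key
# (-N "" sets an empty passphrase for this demo only; use a real one).
ssh-keygen -t ed25519 -N "" -f ./demo_key -q

# The public half (.pub) is what you upload to the CCDB;
# never share the private half.
cat ./demo_key.pub
```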
Once your SSH keys are set up, log in with
>> ssh -Y <username>@niagara.computecanada.ca
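Optionally, an entry in your ~/.ssh/config saves typing the full hostname each time. A minimal sketch (the Host alias `niagara` is just a suggested name; ForwardX11 enables X forwarding, as the -Y flag does):

```
Host niagara
    HostName niagara.computecanada.ca
    User <username>
    ForwardX11 yes
```

With this in place, `>> ssh niagara` suffices.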
First, your .bashrc file should be modified. This file is in your home directory and configures aspects of your account. It is a text file and can be edited with any text editor, such as 'vi' or 'emacs'. It must include the following lines; add them at the end of the basic default .bashrc file you will find in your home directory. (Note: "uname" below is not your username but a command; do not replace it with your username. The text below can be copy-pasted as-is.)
HOST=$(uname)
ARCH=$(uname)_$(uname -m)
export sys="niagara"
export F_UFMTENDIAN=big
ulimit -c 0
ulimit -s unlimited
module load intel
module load intelmpi
module load git
export PATH=${PATH}:$HOME/bin:$HOME/grid_utils
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/$ARCH/lib
Also add any aliases you may find useful. Senior group members can provide you with a .bashrc they use, or with suggestions for useful aliases/functions. Once edited, the file must be ‘sourced’ for the changes to be applied, so in a terminal run
>> source ~/.bashrc
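As an example, aliases like the following are common additions (the names here are just suggestions, not group conventions):

```shell
# Suggested aliases (names are arbitrary; adapt to taste).
alias sq='squeue -u $USER'             # list your queued jobs
alias scr='cd /scratch/z/zingg/$USER'  # jump to your scratch space
alias prj='cd /project/z/zingg/$USER'  # jump to your project space
```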
Another useful step is creating soft links to common locations in your home directory with
>> cd
>> ln -s /scratch/z/zingg/<username>/ scratch
>> ln -s /project/z/zingg/<username>/ project
You will need a bin directory for housing executables. Create it with
>> mkdir ~/bin
Jetstream is run on the HPCs, which are shared by users across Canada. Thus, Jetstream runs (aka jobs) must be submitted through a scheduler (on Niagara, this scheduler is SLURM), which tries to ensure all users can run as much as possible but prioritises certain groups based on an allocation competition. Each year our research group receives a resource allocation (in core-years), which translates into a priority with which our jobs are scheduled. As jobs are submitted to the scheduler they are put into a queue. Jobs with higher priorities (as determined by a fairshare algorithm) move through the queue more quickly, as do smaller jobs (fewer cores and shorter wall times). Since we have a Compute Canada allocation, we typically have a relatively high priority. However, as our group runs more jobs, our priority drops. This drop occurs over a one-week rolling window, i.e. our current priority is calculated based on our group's usage over the last seven days. Thus, if our group is using a lot of resources, our priority will drop and you may have to wait for your jobs to run. In some cases you may have to wait a few days for your job to get through the queue. Be patient; this is just the way it is.
Jetstream must be run in your /scratch space, as this is where the compute nodes are allowed to write output files. Say you have a case you want to run in /scratch/z/zingg/<username>/test, and it contains all the necessary input and grid files. Jobs are submitted to the SLURM scheduler via a run script. A sample script looks like
#!/bin/bash
#SBATCH --nodes=8
#SBATCH --ntasks=290
#SBATCH --time=24:00:00
#SBATCH --job-name=test-job
#SBATCH --output=screen
#SBATCH --error=%x.e%j
cd $SLURM_SUBMIT_DIR
mpirun ~/bin/jetstream_x86_64
The inputs are as follows:
--ntasks: The number of processes (typically equal to the number of processors) you want the job to run on. Usually, you will set this equal to the number of blocks in your grid.
--nodes: The number of compute nodes required. Each compute node has 40 processors (on Niagara), so you want a sufficient number of nodes to accommodate the required --ntasks.
--time: The run time requested in hh:mm:ss.
--job-name: A convenient job name to identify the job.
--output: A file where standard output goes. Keep this as screen.
--error: A scheduler error output file. The above setting equates to <job-name>.e<job-number>.
mpirun: This last line specifies the name of the executable to be run, e.g. jetstream_x86_64 in this case.
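A quick sanity check that --nodes and --ntasks are consistent: nodes = ceil(ntasks / 40), since each Niagara node has 40 processors. For the sample script above:

```shell
# Node count for a given --ntasks, at 40 cores per Niagara node:
# nodes = ceil(ntasks / 40), via integer arithmetic.
ntasks=290
nodes=$(( (ntasks + 39) / 40 ))
echo "$nodes"    # prints 8, matching --nodes=8 in the sample script
```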
To submit the script to the scheduler, do
>> sbatch <run-script-name>
You can check the status of your queued jobs with
>> squeue -u <username>
For debugging (either during development or if you want to test a production run to make sure it will at least start successfully), Niagara has a number of high-turnover debug nodes. These can be accessed with the command
>> debugjob --clean N
where N is the number of compute nodes you want to request. This will submit the job for between 22 minutes and 1 hour, depending on how many nodes you request. Unlike a regular submission, where the job is handed to the scheduler and run according to the submission script, debug jobs are interactive. Once the job begins, it will log you into one of the compute nodes and wait for your input. Once on the compute node, use the mpirun command to execute Jetstream, specifying the number of processes with -np <number of processes>. For example, a debug session where you want to run Jetstream on 64 processes would look like:
>> debugjob --clean 2
debugjob: Requesting 2 nodes with 80 tasks for 60 minutes and 0 seconds
SALLOC: Pending job allocation 1799411
SALLOC: job 1799411 queued and waiting for resources
SALLOC: job 1799411 has been allocated resources
SALLOC: Granted job allocation 1799411
SALLOC: Waiting for resource configuration
SALLOC: Nodes nia0598 are ready for job
>> cd ~/scratch/test
>> mpirun -np 64 ~/bin/jetstream_x86_64
Standard I/O will go to the screen, so you will want to look at output with a different terminal. To stop the executable (e.g. if you get to the point in the run that you want to test), type CTRL+C. You will then have control of the debug node again, and you can rerun the executable, etc. Once you are done debugging, type exit to leave the debug session.
Data visualization (CFD results, shapes, line plots, etc.) is mostly done with Tecplot 360. In the lab, launch Tecplot with
>> tec360
When running Tecplot for the first time from the lab computers, you will have to give it the license info. When launched, choose the "Network License Manager" option and enter the following:
IP : 192.168.2.1
Port: 27100
You can also install Tecplot on your own computer for working from outside the lab. To do so, download the evaluation version here. Click on "Download Tecplot 360", then "Free Trial Software" on the following page. When launched for the first time, it will ask for a license option. Choose "Network License Manager" and use the following when asked:
IP : 128.100.201.72
Port : 27100
Visualization cannot be done on the HPCs, so files must be transferred to the lab network (or your own computer) with the scp command. For example, to push a file from the HPC to your local system, do
>> scp <file-on-hpc> <local-username>@oddjob.utias.utoronto.ca:~/<path-to-destination>
or to pull a file from the HPC to your local system do
>> scp <hpc-username>@niagara.computecanada.ca:<path-to-file> <path-to-destination>