Expanse is the latest supercomputer at the San Diego Supercomputer Center.
Documentation about Expanse may be found here: https://www.sdsc.edu/support/user_guides/expanse
You should watch the excellent introduction to Expanse here:
https://www.sdsc.edu/event_items/202202_ExpanseWebinar-M.Thomas.html
You will access Expanse through your ACCESS account.
You should be enrolled in the parallel computation class CIS240506 on ACCESS automatically by our TA team. If you are not enrolled and do not see this information, post privately on Piazza and the TAs will help you set it up.
You can use your ACCESS credentials to login to Expanse by typing the following on your terminal and then entering the password:
ssh <access_username>@login.expanse.sdsc.edu
Set up your shell by appending the contents of the file
/expanse/lustre/projects/csd911/bwchin/public/Profiles/BASH_PROFILE
to your ~/.bash_profile
This will load the necessary modules and set up ${PUB} to point to our class-specific directory.
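For example, a minimal sketch of one way to do this from a login-node shell (assuming the BASH_PROFILE file above is readable by your account):
cat /expanse/lustre/projects/csd911/bwchin/public/Profiles/BASH_PROFILE >> ~/.bash_profile
source ~/.bash_profile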
Our project group account is csd911
$PUB on Expanse is:
/expanse/lustre/projects/csd911/bwchin/cse260-sp22/public
You have an allocation on Expanse. Expanse usage is measured in SUs (service units), charged at roughly one SU per core per hour. Our jobs SHOULD be very short (less than 5 minutes), so our SU usage should be small.
However, you want to be very careful about using SUs (as with Amazon, we have limited funds). Unlike Amazon, when we run out it is difficult or impossible to get more: the SUs are allocated under an NSF grant and we have to apply for them. So use them carefully. Being on the login node does not consume SUs, but you should not be doing compute on the login nodes (compiling is ok).
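As a rough worked example (assuming the one-SU-per-core-hour rate above and purely illustrative job sizes): a 5-minute job on 16 cores of the shared partition costs about 16 * 5/60, or roughly 1.3 SUs, while a 5-minute job billed for a full 128-core node costs about 128 * 5/60, or roughly 10.7 SUs.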
To see how many SUs you have used:
expanse-client project csd911
To check the status of your jobs in the queue:
squeue -u <access_username>
To cancel a submitted job:
scancel <jobid>
Interactive jobs can run on up to one node, and they consume your SUs. If you want to debug an 8-core MPI job, the debug partition lets you ask for just 8 cores (sharing the machine with others) and charges you only for the cores you ask for. The compute partition allocates an entire node (128 cores) whether you use them or not, and you will be charged for the full 128 cores.
The following command will give you one core for 30 minutes.
srun --partition=debug --pty --account=csd911 --ntasks-per-node=1 --nodes=1 --mem=1G -t 00:30:00 --wait=0 --export=ALL /bin/bash
You can change --ntasks-per-node to get more cores (e.g. to debug MPI), but be careful: the cores will be charged to you whether they are idle or active. A better approach might be to submit a batch job with sbatch.
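For example, a sketch of an interactive request for 8 cores to debug an 8-process MPI run (the 4G memory request here is an assumed value; adjust it to your needs):
srun --partition=debug --pty --account=csd911 --nodes=1 --ntasks-per-node=8 --mem=4G -t 00:30:00 --wait=0 --export=ALL /bin/bash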
To see what queues (partitions) are available and their status:
sinfo -a
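If you only care about particular partitions, sinfo also accepts a comma-separated partition list, e.g. (a sketch; partition names as used in this class):
sinfo -p shared,compute,debug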
-------------------------
There are two batch partitions (queues) that we will use: shared and compute.
shared allows you to share a compute node. You can ask for up to 128 cores on a shared node, and you will be charged for each core you ask for.
compute allocates an entire node. We use the compute partition when we need more than 128 cores (e.g. 2 nodes, or 256 cores). Use compute sparingly, and only once you know your code is functioning well.
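As a sketch, a compute-partition job that really does need 2 full nodes (256 MPI ranks; these values are only an example) would use header lines like:
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128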
-------------------------
We provide a python script called makeslurm.py which can create a .slurm batch script for you.
The following is information about the .slurm script in case you want to create your own or modify the ones generated by makeslurm.py.
To run a batch job on Expanse (this example uses the shared partition):
sbatch jobscriptfile
Sample jobscriptfile
#!/bin/bash
#SBATCH --job-name="hellompi"
#SBATCH --output="hellompi.%j.%N.out"
#SBATCH --partition=shared
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
### This job runs on 1 node with 16 cores, for a total of 16 tasks.
#SBATCH --mem=1G
#SBATCH --account=csd911
#SBATCH --export=None
## Time limit is HH:MM:SS
## do not change this unless you know what you are doing; you can easily run out of compute time
#SBATCH -t 00:02:00
export SLURM_EXPORT_ENV=ALL
module purge
module load cpu
#Load module file(s) into the shell environment
module load gcc
module load openmpi
module load slurm
srun --mpi=pmi2 -n 16 ../hello_mpi
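Assuming you saved the script above as hellompi.slurm (a hypothetical filename), a typical cycle looks like:
sbatch hellompi.slurm
squeue -u <access_username>    # watch the job until it finishes
cat hellompi.*.out             # %j and %N in --output expand to the job id and node name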
You can use tau on Expanse to help gain insight into how your program is working (where it spends its time, how it communicates, etc).
There is extensive documentation on tau at : https://www.cs.uoregon.edu/research/tau/home.php
We have setup a tau environment for you.
To use tau, you simply have to load the appropriate modules.
On Expanse, run: module spider tau
This will tell you what modules should be loaded. We will load the following:
Make sure you have $PUB set to: /expanse/lustre/projects/csd911/bwchin/cse260-sp22/public
module spider tau tells us we need the following modules:
cpu/0.15.4 gcc/10.2.0 openmpi/4.0.4
Load the required modules and then load tau:
module load tau
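Putting it together, a sketch of the full sequence (module versions as reported by module spider tau above):
module purge
module load cpu/0.15.4
module load gcc/10.2.0
module load openmpi/4.0.4
module load tau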
To profile, change your execution script to run your program under tau_exec, i.e. change
srun ..... ./apf .....
to
srun ..... tau_exec -io ./apf ......
Make sure you module load tau in your sbatch script.
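Applied to the earlier sample sbatch script (a sketch; hello_mpi and -n 16 are just the values from that example), the relevant lines become:
module load tau
srun --mpi=pmi2 -n 16 tau_exec -io ../hello_mpi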
After running your program you can view the results with pprof (the run will have created a stats file in the directory in which you ran your program).
pprof is a command line tool which will dump the data to the screen.
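For example, a minimal sketch, run from the directory in which the job ran:
pprof | less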
paraprof is a graphical interface, but we haven't been able to get that to work so far due to a Java library issue.
perf is a Linux performance tool that can access hardware performance counters.
perf stat <command>
will collect various statistics about your program (e.g. you can measure IPC).
man perf for details. Also an excellent resource is https://www.brendangregg.com/perf.html#CPUstatistics
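For example, a sketch of measuring IPC for the apf binary used above (perf stat reports instructions per cycle alongside the raw counts):
perf stat -e cycles,instructions ./apf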
Perf can also profile your code by recording statistics for parts of your program with perf record.
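For example, a sketch of a profiling run (again using the apf binary from above; -g records call graphs):
perf record -g ./apf
perf report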
module load AMDuProf will give you access to AMD's own performance collection tool. See the excellent video by SDSC's Bob Sinkovits for a good introduction to AMDuProf. AMDuProf supports generic statistics collection, profiling (like perf record), and instruction-based sampling.
Some features of AMDuProf are not available on Expanse for security reasons.