
Using The Hardware

As mentioned on the hardware organisation page, each user has a home directory in /user/user_name, which is available on all nodes in the cluster. In addition, each compute node has a local scratch directory, /scratch, which must be used by IO-heavy jobs so that their load lands on the node-local file system instead of on the shared storage system where the home directories are stored.
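
As an illustration, the sketch below shows one way a job script (batch jobs and slurm are introduced further down) might stage its data through /scratch; the directory layout and program name are hypothetical, and $SLURM_JOB_ID is an environment variable slurm sets for each job:

  WORKDIR=/scratch/$USER/$SLURM_JOB_ID        # node-local scratch area for this job
  mkdir -p "$WORKDIR"
  cp ~/input.dat "$WORKDIR"/                  # stage input from the shared home directory
  cd "$WORKDIR"
  ./my_io_heavy_program input.dat > output.dat
  cp output.dat ~/results/                    # copy results back to shared storage
  rm -rf "$WORKDIR"                           # clean up the scratch area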

There is no backup of files stored on MCC - this is the responsibility of the users themselves. If your files are important, you have to keep a remote backup of them. We do run the disk system with some redundancy (RAID6), but do not count on it being reliable. We do have a second storage system which we use for backup purposes. It is currently on a different network and only accessible through a 1 Gbit ethernet connection shared with several other machines. Therefore, it is currently primarily used for occasional full backups of the production storage system. We are working with the IT department to improve the situation.
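
A simple way to keep such a backup is to copy your data to a machine you control, e.g. with rsync; the host and paths below are only placeholders:

  rsync -av ~/my_project/ username@my-backup-host:mcc-backup/my_project/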

Prerequisites

In order to get access to run software on the compute nodes, users have to request it through the job scheduler (we are using slurm-llnl, or simply slurm).
In order to be able to use slurm you will need to add it to your PATH environment variable. This is done by adding:

  PATH=/pack/slurm/bin:$PATH; export PATH

to your ~/.bashrc file.

After loading the changes in your bash session, e.g. by running:

  source ~/.bashrc

you will be ready to run jobs on the cluster.
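
A quick way to check that slurm is now on your PATH is to list the partitions and nodes in the cluster:

  sinfo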

Overall, there are three ways of using the cluster through the slurm scheduler:
  • Interactive jobs
  • Batch jobs
  • MPI batch jobs

Interactive jobs

Interactive jobs give the user a terminal where they can enter a command, see the result, enter another command, see the result, and so forth. These jobs can be useful for compiling software or running a small example. An interactive job is started like this:

  srun --pty -u bash -i

One can request a specific configuration of resources (number of nodes, CPUs per node, etc.) by passing extra options; consult the manual with "man srun".
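
For example, one might request two nodes with four tasks in total and a 30-minute limit for an interactive session (the numbers are purely illustrative):

  srun --nodes=2 --ntasks=4 --time=30:00 --pty -u bash -i
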
However, as you start running more and bigger jobs, you will most likely find that batch jobs make you much more productive.

Batch jobs

Batch jobs allow you to run jobs without having to enter commands manually. At the same time they help improve the utilisation of the hardware, since jobs can be put in a queue and executed when the hardware becomes available. Below is a small hello world example of a batch script. The script should be saved in a file, say "hello.sh". It can then be submitted by running:

  sbatch hello.sh

You may notice the lines that look like comments but begin with "#SBATCH"; these lines contain arguments that will be read by sbatch when you run the command. Please see the man page for sbatch for more details on the options.

hello.sh:
#!/bin/sh
#SBATCH --time=5:00
#SBATCH --nodes 2
#SBATCH -n 4
#SBATCH --partition=production
#SBATCH --mail-type=END # Type of email notification- BEGIN,END,FAIL,ALL 
#SBATCH --mail-user=username@cs.aau.dk
 
echo hello world
echo $SLURM_NODELIST
echo $SLURM_NPROCS
pwd
hostname

This will create an output file named slurm-JOBID.out in your current directory and will send you an email when the job is done.

To inspect the running jobs use the following:

    squeue

Cancel the job like this:

    scancel <JOBID>
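
Putting it together, a typical submit-and-inspect workflow could look like this (the job id is whatever sbatch reports back):

  sbatch hello.sh          # prints: Submitted batch job <JOBID>
  squeue -u $USER          # show only your own jobs in the queue
  cat slurm-<JOBID>.out    # read the output once the job has finished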

MPI batch jobs

The former hello world example will allocate 4 cores on two nodes. It will then run the commands on one of these cores on one of the machines. In fact, the last command, "hostname", will print out the hostname of this node (the node where the previous commands have been executed). The idea is that this machine should bootstrap the job. This can be done by reading a number of environment variables that slurm sets up - see, e.g., the last two "echo" lines.

Now, while it is indeed possible to start a job like this, it will usually be rather time consuming. Luckily, for most cases this can be avoided. The Message Passing Interface (MPI) standard defines ways to communicate that take advantage of the InfiniBand interconnect mentioned in the hardware section. Several implementations of this standard are available, and luckily they are fairly interchangeable. If you are not familiar with MPI, please find a book on the topic or read up on it elsewhere, e.g., http://www.netlib.org/utk/papers/mpi-book/mpi-book.html . MPI applications are usually developed in C, C++, Fortran and, in recent years, Python - these are usually the easiest to get started with as they have very good MPI integration.

If you already have an MPI application and just want to compile and run it, this can be done by following the procedure below (a consolidated sketch of the commands follows after the list). As an example, we will assume you have an application called mpi_array.c in your current directory, and we will use mpich as our MPI implementation (openmpi does not work well with the version of slurm we are running - however, it does in some cases provide better error messages).

  1. Before compiling, add mpich to your PATH environment variable. Your .bashrc should contain a line like this: PATH=/pack/mpich-3/bin/:$PATH; export PATH
  2. Reload your .bashrc: source ~/.bashrc
  3. Use mpicc (instead of gcc) to compile the application. Run: mpicc mpi_array.c -o mpi_array
  4. Create mpi_array.sh based on the script shown below
  5. Submit a batch job using the mpi_array binary. Run: sbatch mpi_array.sh
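
As a consolidated sketch, assuming mpich is installed in /pack/mpich-3 as above and mpi_array.sh has been written (step 4), the whole procedure looks roughly like this:

  echo 'PATH=/pack/mpich-3/bin/:$PATH; export PATH' >> ~/.bashrc   # step 1
  source ~/.bashrc                                                 # step 2
  mpicc mpi_array.c -o mpi_array                                   # step 3
  sbatch mpi_array.sh                                              # step 5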

Please note that you should NOT compile on the login1/login2 nodes, as these are virtual machines. Instead, you should start an interactive job and compile on the hardware - which also makes it easier for the compiler to optimize for the hardware.
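
For instance, one might compile from an interactive session on a compute node; the optimization flags below are just an illustration of letting the compiler target the hardware it runs on:

  srun --pty -u bash -i                               # get a shell on a compute node
  mpicc -O3 -march=native mpi_array.c -o mpi_array    # compile on the compute hardware
  exit                                                # leave the interactive job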

What will happen then is that mpirun will get the arguments listed in the job script (the #SBATCH lines) from slurm and automatically bootstrap the job on each core, use the fastest interconnect, and try to find the best way to allocate memory.

mpi_array.sh:

#!/bin/sh
#SBATCH --time=5:00
#SBATCH --nodes 2
#SBATCH --sockets-per-node 8
#SBATCH --cores-per-socket 8
#SBATCH --partition=production
mpirun mpi_array

Exclusive access to compute nodes:

By default we allow multiple users to use the same machine concurrently. E.g., if you run a job that requires 2 cores and 100 GB of memory, then other users will be allowed to use the 9000 GB and 62 cores that would otherwise be idle. For benchmarking this may be problematic, and in that case you can use the --exclusive argument to avoid this behavior. Please see the man page for more information.
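
A minimal way to request this is to add the flag to your job script, or pass it directly on the command line:

  #SBATCH --exclusive

  sbatch --exclusive hello.sh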

Debugging:

In order to debug MPI applications please see this link.

Insights into how your jobs are utilizing the hardware:
In order to get a better idea of how your software performs, you can study the monitoring data. The attached picture (mon1.png) shows how many allocations originated from other NUMA nodes for NUMA nodes 1-4 on compute1. For more details on how to read the NUMA numbers, please look up the numastat man page. To access the monitoring data you can do the following on a Linux system:
    ssh -X frontend1.mcc.uppaal.org
    username$ firefox --no-remote
<Enter the IP: 192.168.2.15 into the browser window you just opened>
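
If you just want a quick look directly on a node, numastat can also be run from an interactive job there; this is only a convenience, as the monitoring pages above give the historical view:

  srun --pty -u bash -i
  numastat           # per-NUMA-node allocation statistics, see "man numastat"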






Attachments: mon1.png (the monitoring screenshot referenced above), mpi_array.c (example MPI program), mpi_array.sh (example job script).