HPC Cluster
User Guide
A brief user guide for cluster usage.
Login
How to access the cluster
Get an account (only once)
To get an account, ask your group leader (usually your Ph.D. advisor or your thesis supervisor) to create it for you.
Add your email to the cluster mailing list (optional but highly suggested)
Once your account is activated, it is highly suggested that you add your email to the cluster mailing list at:
https://docs.google.com/spreadsheets/d/1wEW7PfDTxN1Gsayc3M8xC-_vNGGgXLphqBswsQmT47o/edit?usp=sharing
You only have to insert your email in the first free row (one email per row).
In this way, you will receive updates about important cluster events, such as scheduled maintenance, news, etc.
Connect to the Sapienza network
You have three possibilities:
1. Be connected to a computer in the Sapienza network.
2. Use the Sapienza VPN (follow the instructions reported here: https://web.uniroma1.it/infosapienza/servizio-vpn-di-ateneo).
NOTE: This is possible ONLY for PhD Students, Researchers, and Professors since it requires a @uniroma1.it email to access.
3. Use the Computer Science Dept. VPN. It is available at the following link: https://drive.google.com/file/d/14G95lT9PExqIJ1xf942dh_1G22wlQYr1/view?usp=sharing
Download and install it, then log in with the usual credentials (email and password).
NOTE: Unlike the Sapienza VPN, with this one you have to connect directly to the submitter node (IP: 192.168.0.102).
NOTE: This VPN is available to students as well as to PhD Students, Researchers, and Professors. In any case, you still need a cluster account (see "Get an account" above); otherwise you can enter the Sapienza network but still not use the cluster.
Connect to the cluster
Once you are inside the Sapienza network (see the previous point), choose one of the two options:
Option 1 (two hops):
ssh user@151.100.174.45
ssh user@submitter
Option 2 (single command, using a jump host):
ssh -J user@151.100.174.45 user@submitter
Now you are inside the submitter node where you can manage your jobs.
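If you connect frequently, you can let SSH perform the jump automatically with a client-side configuration. The following is only a minimal sketch, assuming an OpenSSH client on your own machine; the host aliases and the placeholder yourUser are illustrative and must be adapted.
# ~/.ssh/config (on your own machine)
Host di-gateway
    HostName 151.100.174.45
    User yourUser
Host di-submitter
    HostName submitter
    User yourUser
    ProxyJump di-gateway
With this in place, ssh di-submitter behaves like the ssh -J command above.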
What you need to use: HTCondor
HTCondor is a software system that creates a High-Throughput Computing (HTC) environment. It effectively uses the computing power of machines connected over a network, i.e., the department's cluster. Its power comes from the ability to effectively harness shared resources with distributed ownership.
In other words, you need HTCondor to launch and manage your jobs (the programs and their respective environments) on the cluster.
Some preliminaries:
The cluster mounts the following shared volumes:
/data1 (use this for your job and data)
/data2
/home (this is how HTCondor manages credentials across the different nodes).
Since /home is shared among all cluster nodes, your data can be used on every node, including the submitter (where you manage your jobs).
How to manage your jobs
Once you are logged into the cluster (see above), you are on the HTCondor submitter node, where you can build, start, remove, and manage your jobs. From here, it is also possible to access the internal registry.
Note: DO NOT build Docker images on the submitter frontend.
How to Submit a Job in HTCondor
How to submit a job
condor_submit example.sub
The command condor_submit submits a (set of) job(s) to the central manager. The number of jobs must be specified in the .sub file (see below for some examples). The central manager will choose the destination worker node(s) based on the job requirements/characteristics you specify.
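For reference, a successful submission should print something like the following, where the cluster number (42 here) is just an example; this is the value you will later see as $(ClusterId) and in condor_q:
$ condor_submit example.sub
Submitting job(s).
1 job(s) submitted to cluster 42.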
Example of .sub file.
The .sub file is the job descriptor, which must contain the configuration of the job. It is structured as a dictionary, as shown in the example.
Example of .sub file for the Vanilla Universe (see below for a more comprehensive description of Universe)
* Executable = foo.sh
* Universe = vanilla
Error = bar/err.$(ClusterId).$(ProcId)
Output = bar/out.$(ClusterId).$(ProcId)
* Log = bar/log.$(ClusterId).$(ProcId)
arguments = arg1 arg2 $(ProcId)
request_cpus = 1
request_memory = 1024
request_disk = 10240
should_transfer_files = yes
* Queue 1
* represents required fields.
Parameter descriptions
Executable = job to execute
Universe = vanilla or docker
Error, Output, Log = Error, Output, and Log files to use. In the example, they are identified by the ClusterId and ProcId variables. The folder that contains these files MUST exist before the job is executed; otherwise, an error is generated and the execution is blocked.
Arguments = Input(s) provided to the job as a string array.
request_* = required characteristics of the worker node to which the job will be submitted. Even though they are not strictly required, it is suggested to provide them for an ideal use of the infrastructure.
Queue = This parameter, whose default value is 1, allows you to run the same job several times. Each of these jobs will have an increasing value of the $(ProcId) variable.
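Putting the pieces together, a minimal end-to-end sketch could look like this. foo.sh is a hypothetical toy executable and bar/ is the log folder that, as noted above, must exist before submission.
foo.sh (toy job script):
#!/bin/bash
# print the arguments received and the worker node that runs the job
echo "args: $@"
hostname
On the submitter node:
$ chmod +x foo.sh
$ mkdir -p bar                 # create the Error/Output/Log folder beforehand
$ condor_submit example.sub    # the .sub file shown above
$ condor_q                     # check the status of the submitted job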
Some more complex examples of .sub file
Example 1:
Launch a bunch of jobs (3) with different arguments (with the same ClusterId for the whole bunch)
universe = vanilla
Executable = foo.sh
Log = bar.$(ClusterId).$(ProcID).log
Output = bar.$(ClusterId).$(ProcID).out
Error = bar.$(ClusterId).$(ProcID).err
Arguments = neutron
Queue
Arguments = tty
Queue
Arguments = httbar
Queue
The same task can also be submitted like this:
universe = vanilla
Executable = foo.sh
Log = bar.$(ClusterId).$(ProcID).log
Output = bar.$(ClusterId).$(ProcID).out
Error = bar.$(ClusterId).$(ProcID).err
Arguments = $(Item)
Queue 1 in (neutron, tty, httbar)
where $(Item) is the variable that takes the values listed in the Queue statement.
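An equivalent form uses the standard HTCondor "queue ... from" syntax with a list of items; the following is only an alternative sketch of the same three-job submission:
universe = vanilla
Executable = foo.sh
Log = bar.$(ClusterId).$(ProcID).log
Output = bar.$(ClusterId).$(ProcID).out
Error = bar.$(ClusterId).$(ProcID).err
queue Arguments from (
  neutron
  tty
  httbar
)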
Example 2:
Launch 3 jobs, each with a pair of arguments picked from two lists (ILIST and JLIST) via the $CHOICE macro function:
universe = vanilla
executable = test.sh
arguments = $(IVAL) $(JVAL)
output = output_test_for_$(Process).txt
+MountData1 = False
+MountData2 = False
+MountHomes = True
ILIST = 1,2,3
I = $(Process)
IVAL = $CHOICE(I, ILIST)
JLIST = foo, bar, baz, foobar
JVAL = $CHOICE(IVAL, JLIST)
queue 3
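To see what these macros expand to: $CHOICE(index, list) returns the element of the list at the given position, counting from 0, and I takes the value of $(Process). Under that reading (worth double-checking against the HTCondor manual for your version), the three queued jobs should receive approximately the following arguments:
Process = 0: I = 0, IVAL = 1, JVAL = bar, so arguments = 1 bar
Process = 1: I = 1, IVAL = 2, JVAL = baz, so arguments = 2 baz
Process = 2: I = 2, IVAL = 3, JVAL = foobar, so arguments = 3 foobar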
For more information, please consult:
https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html?highlight=submit
HTCondor's Universes
Universe
The universe is the execution environment of a job. HTCondor can manage several universes:
vanilla, grid, java, scheduler, local, parallel, vm, container, docker
A more comprehensive list and guide is provided here:
Note: Libraries are not, and never will be, installed directly on the nodes. This avoids compatibility problems among different research groups. However, you can build your own Docker image (see below) to set up your environment.
Docker Universe (suggested)
This universe allows the use of public or handmade containers. In this way, you can specify all the libraries that your project needs, as well as auxiliary files and directories. An example of a .sub file that uses this universe is provided here:
universe = docker
docker_image = ubuntu:20.04
executable = /bin/cat
arguments = /etc/hosts
transfer_input_files = input.txt
transfer_output_files = output.txt
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = out.$(ClusterId).$(ProcId)
error = err.$(ClusterId).$(ProcId)
log = log.$(ClusterId).$(ProcId)
request_cpus = 1
request_memory = 256M
request_disk = 1024M
+MountData1 = FALSE
+MountData2 = FALSE
+MountHomes = FALSE
queue 1
The field docker_image specifies the Docker image to download from Docker Hub.
The field should_transfer_files specifies whether files should be copied into the container or not.
The fields transfer_input_files and transfer_output_files specify the files and directories that need to be transferred into and out of the container.
These options are not necessary if the files are in the shared volumes.
The options MountData1, MountData2, and MountHomes tell HTCondor whether the shared volumes must be visible inside the container or not.
More information here:
https://htcondor.readthedocs.io/en/latest/users-manual/docker-universe-applications.html
Note:
Allowed users can push their handmade Docker images to the local registry, called di.registry:443, with these commands:
$ docker build -t di.registry:443/containerName .
$ docker push di.registry:443/containerName
In the submission file, specify di.registry:443/containerName directly in the docker_image field.
To list the images already included in the local registry, use the command:
$ curl -k https://di.registry:443/v2/_catalog
For more information on how to manipulate the department docker registry, consult the following site:
https://docs.docker.com/registry/spec/api/
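As a sketch of the complete cycle (build your own image, push it to the local registry, and use it from the docker universe): the Dockerfile content, the image name myproject, and the installed packages below are purely illustrative, and the build must be done on a machine where image builds are allowed and di.registry is reachable, NOT on the submitter frontend.
$ cat > Dockerfile <<'EOF'
FROM ubuntu:20.04
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y python3 python3-pip
RUN pip3 install numpy pandas
EOF
$ docker build -t di.registry:443/myproject:latest .
$ docker push di.registry:443/myproject:latest
In the .sub file, you would then set docker_image = di.registry:443/myproject:latest.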
Parallel Universe (for more expert users, only if you need parallelism)
The Parallel universe allows the submission of parallel jobs. The scheduler waits until the required resources are available before executing the jobs in the requested slots.
universe = parallel
executable = foo.sh
log = logfile
#input = infile.$(Node)
output = outfile.$(Node)
error = errfile.$(Node)
machine_count = 2
request_cpus = 8
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
The two parameters machine_count and request_cpus respectively describe the number of parallel processes and the required CPUs for each process.
To use Open MPI, it is necessary to use a script that sets up the communication between the jobs.
A script with default parameters is located on the submitter node (192.168.0.102) in the /OpenmpiScript folder. You can COPY the .sh file into your submission folder.
Note:
In this case, the .sub file must specify the preconfigured script as the executable and provide the MPI program to run as its argument (e.g., mpicc, mpif70 are already available in the submitter node PATH).
universe = parallel
executable = openmpiscript.sh
arguments = mpitest
should_transfer_files = yes
transfer_input_files = mpitest
when_to_transfer_output = on_exit_or_evict
output = logs/out.$(NODE)
error = logs/err.$(NODE)
log = logs/log
machine_count = 15
request_cpus = 10
queue
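Before submitting, the MPI program must be compiled and the helper script copied next to the .sub file. A possible sequence on the submitter, assuming a hypothetical source file mpitest.c and calling the submit file mpi_example.sub, is:
$ cp /OpenmpiScript/openmpiscript.sh .    # helper script with default parameters (see above)
$ mpicc -o mpitest mpitest.c              # mpicc is already available in the submitter PATH
$ mkdir -p logs                           # output/error/log folder used by the .sub file above
$ condor_submit mpi_example.sub           # the submit description shown above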
More details at:
https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html
How to Manage your Jobs in HTCondor
After you launch a job, how can you know what's happening?
HTCondor provides you with some useful commands to deal with it.
Let clusterID.procID be the identifier of a job (see above "How to submit a job").
How to check the queues
condor_q
The command condor_q allows you to check the status of your jobs.
Options:
-nobatch Shows one job per line (it deactivates the -batch option).
-global Shows all the active queues in the pool.
-hold Returns information about the jobs whose status is "hold". Use the -analyze or -better-analyze options for more information.
-run Returns information about the jobs whose status is "running". It must be used with the -nobatch option.
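Some typical invocations (42.0 below is just a placeholder clusterID.procID):
$ condor_q                        # summary of your jobs, grouped in batches
$ condor_q -nobatch               # one line per job
$ condor_q -nobatch -run          # only the running jobs
$ condor_q -hold                  # held jobs with their hold reason
$ condor_q -better-analyze 42.0   # explain why job 42.0 is not running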
How to remove a job
condor_rm clusterID.procID
It removes the clusterID.procID job from the cluster.
How to hold a job
condor_hold clusterID.procID
It puts the clusterID.procID job in "hold". In this way, it will not be scheduled until it is released (see below). If the clusterID.procID job is in the "running" status, it will be interrupted, and the node will be freed.
How to release a job
condor_release clusterID.procID
It releases the clusterID.procID job. In this way, it will leave the "hold" status and can be re-scheduled (i.e., its status is set to "idle").
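A short hold/release cycle might therefore look like this (42.0 is again a placeholder job identifier):
$ condor_hold 42.0       # the job is put on hold and stops being scheduled
$ condor_q -hold         # verify the hold and its reason
$ condor_release 42.0    # the job goes back to "idle" and can be scheduled again
$ condor_q -nobatch      # check the updated status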
For other commands and the complete user guide, please refer to the official HTCondor documentation:
https://htcondor.readthedocs.io/en/latest/users-manual/index.html