HPC Cluster
User Guide
A brief user guide for cluster usage.
Login
How to access the cluster
Get an account (only once)
To get an account, ask your group leader (usually your Ph.D. advisor or your thesis supervisor) to create it for you.
Add your email to the cluster mailing list (optional but highly suggested)
Once your account is activated, it is highly suggested that you add your email to the cluster mailing list at:
https://docs.google.com/spreadsheets/d/1wEW7PfDTxN1Gsayc3M8xC-_vNGGgXLphqBswsQmT47o/edit?usp=sharing
You only have to insert your email in the first free row (one email per row).
In this way, you will receive updates about important cluster events, such as scheduled maintenance, news, etc.
Connect to the Sapienza network
You have three possibilities:
1. Be connected to a computer in the Sapienza network.
2. Use the Sapienza VPN (follow the instructions reported here: https://web.uniroma1.it/infosapienza/servizio-vpn-di-ateneo).
NOTE: This is possible ONLY for PhD Students, Researchers, and Professors since it requires a @uniroma1.it email to access.
3. Use the Computer Science Dept. VPN. It is available at the following link: https://drive.google.com/file/d/14G95lT9PExqIJ1xf942dh_1G22wlQYr1/view?usp=sharing
Download and install it, then log in with the usual credentials (email and password).
NOTE: Unlike the Sapienza VPN, with this one you have to connect directly to the submitter node (IP: 192.168.0.102).
NOTE: This VPN is available to students as well as to PhD Students, Researchers, and Professors. In any case, you still need a cluster account (see "Get an account" above); otherwise you can enter the Sapienza network but still not use the cluster.
Connect to the cluster
Once you are inside the Sapienza network (see the previous point), choose one of the two options:
Option 1 (two hops):
ssh user@151.100.174.45
ssh user@submitter
Option 2 (single command, using a jump host):
ssh -J user@151.100.174.45 user@submitter
Now you are inside the submitter node where you can manage your jobs.
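If you connect frequently, you can let SSH perform the jump automatically with a client-side configuration. The following is only a minimal sketch, assuming an OpenSSH client on your own machine; the host aliases and the placeholder yourUser are illustrative and must be adapted.
# ~/.ssh/config (on your own machine)
Host di-gateway
    HostName 151.100.174.45
    User yourUser
Host di-submitter
    HostName submitter
    User yourUser
    ProxyJump di-gateway
With this in place, ssh di-submitter behaves like the ssh -J command above.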
What you need to use: HTCondor
HTCondor is a software system that creates a High-Throughput Computing (HTC) environment. It effectively uses the computing power of machines connected over a network, i.e., the department's cluster. Its power comes from the ability to effectively harness shared resources with distributed ownership.
In other words, you need HTCondor to launch and manage your jobs (the programs and their respective environments) on the cluster.
Some preliminaries:
The cluster mounts the following shared volumes:
/data1 (use this for your job and data)
/data2
/home (this is how HTCondor manages credentials across the different nodes).
Since /home is shared among all cluster nodes, your data can be used on every node, including the submitter (where you manage your jobs).
How to manage your jobs
Once you are logged into the cluster (see above), you are on the HTCondor submitter node, where you can build, start, remove, and manage your jobs. From here, it is also possible to access the internal registry.
Note: DO NOT build Docker images on the submitter frontend.
How to Submit a Job in HTCondor
How to submit a job
condor_submit example.sub
The command condor_submit submits a (set of) job(s) to the central manager. The number of jobs must be specified in the .sub file (see below for some examples). The central manager will choose the destination worker node(s) based on the job requirements/characteristics you specify.
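For reference, a successful submission should print something like the following, where the cluster number (42 here) is just an example; this is the value you will later see as $(ClusterId) and in condor_q:
$ condor_submit example.sub
Submitting job(s).
1 job(s) submitted to cluster 42.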
Example of .sub file.
The .sub file is the job descriptor, which must contain the configuration of the job. It is structured as a dictionary, as shown in the example.
Example of .sub file for the Vanilla Universe (see below for a more comprehensive description of Universe)
* Executable = foo.sh
* Universe = vanilla
Error = bar/err.$(ClusterId).$(ProcId)
Output = bar/out.$(ClusterId).$(ProcId)
* Log = bar/log.$(ClusterId).$(ProcId)
arguments = arg1 arg2 $(ProcId)
request_cpus = 1
request_memory = 1024
request_disk = 10240
should_transfer_files = yes
* Queue 1
* represents required fields.
Parameter descriptions
Executable = job to execute
Universe = vanilla or docker
Error, Output, Log = Error, Output, and Log files to use. In the example, they are identified by the ClusterId and ProcId variables. The folder that contains these files MUST exist before the job is executed; otherwise, an error is generated and the execution is blocked.
Arguments = Input(s) provided to the job as a string array.
request_* = required characteristics of the worker node to which the job will be submitted. Even though they are not strictly required, it is suggested to provide them for an ideal use of the infrastructure.
Queue = This parameter, whose default value is 1, allows you to run the same job several times. Each of these jobs will have an increasing value of the $(ProcId) variable.
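Putting the pieces together, a minimal end-to-end sketch could look like this. foo.sh is a hypothetical toy executable and bar/ is the log folder that, as noted above, must exist before submission.
foo.sh (toy job script):
#!/bin/bash
# print the arguments received and the worker node that runs the job
echo "args: $@"
hostname
On the submitter node:
$ chmod +x foo.sh
$ mkdir -p bar                 # create the Error/Output/Log folder beforehand
$ condor_submit example.sub    # the .sub file shown above
$ condor_q                     # check the status of the submitted job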
Some more complex examples of .sub file
Example 1:
Launch a bunch of jobs (3) with different arguments (with the same ClusterId for the whole bunch)
universe = vanilla
Executable = foo.sh
Log = bar.$(ClusterId).$(ProcID).log
Output = bar.$(ClusterId).$(ProcID).out
Error = bar.$(ClusterId).$(ProcID).err
Arguments = neutron
Queue
Arguments = tty
Queue
Arguments = httbar
Queue
The same task can also be submitted like this:
universe = vanilla
Executable = foo.sh
Log = bar.$(ClusterId).$(ProcID).log
Output = bar.$(ClusterId).$(ProcID).out
Error = bar.$(ClusterId).$(ProcID).err
Arguments = $(Item)
Queue 1 in (neutron, tty, httbar)
where $(Item) is the variable that takes the values listed in the Queue statement.
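An equivalent form uses the standard HTCondor "queue ... from" syntax with a list of items; the following is only an alternative sketch of the same three-job submission:
universe = vanilla
Executable = foo.sh
Log = bar.$(ClusterId).$(ProcID).log
Output = bar.$(ClusterId).$(ProcID).out
Error = bar.$(ClusterId).$(ProcID).err
queue Arguments from (
  neutron
  tty
  httbar
)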
Example 2:
Launch 3 jobs, each with a pair of arguments picked from two lists (ILIST and JLIST) via the $CHOICE macro function:
universe = vanilla
executable = test.sh
arguments = $(IVAL) $(JVAL)
output = output_test_for_$(Process).txt
+MountData1 = False
+MountData2 = False
+MountHomes = True
ILIST = 1,2,3
I = $(Process)
IVAL = $CHOICE(I, ILIST)
JLIST = foo, bar, baz, foobar
JVAL = $CHOICE(IVAL, JLIST)
queue 3
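To see what these macros expand to: $CHOICE(index, list) returns the element of the list at the given position, counting from 0, and I takes the value of $(Process). Under that reading (worth double-checking against the HTCondor manual for your version), the three queued jobs should receive approximately the following arguments:
Process = 0: I = 0, IVAL = 1, JVAL = bar, so arguments = 1 bar
Process = 1: I = 1, IVAL = 2, JVAL = baz, so arguments = 2 baz
Process = 2: I = 2, IVAL = 3, JVAL = foobar, so arguments = 3 foobar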
For more information, please consult:
https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html?highlight=submit
HTCondor's Universes
Universe
The universe is the execution environment of a job. HTCondor can manage several universes:
vanilla, grid, java, scheduler, local, parallel, vm, container, docker
A more comprehensive list and guide is provided here:
Note: Libraries are not, and never will be, installed directly on the nodes. This avoids compatibility problems among different research groups. However, you can build your own Docker image (see below) to set up your environment.
Docker Universe (suggested)
This universe allows the use of public or handmade containers. In this way, you can specify all the libraries that your project needs, as well as auxiliary files and directories. An example of a .sub file that uses this universe is provided here:
universe = docker
docker_image = ubuntu:20.04
executable = /bin/cat
arguments = /etc/hosts
transfer_input_files = input.txt
transfer_output_files = output.txt
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
output = out.$(ClusterId).$(ProcId)
error = err.$(ClusterId).$(ProcId)
log = log.$(ClusterId).$(ProcId)
request_cpus = 1
request_memory = 256M
request_disk = 1024M
+MountData1 = FALSE
+MountData2 = FALSE
+MountHomes = FALSE
queue 1
The field docker_image specifies the Docker image to download from Docker Hub.
The field should_transfer_files specifies whether files should be copied into the container or not.
The fields transfer_input_files and transfer_output_files specify the files and directories that need to be transferred into and out of the container.
These options are not necessary if the files are in the shared volumes.
The options MountData1, MountData2, and MountHomes tell HTCondor whether the shared volumes must be visible inside the container or not.
More information here:
https://htcondor.readthedocs.io/en/latest/users-manual/docker-universe-applications.html
Note:
Allowed users can push their handmade Docker images to the local registry, called di.registry:443, with these commands:
$ docker build -t di.registry:443/containerName .
$ docker push di.registry:443/containerName
In the submission file, specify di.registry:443/containerName directly in the docker_image field.
To list the images already included in the local registry, use the command:
$ curl -k https://di.registry:443/v2/_catalog
For more information on how to manipulate the department docker registry, consult the following site:
https://docs.docker.com/registry/spec/api/
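As a sketch of the complete cycle (build your own image, push it to the local registry, and use it from the docker universe): the Dockerfile content, the image name myproject, and the installed packages below are purely illustrative, and the build must be done on a machine where image builds are allowed and di.registry is reachable, NOT on the submitter frontend.
$ cat > Dockerfile <<'EOF'
FROM ubuntu:20.04
RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y python3 python3-pip
RUN pip3 install numpy pandas
EOF
$ docker build -t di.registry:443/myproject:latest .
$ docker push di.registry:443/myproject:latest
In the .sub file, you would then set docker_image = di.registry:443/myproject:latest.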
Parallel Universe (for more expert users, only if you need parallelism)
The Parallel universe allows the submission of parallel jobs. The scheduler waits until the required resources are available before executing the jobs in the requested slots.
universe = parallel
executable = foo.sh
log = logfile
#input = infile.$(Node)
output = outfile.$(Node)
error = errfile.$(Node)
machine_count = 2
request_cpus = 8
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
The two parameters machine_count and request_cpus respectively describe the number of parallel processes and the required CPUs for each process.
To use Open MPI, it is necessary to use a script that sets up the communication between the jobs.
A script with default parameters is located on the submitter node (192.168.0.102) in the /OpenmpiScript folder. You can COPY the .sh file into your submission folder.
Note:
In this case, the .sub file must specify the preconfigured script as the executable and provide the MPI program to run as its argument (e.g., mpicc, mpif70 are already available in the submitter node PATH).
universe = parallel
executable = openmpiscript.sh
arguments = mpitest
should_transfer_files = yes
transfer_input_files = mpitest
when_to_transfer_output = on_exit_or_evict
output = logs/out.$(NODE)
error = logs/err.$(NODE)
log = logs/log
machine_count = 15
request_cpus = 10
queue
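Before submitting, the MPI program must be compiled and the helper script copied next to the .sub file. A possible sequence on the submitter, assuming a hypothetical source file mpitest.c and calling the submit file mpi_example.sub, is:
$ cp /OpenmpiScript/openmpiscript.sh .    # helper script with default parameters (see above)
$ mpicc -o mpitest mpitest.c              # mpicc is already available in the submitter PATH
$ mkdir -p logs                           # output/error/log folder used by the .sub file above
$ condor_submit mpi_example.sub           # the submit description shown above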
More details at:
https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html
How to Manage your Jobs in HTCondor
After you launch a job, how can you know what's happening?
HTCondor provides you with some useful commands to deal with it.
Let clusterID.procID be the identifier of a job (see above "How to submit a job").
How to check the queues
condor_q
The command condor_q allows you to check the status of your jobs.
Options:
-nobatch Shows one job per line (it deactivates the -batch option).
-global Shows all the active queues in the pool.
-hold Returns information about the jobs whose status is "hold". Use the -analyze or -better-analyze options for more information.
-run Returns information about the jobs whose status is "running". It must be used with the -nobatch option.
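Some typical invocations (42.0 below is just a placeholder clusterID.procID):
$ condor_q                        # summary of your jobs, grouped in batches
$ condor_q -nobatch               # one line per job
$ condor_q -nobatch -run          # only the running jobs
$ condor_q -hold                  # held jobs with their hold reason
$ condor_q -better-analyze 42.0   # explain why job 42.0 is not running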
How to remove a job
condor_rm clusterID.procID
It removes the clusterID.procID job from the cluster.
How to hold a job
condor_hold clusterID.procID
It puts the clusterID.procID job in "hold". In this way, it will not be scheduled until it is released (see below). If the clusterID.procID job is in the "running" status, it will be interrupted, and the node will be freed.
How to release a job
condor_release clusterID.procID
It releases the clusterID.procID job. In this way, it will leave the "hold" status and can be re-scheduled (i.e., its status is set to "idle").
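A short hold/release cycle might therefore look like this (42.0 is again a placeholder job identifier):
$ condor_hold 42.0       # the job is put on hold and stops being scheduled
$ condor_q -hold         # verify the hold and its reason
$ condor_release 42.0    # the job goes back to "idle" and can be scheduled again
$ condor_q -nobatch      # check the updated status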
For other commands and the complete user guide, please refer to the official HTCondor documentation:
https://htcondor.readthedocs.io/en/latest/users-manual/index.html