HPC Cluster

User Guide

A brief user guide for cluster usage.

Login

How to access the cluster
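Access is typically via SSH. A generic sketch (the submitter's address and your credentials are provided by the cluster administrators; the placeholders below are not real values):

$ ssh <username>@<submitter-address>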

Once logged in, you are inside the submitter node, from which you can manage your jobs.

What you need to use: HTCondor

HTCondor is a software system that creates a High-Throughput Computing (HTC) environment. It makes effective use of the computing power of machines connected over a network, i.e., the department's cluster. Its strength lies in its ability to harness shared resources with distributed ownership.

In other words, you need HTCondor to launch and manage your jobs (the programs and their respective environments) on the cluster.

Some preliminaries:


Since "home" is shared between each cluster node, your data can be used in all the nodes, including the submitter (where you can manage your jobs).


How to manage your jobs

Once you are logged into the cluster (see above), you are on the HTCondor submitter node, where you can build, start, remove, and otherwise manage your jobs. From here you can also access the internal registry.

Note: DO NOT build Docker images on the submitter frontend.

How to Submit a Job in HTCondor

How to submit a job

condor_submit example.sub

The command condor_submit submits a (set of) job(s) to the central manager. The number of jobs must be specified in the .sub file (see below for some examples). The central manager chooses the destination worker node(s) based on the specified characteristics/job requirements.
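A typical invocation looks like this (the cluster number is assigned by HTCondor at submission time):

$ condor_submit example.sub
Submitting job(s).
1 job(s) submitted to cluster 42.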

Example of a .sub file

The .sub file is the job descriptor and must contain the configuration of the job. It is structured as a list of key = value pairs, as shown in the example.

Example of a .sub file for the vanilla universe (see below for a more comprehensive description of universes):

*   Executable     = foo.sh
*   Universe       = vanilla
    Error          = bar/err.$(ClusterId).$(ProcId)
    Output         = bar/out.$(ClusterId).$(ProcId)
*   Log            = bar/log.$(ClusterId).$(ProcId)
    arguments      = arg1 arg2 $(ProcId)
    request_cpus   = 1
    request_memory = 1024
    request_disk   = 10240
    should_transfer_files = yes
*   Queue 1

 *  marks required fields.
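For completeness, a minimal foo.sh that this submit file could run (a sketch; your actual executable will differ, and it must have execute permission, e.g., chmod +x foo.sh):

#!/bin/bash
# print where we run and the arguments passed by HTCondor (arg1 arg2 $(ProcId))
echo "Running on host: $(hostname)"
echo "Arguments: $@"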

Parameters description

Executable = the program to execute.

Universe = vanilla or docker.

Error, Output, Log = error, output, and log files to use. In the example, they are identified by the $(ClusterId) and $(ProcId) variables. The folder containing these files MUST exist before the job is executed; otherwise an error is generated and execution is blocked.

Arguments = input(s) provided to the job as a string array.

request_* = required characteristics of the worker node on which the job will run. Although not strictly required, providing them is recommended for optimal use of the infrastructure.

Queue = this parameter (default value 1) allows running the same job several times. Each of these jobs gets an increasing value of the $(ProcId) variable.
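For example, the statement below creates five jobs that share one ClusterId, with $(ProcId) running from 0 to 4:

Queue 5
# produces jobs <ClusterId>.0, <ClusterId>.1, ..., <ClusterId>.4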

Some more complex examples of .sub file

Example 1:

Launch a batch of three jobs with different arguments (all sharing the same ClusterId):

universe   = vanilla
Executable = foo.sh
Log        = bar.$(ClusterId).$(ProcId).log
Output     = bar.$(ClusterId).$(ProcId).out
Error      = bar.$(ClusterId).$(ProcId).err

Arguments  = neutron
Queue

Arguments  = tty
Queue

Arguments  = httbar
Queue


The same task can also be submitted like this:

 

universe   = vanilla
Executable = foo.sh
Log        = bar.$(ClusterId).$(ProcId).log
Output     = bar.$(ClusterId).$(ProcId).out
Error      = bar.$(ClusterId).$(ProcId).err
Arguments  = $(Item)

Queue 1 in (neutron, tty, httbar)


where $(Item) is the variable that takes, one at a time, the values listed in the Queue statement.
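HTCondor can also read the items from an external file, which is convenient for long argument lists (a sketch; args.txt is a hypothetical file with one value per line):

Arguments  = $(Item)
Queue Item from args.txt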

 

Example 2:

Launch three jobs, each with a pair of arguments: $(Process) selects a value from ILIST, and that value in turn selects one from JLIST via the $CHOICE() macro.

 

universe    = vanilla
executable  = test.sh
arguments   = $(IVAL) $(JVAL)
output      = output_test_for_$(Process).txt
+MountData1 = False
+MountData2 = False
+MountHomes = True
ILIST       = 1,2,3
I           = $(Process)
IVAL        = $CHOICE(I, ILIST)
JLIST       = foo, bar, baz, foobar
JVAL        = $CHOICE(IVAL, JLIST)
queue 3
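To verify how such a submit file expands before actually queueing anything, condor_submit offers a dry-run mode that writes the generated job descriptions to a file (the file names here are only placeholders):

$ condor_submit -dry-run dry.out example2.sub
$ grep -i args dry.out   # inspect the arguments assigned to each of the three jobs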


For more information, please consult:
https://htcondor.readthedocs.io/en/latest/users-manual/submitting-a-job.html?highlight=submit

HTCondor's Universes

Universe


The universe is the environment in which a job executes. HTCondor can manage several universes:

vanilla, grid, java, scheduler, local, parallel, vm, container, docker


A more comprehensive list and guide is provided here: 

https://htcondor.readthedocs.io/en/latest/users-manual/choosing-an-htcondor-universe.html?highlight=universe


Note: Libraries are not, and will never be, installed directly on the nodes. This avoids compatibility problems among different research groups. However, you can build your own Docker image (see below) to set up your environment.
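As an illustration, a minimal Dockerfile for such a custom environment might look like this (the base image matches the example in the next section; the packages are assumptions to adapt to your project):

# Dockerfile (sketch)
FROM ubuntu:20.04
# install the libraries your project needs
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install numpy scipy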


Docker Universe (recommended)


This universe allows the use of public or custom containers. In this way, you can specify all the libraries your project needs, along with auxiliary files and directories. An example of a .sub file that uses this universe:


universe                = docker
docker_image            = ubuntu:20.04
executable              = /bin/cat
arguments               = /etc/hosts
transfer_input_files    = input.txt
transfer_output_files   = output.txt
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = out.$(ClusterId).$(ProcId)
error                   = err.$(ClusterId).$(ProcId)
log                     = log.$(ClusterId).$(ProcId)
request_cpus            = 1
request_memory          = 256M
request_disk            = 1024M
+MountData1             = FALSE
+MountData2             = FALSE
+MountHomes             = FALSE
queue 1


The field docker_image specifies the Docker image to download from Docker Hub.

The field should_transfer_files specifies whether files are copied into the container.

The fields transfer_input_files and transfer_output_files specify the files and directories to be transferred into and out of the container. These options are not necessary if the files are on the shared volumes.

The options +MountData1, +MountData2, and +MountHomes tell HTCondor whether the shared volumes must be visible inside the container.

 

More information here:

https://htcondor.readthedocs.io/en/latest/users-manual/docker-universe-applications.html

 

Note:

Authorized users can push their custom Docker images to the local registry, called di.registry:443, with these commands (first tag the image, then push it):

$ docker tag containerName di.registry:443/containerName
$ docker push di.registry:443/containerName

In the submit file, reference di.registry:443/containerName directly in the docker_image field.
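For example, a submit file would then contain (a sketch, with containerName as a placeholder image name):

universe     = docker
docker_image = di.registry:443/containerName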

 

To see the images already included in the local registry, use the command:

 

$ curl -k https://di.registry:443/v2/_catalog


For more information on how to interact with the department's Docker registry, consult the following site:

https://docs.docker.com/registry/spec/api/


Parallel Universe (for more expert users, only if you need parallelism)


The parallel universe allows the submission of parallel jobs. The scheduler waits until the required resources are available before executing the job across the requested slots.

 

universe      = parallel
executable    = foo.sh
log           = logfile
#input        = infile.$(Node)
output        = outfile.$(Node)
error         = errfile.$(Node)
machine_count = 2
request_cpus  = 8
should_transfer_files   = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue


The two parameters machine_count and request_cpus describe, respectively, the number of parallel processes and the CPUs required for each process. In the example above, the job claims two slots of eight cores each, i.e., 16 cores in total.


To use Open MPI, a wrapper script is needed to enable communication between the jobs.

A version of this script with default parameters is located on the submitter node (192.168.0.102) in the /OpenmpiScript folder. COPY the .sh file into your submission folder.
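For example (assuming the wrapper is named openmpiscript.sh, as in the submit file below):

$ cp /OpenmpiScript/openmpiscript.sh .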

 

Note:
In this case, the .sub file must specify how to execute the preconfigured script, providing the MPI program as the argument (the MPI tools, e.g., mpicc, mpif70, are already available in the submitter node's PATH).
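Before submitting, compile your MPI program on the submitter node; the submit file below then transfers the resulting binary (mpitest is a placeholder name) to the workers:

$ mpicc -o mpitest mpitest.c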

 

universe      = parallel
executable    = openmpiscript.sh
arguments     = mpitest
should_transfer_files   = yes
transfer_input_files    = mpitest
when_to_transfer_output = on_exit_or_evict
output        = logs/out.$(NODE)
error         = logs/err.$(NODE)
log           = logs/log
machine_count = 15
request_cpus  = 10
queue

 

More details here:

https://htcondor.readthedocs.io/en/latest/users-manual/parallel-applications.html

How to Manage your Jobs in HTCondor

After you launch a job, how can you know what's happening?

HTCondor provides some useful commands for this.


Let clusterID.procID be the identifier of a job (see above, "How to submit a job").


How to check the queues

condor_q

The command condor_q shows the status of your jobs in the queue.
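A few commonly useful invocations (a non-exhaustive sketch; see the linked manual for the full list of options):

$ condor_q                                     # summary of your jobs
$ condor_q -nobatch                            # one line per job, with clusterID.procID
$ condor_q -hold                               # held jobs and the reason they are held
$ condor_q -better-analyze clusterID.procID    # explain why a job is not running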


How to remove a job

condor_rm clusterID.procID

It removes the job clusterID.procID from the queue.


How to hold a job

condor_hold clusterID.procID

It puts the job clusterID.procID on "hold": it will not be scheduled until it is released (see below). If the job is in the "running" state, it is interrupted and the node is freed.


How to release a job

condor_release clusterID.procID

It releases the job clusterID.procID: it leaves the "hold" state and can be re-scheduled (i.e., its status is set to "idle").
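Putting it together, a typical lifecycle for a hypothetical job 123.0 (the ID is only an illustration):

$ condor_submit example.sub    # suppose this creates job 123.0
$ condor_q                     # check its status
$ condor_hold 123.0            # take it out of scheduling
$ condor_release 123.0         # back to "idle", eligible for scheduling
$ condor_rm 123.0              # or remove it for good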


For other commands and a complete HTCondor user guide, please refer to the official documentation:

https://htcondor.readthedocs.io/en/latest/users-manual/index.html