Caliburn & ELF

Caliburn and ELF Availability

The JAN-JUN 2021 allocation period for Caliburn ended on June 30, and there will not be another allocation period. Caliburn and ELF will remain available for open-access use as long as we can keep those systems operational without support. With an open-access configuration, any NJ-based researcher may gain access, but there are no dedicated compute or storage allocations. Ongoing use of Caliburn and ELF is "at your own risk": the storage systems are not backed up, the infrastructure and software are no longer supported, and a hardware failure could end access to those systems at any time. We hope to keep these systems operational for a few months.

Overview of Caliburn and ELF

Caliburn includes 560 nodes, each with 36 Intel Xeon E5-2695v4 cores, 256 GB RAM, and a single 400 GB Intel NVMe drive. All of the nodes are interconnected using the Intel Omni-Path fabric. Caliburn is managed using a 6-month compute and storage allocation process.

ELF (the original Equipment Leasing Fund cluster) includes 144 nodes, each with 24 Intel Xeon E5-2680 v3 cores, 256 GB RAM, and a single 300 GB 10K RPM hard drive. All of the nodes are interconnected using a Mellanox FDR InfiniBand fabric. Eight of the nodes have dual NVIDIA Tesla K40m GPUs. Four nodes have 768 GB RAM. ELF is managed in an open-access fashion for short-term projects.

Allocation Process

The Caliburn system is no longer available for allocations, so the following information is for reference only.

Access to Caliburn was provided through a proposal and allocation process in which anyone affiliated with an NJ-based organization could submit an allocation request. Successful proposals were granted access to a specified number of service units (SUs, similar to core-hours) and associated storage for a 6-month allocation period. The last allocation period was January 1 through June 30, 2021.

Temporary access to ELF was available to any NJ researcher for durations ranging from a few days to a few weeks. Temporary access requests followed the same procedure as the regular research proposal process, but (1) the submitted proposal could be shorter than a full research proposal and (2) access was granted on an individual-researcher basis (i.e., not to an entire group with multiple users). Because these requests required approval, they had to include all relevant details about what software would be used and how the compute resources would be used.

Resources on both Caliburn and ELF were scheduled per node rather than per core (as on Amarel), so the way researchers arranged their workflows was a bit different.

Both the Caliburn and ELF systems were scheduled for retirement in June 2021. These systems will not be integrated into the Amarel cluster and are not expected to remain available for use in any form.

Connecting to the cluster

If you are connecting from a location outside the Rutgers campus network, you must first connect to the campus network using the Rutgers VPN (virtual private network) service. See here for details and support: https://soc.rutgers.edu/vpn

Command-line access via SSH

Accessing the cluster using a command-line interface is done using an SSH (Secure Shell) connection:

ssh <NetID>@caliburn.rutgers.edu

ssh <NetID>@elf.rutgers.edu

The password to use is your standard NetID password. If you're having trouble with your NetID or password, please see https://netid.rutgers.edu for tools and support.

Moving files to/from the cluster

There are many different ways to do this: secure copy (scp), remote sync (rsync), an FTP client (e.g., FileZilla), etc.

Let's assume you're logged in to a local workstation or laptop and not connected to Caliburn. To send files from your local system to your Caliburn /home1 directory,

scp file-1.txt file-2.txt <NetID>@caliburn.rutgers.edu:/home1/<NetID>

To pull a file from your Caliburn /home1 directory to your laptop (note the “.” at the end of this command),

scp <NetID>@caliburn.rutgers.edu:/home1/<NetID>/file-1.txt .

If you want to copy an entire directory and its contents using scp, you’ll need to “package” your directory into a single, compressed file before moving it:

tar -czf my-directory.tar.gz my-directory

After moving it, you can unpack that .tar.gz file to get your original directory and contents:

tar -xzf my-directory.tar.gz

A handy way to synchronize a local file or entire directory between your local workstation and the Caliburn cluster is to use the rsync utility. First, let's sync a local (recently updated) directory with the same directory stored on Caliburn:

rsync -trlvpz work-dir gc563@caliburn.rutgers.edu:/home1/gc563/work-dir

In this example, the rsync options I'm using are:

  • t (preserve modification times)

  • r (recursive, sync all subdirectories)

  • l (preserve symbolic links)

  • v (verbose, show all details)

  • p (preserve permissions)

  • z (compress transferred data)

To sync a local directory with updated data from Caliburn:

rsync -trlvpz gc563@caliburn.rutgers.edu:/home1/gc563/work-dir work-dir

Here, we've simply reversed the order of the local and remote locations.

For added security, you can explicitly use SSH for the data transfer by adding the -e option followed by the remote shell to use (ssh, in this case):

rsync -trlvpze ssh gc563@caliburn.rutgers.edu:/home1/gc563/work-dir work-dir

Important notes about storage on Caliburn

Caliburn's storage systems are not backed up. The resources made available for operating Caliburn do not include a backup system, so it's important for research users to note that no data stored within the Caliburn or ELF systems is protected by backups.

We strongly encourage all users to back up important data elsewhere (e.g., Box.rutgers.edu, Google Drive, O365 OneDrive, or the /home or /projects directories of the Amarel cluster). Please consider the sensitivity of your data before selecting a remote storage resource (see the Rutgers data classification and storage matrix).
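For example, here is a minimal sketch of using rsync to push a copy of a Caliburn directory to Amarel, run from a Caliburn login node (this assumes you already have an Amarel account and that amarel.rutgers.edu is the Amarel login host; check the Amarel user guide for current details):

# Push a copy of work-dir from Caliburn to a backup directory in your Amarel /home
rsync -trlvpz /home1/<NetID>/work-dir <NetID>@amarel.rutgers.edu:/home/<NetID>/caliburn-backup/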

Checking storage utilization

When you run out of available storage space in /home, /scratch, or /projects, that directory becomes unusable and files must be deleted or moved before you will be able to use that space again.
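If a directory is full and you need to decide what to delete or move, a quick way to see which items are using the most space is the standard du utility (a generic Linux example, not specific to Caliburn; adjust the path for the directory you want to inspect):

du -sh /home1/<NetID>/* | sort -h     # size of each item in your /home1 directory, largest listed last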

All Caliburn users can use the mmlsquota command to display information about quota limits.

Checking storage utilization in my $HOME directory:

[gc563@caliburn2 ~]$ quota -vs -f /home1
Disk quotas for user gc563 (uid 2275):
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
storage:/data/home1
                  8032M  10240M  10240M          77869       0       0

I'm using about 8 GB in my $HOME directory and I have a quota of 10 GB, so I have only about 2 GB of space remaining to use.

Checking storage utilization in my $SCRATCH directory:

[gc563@caliburn2 PNP]$ mmlsquota --block-size=auto gpfs | (head -2 && tail -2 | head -1)
                           Block Limits                             |     File Limits
Filesystem Fileset  type   blocks   quota   limit  in_doubt  grace  |  files  quota  limit  in_doubt  grace  Remarks
gpfs       scratch  USR    45.04G    100G    100G         0   none  |   3523      0      0         0   none  gpfs.nsd1

I'm using about 45 GB in my $SCRATCH directory and I have a quota of 100 GB, so I have about 54 GB of space remaining to use.

Checking storage utilization in a project's directories ($PROJECT and $STAGING). In this case, looking at the "jbv9-001" project:

[gc563@caliburn2 PNP]$ mmlsquota -g jbv9-001 --block-size=auto gpfs | (head -4 && tail -4 | head -2)
Disk quotas for group jbv9-001 (gid 5023):
                            Block Limits                             |      File Limits
Filesystem Fileset   type   blocks   quota   limit  in_doubt  grace  |    files  quota  limit  in_doubt  grace  Remarks
gpfs       project1  GRP    46.06G    100G    100G         0   none  |   423626      0      0         0   none  gpfs.nsd1
gpfs       staging   GRP         0    100G    100G         0   none  |        2      0      0         0   none  gpfs.nsd1

I'm using about 46 GB in my $PROJECT/jbv9-001 directory and not yet using any of the space in my $STAGING/jbv9-001 directory. Each has a quota of 100 GB, so I have about 54 GB of space remaining in $PROJECT and the full 100 GB available in $STAGING.

Specifying job resource limits

Caliburn and ELF resources are allocated in a "per node" fashion, which means you will be granted exclusive access to each node assigned to your job. For that reason, there is no need to set memory limits for a job.
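For example, here is a minimal sketch of the node-related directives for a 2-node job (the node count is just a placeholder):

#SBATCH --nodes=2                 # request 2 whole nodes (exclusive access)
#SBATCH --ntasks-per-node=36      # Caliburn nodes have 36 cores each
# No --mem directive is needed: each node's full memory is available to the job.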

Partition and QoS selection:

Resources on Caliburn are managed using a combination of partitions and associated QoS settings, which enable limits that help control utilization and fair-sharing of the available compute nodes.

The available partitions are:

  • large

  • xlarge

  • long

  • serial

  • short

  • largemem

  • main

  • gpu (only for the ELF system, not Caliburn)

Note that the main partition is open-access.
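To see the partitions and QoS limits currently configured on the system you're logged into, you can use the standard Slurm query commands (a generic Slurm sketch; the exact output depends on the cluster's configuration):

sinfo -s                                          # summary of partitions, node counts, and time limits
sacctmgr show qos format=Name,MaxWall,MaxTRESPU   # QoS names with wall-time and per-user resource limits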

For all jobs, you must specify a valid partition and QoS combination:

#SBATCH --partition=<partition>

#SBATCH --qos=<partition_allowed_qos>

For example, to run a short (4-hour) job associated with an awarded allocation, I would use:

#SBATCH --partition=short

#SBATCH --qos=short-award
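Putting these pieces together, here is a minimal sketch of a complete job script for the short partition (the job name, node count, and application command are placeholders to replace with your own):

#!/bin/bash
#SBATCH --job-name=my_test_job        # placeholder job name
#SBATCH --partition=short             # partition selected for this job
#SBATCH --qos=short-award             # a QoS allowed for the short partition
#SBATCH --nodes=1                     # whole nodes are allocated, so no --mem is needed
#SBATCH --ntasks-per-node=36          # use all 36 cores on a Caliburn node
#SBATCH --time=04:00:00               # 4-hour wall time
#SBATCH --output=slurm.%j.out         # write job output to a file named by job ID

srun ./my_application                 # placeholder: replace with your own executable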

The table below summarizes the limits associated with each available QoS (max wall times are in DD-HH:MM:SS format):

Caliburn Resource Limits