Google Cloud Bursting

Current Bursting Status

In some cases, HPC may provide bursting capabilities to researchers or classes in order to augment the available on-premises resources. Bursting is ideal when you need a large amount of resources for a short period of time.

Bursting is made possible by running a scalable Slurm cluster on Google Cloud Platform (GCP), separate from the on-premises HPC clusters.

Bursting is not available to all users and requires advance approval. To get access to these capabilities, please contact hpc@nyu.edu to check your eligibility. Let us know the amount of storage, the total number of CPUs, the amount of memory, the number of GPUs, the number of days you require access, and the estimated total CPU/GPU hours you will use. For reference, please review the GCP cost calculator and send a copy of your cost estimate to hpc@nyu.edu as well.
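As a simple worked example of estimating usage: a job that keeps 100 CPUs busy for 2 days (48 hours) corresponds to 100 × 48 = 4,800 CPU hours.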

To request access to the HPC Bursting capabilities, please complete this form.

Running a Bursting Job

Note: bursting is not public; it is available only by request to eligible classes or researchers.

ssh <NetID>@greene.hpc.nyu.edu

Then ssh to the burst login node on GCP. Anyone can log in, but you can only submit jobs if you have approval:

ssh burst 

Start an interactive job 

srun --account=hpc --partition=interactive --pty /bin/bash

If you get the error "Invalid account or account/partition combination specified", your account is not approved to use cloud bursting.

Once your files are copied to the bursting instance, you can run a batch job from the interactive session, for example with a script like the sketch below.
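A minimal batch script sketch is shown below. The job name, resource values, and <partition> placeholder are not taken from this page; substitute the account and partition you were approved for.

#!/bin/bash
#SBATCH --account=hpc
#SBATCH --partition=<partition>
#SBATCH --job-name=myjob
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=1:00:00
#SBATCH --output=slurm_%j.out

# Replace with the commands your job should run on the burst compute node
echo "Running on $(hostname)"

Save it as, for example, myjob.sbatch (a hypothetical name) and submit it from the interactive session with sbatch myjob.sbatch.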

Access to Slurm Partitions

In the example above, the partition "interactive" is used.

You can list the current partitions with the command

sinfo

However, approval is required to submit jobs to a partition. Partitions are defined by the resources available to a job, such as the number of CPUs, the amount of memory, and the number of GPUs. Please email hpc@nyu.edu to request access to a specific partition or to request a new partition (e.g. 10 CPUs and 64 GB of memory) for better cost/performance for your jobs.
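For example, once you have been granted access to a partition sized for your job, you can request it explicitly; the partition name c10m64 below is hypothetical and only illustrates the pattern:

srun --account=hpc --partition=c10m64 --cpus-per-task=10 --mem=64GB --pty /bin/bash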

Current Limits

A total of 20,000 CPUs are available at any given time, shared among all active bursting users.

Storage

Greene's /home and /scratch are mounted (available) on the login node of the bursting setup.

Compute nodes, however, have independent /home and /scratch. These mounts are persistent and available from any burst compute node, but they are separate from /home and /scratch on Greene.

Because the compute nodes do not see Greene's file systems, you need to copy any data your jobs require from Greene's /home or /scratch to the GCP-mounted /home or /scratch.

To copy data, first start an interactive job. Once it starts, you can copy your data with scp from the HPC Data Transfer Nodes (greene-dtn). The basic command to copy files from Greene to your burst home directory while you are in an interactive bursting job is:

scp <NetID>@greene-dtn.hpc.nyu.edu:/path/to/files /home/<NetID>/
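When your work finishes, you can copy results back the same way; the paths below are placeholders, and -r copies whole directories:

scp -r /scratch/<NetID>/results <NetID>@greene-dtn.hpc.nyu.edu:/scratch/<NetID>/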

Visualization Workstations

The burst cluster includes a partition (nvgrid) that can be used to run graphical applications on NVIDIA GPUs for visualization purposes.  You can use this partition by following the instructions below. 

Add an entry like the following to the ~/.ssh/config file on your local computer so that you can reach the burst login node directly from your machine (connections are proxied through Greene):

Host burst
  HostName burst.hpc.nyu.edu
  User <NetID>
  ProxyJump <NetID>@greene.hpc.nyu.edu
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR

Log in to the burst login node (ssh burst from Greene, as described above) and start an interactive job on the nvgrid partition with a GPU:

srun --account=hpc --partition=nvgrid --gres=gpu:p100:1 --pty /bin/bash

Once the job starts, launch a TurboVNC server on the compute node:

/opt/TurboVNC/bin/vncserver

Use squeue to find the hostname of the compute node running your job (b-23-1 in this example):

[jp6546@b-23-1 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             92727    nvgrid     bash  jp6546  R       2:55      1 b-23-1

On your local computer, open an SSH tunnel to the VNC port on that node (the first VNC session on a node is display :1, which is TCP port 5901), replacing <Hostname> with the node name reported by squeue:

ssh -N -L 5901:<Hostname>:5901 <NetID>@burst

This command lets you connect to the remote desktop service from your local computer; leave it running while you use the session.
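With the tunnel open, point a VNC client on your local computer at the forwarded port. For example, with TurboVNC installed locally (the exact client and how you launch it depend on your machine):

vncviewer localhost:1

Here :1 is VNC display 1, which corresponds to port 5901.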

Alternatively, you can start the VNC server from a batch job instead of an interactive session:

#!/bin/bash
#SBATCH --gres=gpu:p100:1
#SBATCH --partition=nvgrid
#SBATCH --account=hpc
#SBATCH --job-name=vnc
#SBATCH --time=1:00:00
#SBATCH --output=slurm_%j.out

# Start the VNC server, then keep the job alive for one hour
/opt/TurboVNC/bin/vncserver

sleep 3600
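Once the batch job is running, use squeue to find the node it was assigned to and open the SSH tunnel from your local computer as shown above. The script above requests one hour (--time=1:00:00) and keeps the VNC server alive with sleep 3600; adjust both values if you need a longer session.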