Google Cloud Bursting

Current Bursting Status

In some cases, HPC may provide bursting capabilities to researchers or classes in order to augment the available on-premises resources. Bursting is ideal when you need a large amount of resources for a short period of time.

Bursting is made possible by running a scalable Slurm cluster on Google Cloud Platform (GCP), separate from the on-premises HPC clusters.

Bursting is not available to all users and requires advance approval. To get access to these capabilities, please contact hpc@nyu.edu to check your eligibility. Let us know the amount of storage, the total number of CPUs, the amount of memory, the number of GPUs, the number of days you require access, and the estimated total CPU/GPU hours you will use. For reference, please review the GCP cost calculator and send a copy of your cost estimate to hpc@nyu.edu as well.
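As a simple worked example of estimating usage: a job that keeps 100 CPUs busy for 2 days (48 hours) corresponds to 100 × 48 = 4,800 CPU hours.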

To request access to the HPC Bursting capabilities, please complete this form.

Running a Bursting Job

Note: bursting is not public; it is available only by request to eligible classes or researchers.

ssh <NetID>@greene.hpc.nyu.edu

Then ssh to the burst login node on GCP. Anyone can log in, but you can only submit jobs if you have approval:

ssh burst 

Start an interactive job 

srun --account=hpc --partition=interactive --pty /bin/bash

If you get the error "Invalid account or account/partition combination specified", your account is not approved to use cloud bursting.

Once your files are copied to the bursting instance, you can run a batch job from the interactive session, for example with a script like the sketch below.
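A minimal batch script sketch is shown below. The job name, resource values, and <partition> placeholder are not taken from this page; substitute the account and partition you were approved for.

#!/bin/bash
#SBATCH --account=hpc
#SBATCH --partition=<partition>
#SBATCH --job-name=myjob
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=1:00:00
#SBATCH --output=slurm_%j.out

# Replace with the commands your job should run on the burst compute node
echo "Running on $(hostname)"

Save it as, for example, myjob.sbatch (a hypothetical name) and submit it from the interactive session with sbatch myjob.sbatch.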

Access to Slurm Partitions

In the example above, the partition "interactive" is used.

You can list the current partitions with the command

sinfo

However, approval is required to submit jobs to a partition. Partitions are defined by the resources available to a job, such as the number of CPUs, the amount of memory, and the number of GPUs. Please email hpc@nyu.edu to request access to a specific partition or to request a new partition (e.g. 10 CPUs and 64 GB of memory) for better cost/performance for your jobs.
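For example, once you have been granted access to a partition sized for your job, you can request it explicitly; the partition name c10m64 below is hypothetical and only illustrates the pattern:

srun --account=hpc --partition=c10m64 --cpus-per-task=10 --mem=64GB --pty /bin/bash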

Current Limits

A total of 20,000 CPUs are available at any given time, shared among all active bursting users.

Storage

Greene's /home and /scratch are mounted (available) on the login node of the bursting setup.

Compute nodes, however, have independent /home and /scratch. These mounts are persistent and available from any burst compute node, but they are separate from /home and /scratch on Greene.

Because the compute nodes do not see Greene's file systems, you need to copy any data your jobs require from Greene's /home or /scratch to the GCP-mounted /home or /scratch.

To copy data, first start an interactive job. Once it starts, you can copy your data with scp from the HPC Data Transfer Nodes (greene-dtn). The basic command to copy files from Greene to your burst home directory while you are in an interactive bursting job is:

scp <NetID>@greene-dtn.hpc.nyu.edu:/path/to/files /home/<NetID>/
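When your work finishes, you can copy results back the same way; the paths below are placeholders, and -r copies whole directories:

scp -r /scratch/<NetID>/results <NetID>@greene-dtn.hpc.nyu.edu:/scratch/<NetID>/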

Visualization Workstations

The burst cluster includes a partition (nvgrid) that can be used to run graphical applications on NVIDIA GPUs for visualization purposes.  You can use this partition by following the instructions below. 

Add an entry like the following to the ~/.ssh/config file on your local computer so that you can reach the burst login node directly from your machine (connections are proxied through Greene):

Host burst
  HostName burst.hpc.nyu.edu
  User <NetID>
  ProxyJump <NetID>@greene.hpc.nyu.edu
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
  LogLevel ERROR

Log in to the burst login node (ssh burst from Greene, as described above) and start an interactive job on the nvgrid partition with a GPU:

srun --account=hpc --partition=nvgrid --gres=gpu:p100:1 --pty /bin/bash

Once the job starts, launch a TurboVNC server on the compute node:

/opt/TurboVNC/bin/vncserver

Use squeue to find the hostname of the compute node running your job (b-23-1 in this example):

[jp6546@b-23-1 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             92727    nvgrid     bash  jp6546  R       2:55      1 b-23-1

On your local computer, open an SSH tunnel to the VNC port on that node (the first VNC session on a node is display :1, which is TCP port 5901), replacing <Hostname> with the node name reported by squeue:

ssh -N -L 5901:<Hostname>:5901 <NetID>@burst

This command lets you connect to the remote desktop service from your local computer; leave it running while you use the session.
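With the tunnel open, point a VNC client on your local computer at the forwarded port. For example, with TurboVNC installed locally (the exact client and how you launch it depend on your machine):

vncviewer localhost:1

Here :1 is VNC display 1, which corresponds to port 5901.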

Alternatively, you can start the VNC server from a batch job instead of an interactive session:

#!/bin/bash
#SBATCH --gres=gpu:p100:1
#SBATCH --partition=nvgrid
#SBATCH --account=hpc
#SBATCH --job-name=vnc
#SBATCH --time=1:00:00
#SBATCH --output=slurm_%j.out

# Start the VNC server, then keep the job alive for one hour
/opt/TurboVNC/bin/vncserver

sleep 3600
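Once the batch job is running, use squeue to find the node it was assigned to and open the SSH tunnel from your local computer as shown above. The script above requests one hour (--time=1:00:00) and keeps the VNC server alive with sleep 3600; adjust both values if you need a longer session.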