This documentation is for early access Torch users and provides basic guidance on using the system.
OS: Red Hat Enterprise Linux 9.6
GPU Driver: 580.82.07 (supports up to CUDA 13.0)
Log in to Greene:
ssh greene.hpc.nyu.edu
From Greene, connect to Torch:
ssh cs649
There is no shared filesystem between Greene and Torch.
Always transfer data manually between systems using the NYU HPC internal network.
To move data onto Torch, first upload it to Greene.
From Greene to Torch:
scp -rp dtn-1:/scratch/<netID>/target-folder <netID>@cs649:/path/on/torch
From Torch to Greene:
scp -rp /path/on/torch dtn-1:/scratch/<netID>/target-folder
You can also move files from Torch to the Greene archive. While on Torch, run something like this:
scp -rp /path/on/torch dtn-1:/archive/<netID>/target-folder
Hardware Information is available on our Torch Announcement Page.
As with Greene, each user is limited by the following job constraints:
24 GPUs in total for jobs with wall time less than 48 hours.
4 GPUs in total for jobs with wall time more than 48 hours.
We’ve updated the Slurm configuration to enable stakeholder partitions. The h200 and l40s partitions are preemptible, meaning jobs running there may be canceled to make room for stakeholder jobs.
Please do not specify partitions directly—just request the compute resources you need.
If you can use either L40S or H200 GPUs, request them with:
--gres=gpu:1 --constraint="l40s|h200"
If you are willing to run on preemptible partitions (h200 and l40s), please add:
--comment="preemption=yes;requeue=yes"
Preempted jobs will be requeued automatically, so please make sure these jobs can restart from checkpoint/restart files.
Example GPU job submissions (H200 only, L40S only, or either GPU type):
sbatch --gres=gpu:1 --constraint=h200 --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
sbatch --gres=gpu:1 --constraint=l40s --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
sbatch --gres=gpu:1 --constraint="h200|l40s" --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
Warning: srun interactive jobs are not supported for GPU jobs at this time.
Warning: Slurm email notification is currently unavailable.
Job Requeuing: Enable --requeue and implement checkpointing, since jobs may be preempted.
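For example, a submission that opts in to preemption and requeueing could combine the flags above as in the sketch below; train.sbatch is a placeholder name for your own checkpoint-aware job script:
sbatch --gres=gpu:1 --constraint="h200|l40s" --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--requeue --comment="preemption=yes;requeue=yes" \
train.sbatch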
Singularity is available on Torch at:
/share/apps/apptainer/bin/singularity
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
Use read-only mode for production jobs.
Writable overlay files (e.g., for Conda) are not reliable.
Images and wrapper scripts are available in:
/share/apps/images
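A minimal usage sketch follows; the image name cuda-base.sif and the command python train.py are placeholders, so substitute a real image from /share/apps/images and your own command:
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
# Run the container read-only (no --writable or writable overlay); --nv exposes the node's GPUs
/share/apps/apptainer/bin/singularity exec --nv \
    /share/apps/images/cuda-base.sif python train.py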
If you already have container-based Conda environments on Greene, they can be copied to Torch with minor edits.
We strongly recommend container-based setups (Singularity/Apptainer) on Torch.
Warning: Avoid installing packages directly on the host system.
The OS and system libraries will be updated (e.g., to RHEL 10), which will likely break host-installed software.
Containers are more robust and portable.
To use VS Code, follow the instructions below.
Below is an example configuration for your local ~/.ssh/config:
Host torch
HostName cs649
User sw77 # Update with your NetID
ProxyJump sw77@greene.hpc.nyu.edu # Update with your NetID
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel ERROR
Make sure your local ~/.ssh/id_rsa.pub is appended to ~/.ssh/authorized_keys on both Greene and Torch.
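One way to do this, assuming the config above and an existing ~/.ssh/id_rsa key pair:
ssh-copy-id -i ~/.ssh/id_rsa.pub <netID>@greene.hpc.nyu.edu   # install the key on Greene
ssh-copy-id -i ~/.ssh/id_rsa.pub torch                        # install the key on Torch via the ProxyJump entry
ssh torch                                                     # verify that the connection works end to end
VS Code's Remote-SSH extension can then connect using the torch host entry directly.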
Torch implements preemptible partitions for GPU jobs. This means the following:
Stakeholder jobs may preempt public jobs when needed.
Use --requeue and implement checkpointing to resume jobs safely.
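A minimal checkpoint-aware batch script could look like the sketch below; the checkpoint path and the python commands are placeholders for your own workflow:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --constraint="h200|l40s"
#SBATCH --cpus-per-task=8
#SBATCH --mem=20GB
#SBATCH --time=04:00:00
#SBATCH --requeue
#SBATCH --comment="preemption=yes;requeue=yes"

CKPT=/scratch/<netID>/myjob/checkpoint.pt      # placeholder checkpoint location on Torch
if [ -f "$CKPT" ]; then
    python train.py --resume "$CKPT"           # requeued after preemption: resume from checkpoint
else
    python train.py --save-checkpoint "$CKPT"  # first run: write checkpoints as you go
fi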
Please report any issues or questions via email at hpc@nyu.edu.