This documentation is for early access Torch users and provides basic guidance on using the system.
OS: Red Hat Enterprise Linux 9.6
GPU Driver: 580.82.07 (supports up to CUDA 13.0)
Log in to Greene:
ssh greene.hpc.nyu.edu
From Greene, connect to Torch:
ssh cs649
There is no shared filesystem between Greene and Torch.
Always transfer data manually between systems using the NYU HPC internal network.
To move data onto Torch, first upload it to Greene.
From Greene to Torch:
scp -rp dtn-1:/scratch/<netID>/target-folder <netID>@cs649:/path/on/torch
From Torch to Greene:
scp -rp /path/on/torch dtn-1:/scratch/<netID>/target-folder
You can also move files from Torch to the Greene archive. While on Torch, run something like this:
scp -rp /path/on/torch dtn-1:/archive/<netID>/target-folder
Hardware Information is available on our Torch Announcement Page.
As with Greene, each user is limited by the following job constraints:
24 GPUs in total for jobs with wall time less than 48 hours.
4 GPUs in total for jobs with wall time more than 48 hours.
We’ve updated the Slurm configuration to enable stakeholder partitions. The h200 and l40s partitions are preemptible, meaning jobs running there may be canceled to make room for stakeholder jobs.
Please do not specify partitions directly—just request the compute resources you need.
If you can use either L40S or H200 GPUs, request them with:
--gres=gpu:1 --constraint="l40s|h200"
If you are willing to run on preemptible partitions (h200 and l40s), please add:
--comment="preemption=yes;requeue=yes"
Preempted jobs will be requeued automatically, so please make sure these jobs can restart from checkpoint/restart files.
Example GPU job submissions (H200 only, L40S only, or either GPU type):
sbatch --gres=gpu:1 --constraint=h200 --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
sbatch --gres=gpu:1 --constraint=l40s --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
sbatch --gres=gpu:1 --constraint="h200|l40s" --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
Warning: srun interactive jobs are not supported for GPU jobs at this time.
Warning: Slurm email notification is currently unavailable.
Job Requeuing: Enable --requeue and implement checkpointing, since jobs may be preempted.
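For example, a submission that opts in to preemption and requeueing could combine the flags above as in the sketch below; train.sbatch is a placeholder name for your own checkpoint-aware job script:
sbatch --gres=gpu:1 --constraint="h200|l40s" --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--requeue --comment="preemption=yes;requeue=yes" \
train.sbatch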
Singularity is available on Torch at:
/share/apps/apptainer/bin/singularity
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
Use read-only mode for production jobs.
Writable overlay files (e.g., for Conda) are not reliable.
Images and wrapper scripts are available in:
/share/apps/images
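A minimal usage sketch follows; the image name cuda-base.sif and the command python train.py are placeholders, so substitute a real image from /share/apps/images and your own command:
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
# Run the container read-only (no --writable or writable overlay); --nv exposes the node's GPUs
/share/apps/apptainer/bin/singularity exec --nv \
    /share/apps/images/cuda-base.sif python train.py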
If you already have container-based Conda environments on Greene, they can be copied to Torch with minor edits.
We strongly recommend container-based setups (Singularity/Apptainer) on Torch.
Warning: Avoid installing packages directly on the host system.
The OS and system libraries will be updated (e.g., to RHEL 10), which will likely break host-installed software.
Containers are more robust and portable.
To use VS Code, follow the instructions below.
Below is an example configuration for your local ~/.ssh/config:
Host torch
HostName cs649
User sw77 # Update with your NetID
ProxyJump sw77@greene.hpc.nyu.edu # Update with your NetID
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel ERROR
Make sure your local ~/.ssh/id_rsa.pub is appended to ~/.ssh/authorized_keys on both Greene and Torch.
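One way to do this, assuming the config above and an existing ~/.ssh/id_rsa key pair:
ssh-copy-id -i ~/.ssh/id_rsa.pub <netID>@greene.hpc.nyu.edu   # install the key on Greene
ssh-copy-id -i ~/.ssh/id_rsa.pub torch                        # install the key on Torch via the ProxyJump entry
ssh torch                                                     # verify that the connection works end to end
VS Code's Remote-SSH extension can then connect using the torch host entry directly.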
Torch implements preemptible partitions for GPU jobs. This means the following:
Stakeholder jobs may preempt public jobs when needed.
Use --requeue and implement checkpointing to resume jobs safely.
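A minimal checkpoint-aware batch script could look like the sketch below; the checkpoint path and the python commands are placeholders for your own workflow:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --constraint="h200|l40s"
#SBATCH --cpus-per-task=8
#SBATCH --mem=20GB
#SBATCH --time=04:00:00
#SBATCH --requeue
#SBATCH --comment="preemption=yes;requeue=yes"

CKPT=/scratch/<netID>/myjob/checkpoint.pt      # placeholder checkpoint location on Torch
if [ -f "$CKPT" ]; then
    python train.py --resume "$CKPT"           # requeued after preemption: resume from checkpoint
else
    python train.py --save-checkpoint "$CKPT"  # first run: write checkpoints as you go
fi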
Please report any issues or questions via email at hpc@nyu.edu.