This documentation is for early access Torch users and provides basic guidance on using the system.
OS: Red Hat Enterprise Linux 9.6
GPU Driver: 580.82.07 (supports up to CUDA 13.0)
See Torch-specific instructions here.
There is no shared filesystem between Greene and Torch.
Always transfer data manually between systems using the NYU HPC internal network.
Torch has a dedicated data transfer node (DTN) that can be used to upload data. You must authenticate following the same steps outlined on the Accessing Torch page.
netid@localpc ~ $ rsync -avz -e ssh ~/testfile.txt netid@dtn012.hpc.nyu.edu:/scratch/netid/testfile.txt
(netid@dtn012.hpc.nyu.edu) Authenticate with PIN BUUM5YHJX at https://microsoft.com/devicelogin and press ENTER.
testfile.txt 100% 0 0.0KB/s 00:00
netid@localpc ~ $
Globus has been installed for Torch - you can log in at https://globus.org/ and search for the Torch home or scratch collections.
Globus works much the same as it did on Greene, so see our previous page on how to use it. Once you install Globus Connect Personal on your personal device you can easily transfer files between your local machine and Torch.
To ease data transfer between clusters, Torch storage has been mounted on the Greene Data Transfer Nodes (e.g. dtn-1.hpc.nyu.edu and dtn-2.hpc.nyu.edu).
This allows users to access the data under /torch:
[netid@dtn-1]$ ls /torch
archive home scratch share
You can copy files using the rsync command from the Greene DTN:
[netid@dtn-1]$ rsync /scratch/netid/my_file.txt /torch/scratch/netid/my_file.txt
With upgrades to the NFS storage system, setfacl and getfacl no longer work. Please switch to using the nfs4_setfacl and nfs4_getfacl tools instead.
nfs4_setfacl – This is the main command that you will use. It is used to add, remove, or modify the ACL of a file. There are 4 options of real interest, though there are others (see the nfs4_setfacl(1) manual page, or run the command with -H to see all available options).
-a – This option tells nfs4_setfacl to add the specified Access Control Entry (ACE - defined below). Basically, this adds a new rule.
-x – This option causes nfs4_setfacl to remove the specified ACE. Note that it must match the existing entry exactly. Usually, to remove an entry, it is easier to invoke nfs4_setfacl with the -e switch, or to run nfs4_getfacl and copy/paste the line you'd like to remove.
-e – This switch, instead of directly modifying the ACL, puts you into a file editor with the ACL, so that you can add/remove/modify all the entries at once. Note that it puts you into whichever editor is specified in your EDITOR environment variable (run echo $EDITOR to see what yours is set to), or vi if none is specified.
--test – This switch tells nfs4_setfacl not to actually modify the ACL, but to print out what it would be after applying the operation you specified.
nfs4_getfacl – This command is very simple: it prints out the ACL of the file or directory you give it. Note that it can only take one file/directory at a time. See the nfs4_getfacl(1) manual page for more info.
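As a sketch of how the options above combine (the username, domain, and directory are placeholders; the exact principal format on Torch may differ):

```shell
# Show the current ACL of a directory:
nfs4_getfacl /scratch/netid/my_dir

# Add an ACE (-a) granting a collaborator read and execute access:
nfs4_setfacl -a A::collaborator@nyu.edu:RX /scratch/netid/my_dir

# Preview a change without applying it:
nfs4_setfacl --test -a A::collaborator@nyu.edu:RX /scratch/netid/my_dir

# Remove the same ACE (-x); the entry must match exactly:
nfs4_setfacl -x A::collaborator@nyu.edu:RX /scratch/netid/my_dir
```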
You can move files from Torch to the Greene archive. While on Torch, you can run something like this:
scp -rp /path/on/torch dtn-1:/archive/<netID>/target-folder
Storage quotas have been applied on Torch:
/home: 50 GB and 30K inodes (same as on Greene)
/scratch: 5 TB and 5M inodes
/archive: 2 TB and 20K inodes
We are currently developing a myquota command for Torch to help users check their usage, which will be available soon.
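Until myquota is available, a rough stand-in is du, which can approximate both space and inode usage against the quotas above (on Torch, point it at your /home, /scratch, or /archive directory):

```shell
# Total space used under a directory:
du -sh "$HOME"

# File count (inodes); --inodes requires GNU coreutils 8.22+:
du -s --inodes "$HOME"
```

Note that du walks the whole tree, so it can be slow on large directories.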
Hardware Information is available on our Torch Announcement Page.
All HPC sponsors (PIs) must register HPC projects at https://projects.hpc.nyu.edu/ in order to submit slurm jobs on Torch.
Details on how to set up and configure an HPC Project are outlined on the new documentation website.
Please see here for details on how to use the new HPC Projects with Slurm Jobs on the Torch cluster.
On Torch, there are 232 H200 GPUs, contributed primarily by stakeholders from Courant, Tandon, CDS, and individual PIs, with 24 GPUs available for public use. In addition, the cluster includes 272 L40S GPUs, most of which are available for public use.
Currently, H200 GPUs are oversubscribed, while L40S GPUs remain largely idle. If your workloads can run on L40S GPUs, please be flexible in using either GPU type by adding the following Slurm request lines:
#SBATCH --gres=gpu:1
#SBATCH --constraint="l40s|h200"
GPU Job Limits
As with Greene, each user is limited by the following job constraints:
24 GPUs in total for jobs with wall time under 48 hours
4 GPUs in total for jobs with wall time over 48 hours
We’ve updated the Slurm configuration to enable stakeholder partitions. The h200 and l40s partitions are preemptible, meaning jobs running there may be canceled to make room for stakeholder jobs.
Please do not specify partitions directly—just request the compute resources you need.
If you can use either L40S or H200 GPUs, request them with:
--gres=gpu:1 --constraint="l40s|h200"
The new Slurm configuration introduces pre-emption partitions, allowing:
Non-stakeholders to temporarily use stakeholder resources
One stakeholder group to use another’s resources
Stakeholder users will continue to have normal access to their own resources. When non-stakeholders (or other stakeholders) use these resources, their jobs may be preempted—that is, cancelled to free up GPUs or CPUs once stakeholders submit jobs that require them.
Jobs are eligible for pre-emption only after running for one hour. They will not be cancelled within the first hour.
By default, jobs are not submitted to pre-emption partitions. To enable pre-emption and automatic requeueing, add:
#SBATCH --comment="preemption=yes;requeue=true"
This setting allows jobs to run in both normal and pre-emption partitions. Jobs in stakeholder partitions will not be cancelled, while those in pre-emption partitions may be.
To use only pre-emption partitions, specify:
#SBATCH --comment="preemption=yes;preemption_partitions_only=yes;requeue=true"
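Putting the pieces together, a minimal preemption-friendly job script might look like this (the resource numbers are illustrative, and train.py is a placeholder for your own workload):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --constraint="l40s|h200"
#SBATCH --requeue
#SBATCH --comment="preemption=yes;requeue=true"
#SBATCH --cpus-per-task=8
#SBATCH --mem=20GB
#SBATCH --time=04:00:00

# The workload should be able to resume from a checkpoint,
# since this job may be preempted after its first hour.
srun python train.py
```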
Pre-emption partitions are also available for CPU-only jobs. More details will be covered in the upcoming Torch Test User HPC Tutorial on October 27.
Request an H200 GPU:
sbatch --gres=gpu:1 --constraint=h200 --nodes=1 --tasks-per-node=1 \
    --cpus-per-task=8 --mem=20GB --time=04:00:00 \
    --wrap "hostname && sleep infinity"
Request an L40S GPU:
sbatch --gres=gpu:1 --constraint=l40s --nodes=1 --tasks-per-node=1 \
    --cpus-per-task=8 --mem=20GB --time=04:00:00 \
    --wrap "hostname && sleep infinity"
Request either GPU type:
sbatch --gres=gpu:1 --constraint="h200|l40s" --nodes=1 --tasks-per-node=1 \
    --cpus-per-task=8 --mem=20GB --time=04:00:00 \
    --wrap "hostname && sleep infinity"
Warning: srun interactive jobs are not supported for GPU jobs at this time.
Warning: Slurm email notification is currently unavailable.
Job Requeueing: enable --requeue and implement checkpointing, since jobs may be preempted.
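A common checkpointing pattern is to catch the SIGTERM that Slurm sends before cancelling a job, save state, and let --requeue restart the job from that state. A minimal sketch (the checkpoint path and step counts are made up; real jobs would checkpoint model or application state instead of a counter):

```shell
#!/bin/bash
# Resume from the checkpoint file if one exists (e.g. after a requeue).
CKPT=${CKPT:-/tmp/demo_ckpt.txt}
step=0
[ -f "$CKPT" ] && step=$(cat "$CKPT")

# On SIGTERM (sent by Slurm before preemption), save state and exit.
trap 'echo "$step" > "$CKPT"; exit 143' TERM

# Stand-in for real work: advance and checkpoint each step.
while [ "$step" -lt 5 ]; do
  step=$((step + 1))
  echo "$step" > "$CKPT"
done
```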
Singularity is available on Torch at:
/share/apps/apptainer/bin/singularity
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
Use read-only mode for production jobs.
Writable overlay files (e.g., for Conda) are not reliable.
Images and wrapper scripts are available in:
/share/apps/images
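For example, a container run might look like this (the image name is a placeholder; check /share/apps/images for what is actually available):

```shell
# Bind the standard paths, then run a command inside an image read-only.
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
/share/apps/apptainer/bin/singularity exec --nv \
    /share/apps/images/some-image.sif \
    python -c 'print("hello from the container")'
```

The --nv flag exposes the host's NVIDIA GPU driver inside the container; omit it for CPU-only work.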
If you already have container-based Conda environments on Greene, they can be copied to Torch with minor edits.
We strongly recommend container-based setups (Singularity/Apptainer) on Torch.
Warning: Avoid installing packages directly on the host system.
The OS and system libraries will be updated (e.g., to RHEL 10), which will likely break host-installed software.
Containers are more robust and portable.
Torch implements preemptible partitions for GPU jobs. This means the following:
Stakeholder jobs may preempt public jobs when needed.
Use --requeue and implement checkpointing to resume jobs safely.
Please report any issues or questions by email to hpc@nyu.edu.