This documentation is for early access Torch users and provides basic guidance on using the system.
OS: Red Hat Enterprise Linux 9.6
GPU Driver: 580.82.07 (supports up to CUDA 13.0)
See Torch-specific instructions here.
There is no shared filesystem between Greene and Torch.
Always transfer data manually between systems using the NYU HPC internal network.
To move data onto Torch, first upload it to Greene, then copy it over the internal network as shown below.
From Greene to Torch:
scp -rp dtn-1:/scratch/<netID>/target-folder <netID>@cs649:/path/on/torch
From Torch to Greene:
scp -rp /path/on/torch dtn-1:/scratch/<netID>/target-folder
You can also move files from Torch to the Greene archive. While on Torch, you can run something like this:
scp -rp /path/on/torch dtn-1:/archive/<netID>/target-folder
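For large or repeated transfers, rsync can resume where an interrupted copy left off. Assuming rsync is available on Torch, a rough equivalent of the archive copy above, using the same illustrative paths:
rsync -avP /path/on/torch dtn-1:/archive/<netID>/target-folder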
As of 10/16/2025, storage quotas have been applied on Torch:
/home: 50 GB and 30K inodes (same as on Greene)
/scratch: 5 TB and 5M inodes
We are currently developing a myquota command for Torch to help users check their usage, which will be available soon.
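Until myquota is available, you can get a rough picture of your usage with standard tools. A sketch (these scans count files directly, can be slow on large trees, and may not exactly match the enforced quota):
du -sh /scratch/$USER        # approximate space used under your scratch directory
find /scratch/$USER | wc -l  # approximate file (inode) count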
Hardware Information is available on our Torch Announcement Page.
All HPC sponsors (PIs) must register HPC projects at https://projects.hpc.nyu.edu/.
Each project automatically creates a corresponding Slurm account on Torch.
All Torch users will need to specify a Slurm account when submitting jobs to ensure accurate utilization reporting. Access to stakeholder resources will be managed through ColdFront.
For Tandon users who already have project-related Slurm accounts on Greene, the corresponding projects have been migrated to ColdFront. However, the Slurm account names have changed. Please use the my_slurm_accounts command to view the list of Slurm accounts available to you.
Going forward, please switch to using your project’s Slurm account. Once Torch enters production, all HPC users will be required to use project-linked Slurm accounts. These accounts can also be used for CPU-only jobs on public resources, though without high priority — similar to general users.
Once an account has been created, allocated, and assigned, you can specify it in your jobs with the --account flag, for example:
#SBATCH --account=<my_slurm_account>
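For example, a minimal batch script that charges a project account (the account name and resource values below are placeholders; run my_slurm_accounts to see the accounts available to you):
#!/bin/bash
#SBATCH --account=<my_slurm_account>
#SBATCH --job-name=example
#SBATCH --cpus-per-task=1
#SBATCH --mem=4GB
#SBATCH --time=01:00:00
hostname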
On Torch, there are 232 H200 GPUs, contributed primarily by stakeholders from Courant, Tandon, CDS, and individual PIs, with 24 GPUs available for public use. In addition, the cluster includes 272 L40S GPUs, most of which are available for public use.
Currently, H200 GPUs are oversubscribed, while L40S GPUs remain largely idle. If your workloads can run on L40S GPUs, please be flexible in using either GPU type by adding the following Slurm request lines:
#SBATCH --gres=gpu:1
#SBATCH --constraint="l40s|h200"
GPU Job Limits
As with Greene, each user is limited by the following job constraints:
24 GPUs in total for jobs with a wall time of less than 48 hours
4 GPUs in total for jobs with a wall time of more than 48 hours.
We’ve updated the Slurm configuration to enable stakeholder partitions. The h200 and l40s partitions are preemptible, meaning jobs running there may be cancelled to make room for stakeholder jobs.
Please do not specify partitions directly—just request the compute resources you need.
If you can use either L40S or H200 GPUs, request them with:
--gres=gpu:1 --constraint="l40s|h200"
The new Slurm configuration introduces pre-emption partitions, allowing:
Non-stakeholders to temporarily use stakeholder resources
One stakeholder group to use another’s resources
Stakeholder users will continue to have normal access to their own resources. When non-stakeholders (or other stakeholders) use these resources, their jobs may be preempted—that is, cancelled to free up GPUs or CPUs once stakeholders submit jobs that require them.
Jobs are eligible for pre-emption only after running for one hour. They will not be cancelled within the first hour.
By default, jobs are not submitted to pre-emption partitions. To enable pre-emption and automatic requeueing, add:
#SBATCH --comment="preemption=yes;requeue=true"
This setting allows jobs to run in both normal and pre-emption partitions. Jobs in stakeholder partitions will not be cancelled, while those in pre-emption partitions may be.
To use only pre-emption partitions, specify:
#SBATCH --comment="preemption=yes;preemption_partitions_only=yes;requeue=true"
Pre-emption partitions are also available for CPU-only jobs. More details will be covered in the upcoming Torch Test User HPC Tutorial on October 27.
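For instance, a GPU job that opts in to preemption with automatic requeueing might look like the following sketch (resource values are illustrative, and the final command is a placeholder for your workload):
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --constraint="l40s|h200"
#SBATCH --cpus-per-task=8
#SBATCH --mem=20GB
#SBATCH --time=08:00:00
#SBATCH --requeue
#SBATCH --comment="preemption=yes;requeue=true"
hostname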
Example test submissions, requesting an H200 only:
sbatch --gres=gpu:1 --constraint=h200 --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
An L40S only:
sbatch --gres=gpu:1 --constraint=l40s --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
Either GPU type:
sbatch --gres=gpu:1 --constraint="h200|l40s" --nodes=1 --tasks-per-node=1 \
--cpus-per-task=8 --mem=20GB --time=04:00:00 \
--wrap "hostname && sleep infinity"
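Standard Slurm commands can be used to monitor and clean up these test jobs, for example:
squeue -u $USER    # list your pending and running jobs
scancel <jobid>    # cancel a job once you are done with it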
Warning: srun interactive jobs are not supported for GPU jobs at this time.
Warning: Slurm email notification is currently unavailable.
Job Requeuing: Enable --requeue and implement checkpointing, since jobs may be preempted.
Singularity (Apptainer) is available on Torch at:
/share/apps/apptainer/bin/singularity
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
Use read-only mode for production jobs.
Writable overlay files (e.g., for Conda) are not reliable.
Images and wrapper scripts are available in:
/share/apps/images
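As a sketch of typical usage with the paths above, assuming a GPU job with an overlay file and an image in your scratch space (the overlay and image names and the command run inside the container are illustrative):
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
/share/apps/apptainer/bin/singularity exec --nv \
    --overlay /scratch/<netID>/my_env_overlay.ext3:ro \
    /scratch/<netID>/my_image.sif \
    python my_script.py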
If you already have container-based Conda environments on Greene, they can be copied to Torch with minor edits.
We strongly recommend container-based setups (Singularity/Apptainer) on Torch.
Warning: Avoid installing packages directly on the host system.
The OS and system libraries will be updated (e.g., to RHEL 10), which will likely break host-installed software.
Containers are more robust and portable.
For VS Code use, follow these instructions.
Below is an example configuration for your local ~/.ssh/config:
Host torch
    # Replace <netID> with your NetID on the two lines below
    HostName cs649
    User <netID>
    ProxyJump <netID>@greene.hpc.nyu.edu
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
    LogLevel ERROR
Make sure your local ~/.ssh/id_rsa.pub is appended to ~/.ssh/authorized_keys on both Greene and Torch.
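If you have not set up key-based login yet, a sketch using standard OpenSSH tools from your local machine (this relies on the Host torch entry above; ssh-copy-id appends your public key to authorized_keys on the remote side):
ssh-keygen -t rsa                        # accept the defaults to create ~/.ssh/id_rsa and id_rsa.pub
ssh-copy-id <netID>@greene.hpc.nyu.edu   # install the key on Greene
ssh-copy-id torch                        # install the key on Torch via the ProxyJump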
Torch will implement preemptible partitions for GPU jobs. This means the following:
Stakeholder jobs may preempt public jobs when needed.
Use --requeue and implement checkpointing to resume jobs safely.
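A sketch of one common requeue-plus-checkpoint pattern (not a site-provided template: the training command is a placeholder, and whether Slurm delivers a termination signal before cancelling a preempted job depends on the cluster configuration):
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --constraint="l40s|h200"
#SBATCH --time=24:00:00
#SBATCH --requeue
#SBATCH --comment="preemption=yes;requeue=true"

# If the scheduler signals the job before cancelling it, forward the signal so
# the workload can flush its current checkpoint before the job is requeued.
trap 'kill -TERM "$PID"; wait "$PID"' TERM

# Placeholder workload: it should write checkpoints periodically and resume
# from the newest one when the requeued job starts over.
python train.py --resume &
PID=$!
wait "$PID"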
Please report any issues or questions via email at hpc@nyu.edu.