This documentation is for early access Torch users and provides basic guidance on using the system.
OS: Red Hat Enterprise Linux 9.6
GPU Driver: 580.82.07 (supports up to CUDA 13.0)
See Torch-specific instructions here.
There is no shared filesystem between Greene and Torch.
Always transfer data manually between systems using the NYU HPC internal network.
Torch has a dedicated data transfer node (DTN) that can be used to upload data. You must authenticate following the same steps outlined on the Accessing Torch page.
netid@localpc ~ $ rsync -avz -e ssh ~/testfile.txt netid@dtn012.hpc.nyu.edu:/scratch/netid/testfile.txt
(netid@dtn012.hpc.nyu.edu) Authenticate with PIN BUUM5YHJX at https://microsoft.com/devicelogin and press ENTER.
testfile.txt 100% 0 0.0KB/s 00:00
netid@localpc ~ $
Globus has been installed for Torch - you can log in at https://globus.org/ and search for the Torch home or scratch collections.
Globus works much the same as it did on Greene, so see our previous page on how to use it. Once you install Globus Connect Personal on your personal device you can easily transfer files between your local machine and Torch.
To ease data transfer between clusters, Torch storage has been mounted on the Greene Data Transfer Nodes (e.g. dtn-1.hpc.nyu.edu and dtn-2.hpc.nyu.edu).
This allows users to access the data under /torch:
[netid@dtn-1]$ ls /torch
archive home scratch share
You can copy files using the rsync command from the Greene DTN:
[netid@dtn-1]$ rsync /scratch/netid/my_file.txt /torch/scratch/netid/my_file.txt
With upgrades to the NFS storage system, setfacl and getfacl no longer work. Please switch to using the nfs4_setfacl and nfs4_getfacl tools instead.
nfs4_setfacl – This is the main command that you will use. It is used to add, remove, or modify the ACL of a file. There are 4 options of real interest, though there are others (see the nfs4_setfacl(1) manual page, or run the command with -H to see all available options).
-a – This option tells nfs4_setfacl to add the specified Access Control Entry (ACE - defined below). Basically, this adds a new rule.
-x – This option causes nfs4_setfacl to remove the specified ACE. Note that it must match the existing entry exactly. Usually, to remove an entry, it is easier to invoke nfs4_setfacl with the -e switch, or to run nfs4_getfacl and copy/paste the line you'd like to remove.
-e – This switch, instead of directly modifying the ACL, puts you into a file editor with the ACL, so that you can add/remove/modify all the entries at once. Note that it puts you into whichever editor is specified in your EDITOR environment variable (run echo $EDITOR to see what yours is set to), or vi if none is specified.
--test – This switch tells nfs4_setfacl not to actually modify the ACL, but to print out what it would be after applying the operation you specified.
nfs4_getfacl – This command is very simple: it prints out the ACL of the file or directory you give it. Note that it can only take one file/directory at a time. See the nfs4_getfacl(1) manual page for more info.
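As a sketch of how the options above combine (the username, domain, and directory are placeholders; the exact principal format on Torch may differ):

```shell
# Show the current ACL of a directory:
nfs4_getfacl /scratch/netid/my_dir

# Add an ACE (-a) granting a collaborator read and execute access:
nfs4_setfacl -a A::collaborator@nyu.edu:RX /scratch/netid/my_dir

# Preview a change without applying it:
nfs4_setfacl --test -a A::collaborator@nyu.edu:RX /scratch/netid/my_dir

# Remove the same ACE (-x); the entry must match exactly:
nfs4_setfacl -x A::collaborator@nyu.edu:RX /scratch/netid/my_dir
```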
You can move files from Torch to the Greene archive. While on Torch, you can run something like this:
scp -rp /path/on/torch dtn-1:/archive/<netID>/target-folder
Storage quotas have been applied on Torch:
/home: 50 GB and 30K inodes (same as on Greene)
/scratch: 5 TB and 5M inodes
/archive: 2 TB and 20K inodes
We are currently developing a myquota command for Torch to help users check their usage, which will be available soon.
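Until myquota is available, a rough stand-in is du, which can approximate both space and inode usage against the quotas above (on Torch, point it at your /home, /scratch, or /archive directory):

```shell
# Total space used under a directory:
du -sh "$HOME"

# File count (inodes); --inodes requires GNU coreutils 8.22+:
du -s --inodes "$HOME"
```

Note that du walks the whole tree, so it can be slow on large directories.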
Hardware Information is available on our Torch Announcement Page.
All HPC sponsors (PIs) must register HPC projects at https://projects.hpc.nyu.edu/ in order to submit slurm jobs on Torch.
Details on how to set up and configure an HPC Project are outlined on the new documentation website.
Please see here for details on how to use the new HPC Projects with Slurm Jobs on the Torch cluster.
On Torch, there are 232 H200 GPUs, contributed primarily by stakeholders from Courant, Tandon, CDS, and individual PIs, with 24 GPUs available for public use. In addition, the cluster includes 272 L40S GPUs, most of which are available for public use.
Currently, H200 GPUs are oversubscribed, while L40S GPUs remain largely idle. If your workloads can run on L40S GPUs, please be flexible in using either GPU type by adding the following Slurm request lines:
#SBATCH --gres=gpu:1
#SBATCH --constraint="l40s|h200"
GPU Job Limits
As with Greene, each user is limited by the following job constraints:
24 GPUs in total for jobs with wall time under 48 hours
4 GPUs in total for jobs with wall time over 48 hours
We’ve updated the Slurm configuration to enable stakeholder partitions. The h200 and l40s partitions are preemptible, meaning jobs running there may be canceled to make room for stakeholder jobs.
Please do not specify partitions directly—just request the compute resources you need.
If you can use either L40S or H200 GPUs, request them with:
--gres=gpu:1 --constraint="l40s|h200"
The new Slurm configuration introduces pre-emption partitions, allowing:
Non-stakeholders to temporarily use stakeholder resources
One stakeholder group to use another’s resources
Stakeholder users will continue to have normal access to their own resources. When non-stakeholders (or other stakeholders) use these resources, their jobs may be preempted—that is, cancelled to free up GPUs or CPUs once stakeholders submit jobs that require them.
Jobs are eligible for pre-emption only after running for one hour. They will not be cancelled within the first hour.
By default, jobs are not submitted to pre-emption partitions. To enable pre-emption and automatic requeueing, add:
#SBATCH --comment="preemption=yes;requeue=true"
This setting allows jobs to run in both normal and pre-emption partitions. Jobs in stakeholder partitions will not be cancelled, while those in pre-emption partitions may be.
To use only pre-emption partitions, specify:
#SBATCH --comment="preemption=yes;preemption_partitions_only=yes;requeue=true"
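Putting the pieces together, a minimal preemption-friendly job script might look like this (the resource numbers are illustrative, and train.py is a placeholder for your own workload):

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --constraint="l40s|h200"
#SBATCH --requeue
#SBATCH --comment="preemption=yes;requeue=true"
#SBATCH --cpus-per-task=8
#SBATCH --mem=20GB
#SBATCH --time=04:00:00

# The workload should be able to resume from a checkpoint,
# since this job may be preempted after its first hour.
srun python train.py
```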
Pre-emption partitions are also available for CPU-only jobs. More details will be covered in the upcoming Torch Test User HPC Tutorial on October 27.
Request an H200 GPU:
sbatch --gres=gpu:1 --constraint=h200 --nodes=1 --tasks-per-node=1 \
    --cpus-per-task=8 --mem=20GB --time=04:00:00 \
    --wrap "hostname && sleep infinity"
Request an L40S GPU:
sbatch --gres=gpu:1 --constraint=l40s --nodes=1 --tasks-per-node=1 \
    --cpus-per-task=8 --mem=20GB --time=04:00:00 \
    --wrap "hostname && sleep infinity"
Request either GPU type:
sbatch --gres=gpu:1 --constraint="h200|l40s" --nodes=1 --tasks-per-node=1 \
    --cpus-per-task=8 --mem=20GB --time=04:00:00 \
    --wrap "hostname && sleep infinity"
Warning: srun interactive jobs are not supported for GPU jobs at this time.
Warning: Slurm email notification is currently unavailable.
Job Requeueing: enable --requeue and implement checkpointing, since jobs may be preempted.
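A common checkpointing pattern is to catch the SIGTERM that Slurm sends before cancelling a job, save state, and let --requeue restart the job from that state. A minimal sketch (the checkpoint path and step counts are made up; real jobs would checkpoint model or application state instead of a counter):

```shell
#!/bin/bash
# Resume from the checkpoint file if one exists (e.g. after a requeue).
CKPT=${CKPT:-/tmp/demo_ckpt.txt}
step=0
[ -f "$CKPT" ] && step=$(cat "$CKPT")

# On SIGTERM (sent by Slurm before preemption), save state and exit.
trap 'echo "$step" > "$CKPT"; exit 143' TERM

# Stand-in for real work: advance and checkpoint each step.
while [ "$step" -lt 5 ]; do
  step=$((step + 1))
  echo "$step" > "$CKPT"
done
```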
Singularity is available on Torch at:
/share/apps/apptainer/bin/singularity
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
Use read-only mode for production jobs.
Writable overlay files (e.g., for Conda) are not reliable.
Images and wrapper scripts are available in:
/share/apps/images
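For example, a container run might look like this (the image name is a placeholder; check /share/apps/images for what is actually available):

```shell
# Bind the standard paths, then run a command inside an image read-only.
export APPTAINER_BINDPATH=/scratch,/state/partition1,/mnt,/share/apps
/share/apps/apptainer/bin/singularity exec --nv \
    /share/apps/images/some-image.sif \
    python -c 'print("hello from the container")'
```

The --nv flag exposes the host's NVIDIA GPU driver inside the container; omit it for CPU-only work.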
If you already have container-based Conda environments on Greene, they can be copied to Torch with minor edits.
We strongly recommend container-based setups (Singularity/Apptainer) on Torch.
Warning: Avoid installing packages directly on the host system.
The OS and system libraries will be updated (e.g., to RHEL 10), which will likely break host-installed software.
Containers are more robust and portable.
Torch implements preemptible partitions for GPU jobs. This means the following:
Stakeholder jobs may preempt public jobs when needed.
Use --requeue and implement checkpointing to resume jobs safely.
Please report any issues or questions by email to hpc@nyu.edu.