Torch
Torch7 (http://www.torch.ch/) provides a MATLAB-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides very efficient implementations.
Important Notes
Important: There is a single Torch module for both regular and GPU jobs on the SLURM cluster (hpctest). To run a GPU job, request a GPU node via the gpu partition: "-p gpu -C gpuk40".
If the Torch module you are looking for is not available in the installed version, you can install it in your home directory by following the HPC Software Installation Guide.
For Torch versions newer than 2016, use the gpu partition (-p gpu -C gpuk40).
Installed Versions
All available versions of Torch can be viewed by issuing the following command (this applies to other applications as well). Also note the dependency modules, intel/17 and openmpi/2.0.1, which must be loaded before loading Torch.
module spider torch
output:
---------------------------------------------------------------------------------------------------------------------
Torch: Torch/7
---------------------------------------------------------------------------------------------------------------------
Description:
Torch is a scientific computing framework with wide support for machine learning algorithms. It is easy to use
and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.
You will need to load all module(s) on any one of the lines below before the "Torch/7" module is available to load.
intel/17 openmpi/2.0.1
Running Torch in HPC
Interactive Job
Request a compute node:
srun --pty /bin/bash
Load the module to setup the environment:
module load Torch
Viewing available Torch Modules
List all the installed Modules:
luarocks list
output:
cunn
   scm-1 (installed) -
sundown
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
sys
   1.1-0 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
threads
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
torch
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
trepl
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
xlua
   1.0-0 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
....
If you can't find a module, you can install it by following the HPC Software Installation Guide.
Interactive Session:
Load torch module
module load Torch
Run
th
LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
> require 'torch'
> = torch.Tensor (5):zero()
0
0
0
0
0
[torch.DoubleTensor of dimension 5]
th> X = torch.rand(10, 10)
[0.0002s]
th> torch.inverse(X)
0.2060 -0.7466 -1.1770 -0.0381 -0.1921 2.6913 0.2922 1.1897 -2.3227 0.4178
1.3563 0.1284 -1.0628 -0.8171 1.0028 1.2230 0.3770 -0.3027 -0.6269 -0.7469
To exit, type os.exit() at the prompt or press Ctrl+C twice.
Example: Save a tensor or a set of tensors to a .mat file [3]
It is based on MATIO [4], an open-source C library for reading and writing binary MATLAB MAT files.
Load torch module
module load Torch
Copy the lua script "mat.lua" from /usr/local/doc/TORCH
cp -r /usr/local/doc/TORCH/tutorial/mat.lua .
Run:
th mat.lua
You will see 4 .mat files (test1.mat ... test4.mat)
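If you want to write a similar script yourself, the core of mat.lua can be sketched roughly as follows — a minimal, hedged example using the matio-ffi binding [3]; the exact contents of the distributed mat.lua may differ:

```lua
-- Hedged sketch (assumes the matio-ffi.torch binding is installed as the 'matio' rock)
local matio = require 'matio'

-- Save a single tensor to a MAT file
local x = torch.rand(5, 5)
matio.save('test1.mat', x)

-- Save several named tensors into one MAT file
matio.save('test2.mat', { a = torch.rand(3), b = torch.eye(4) })
```

The saved files can then be opened directly in MATLAB with load('test1.mat').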
Batch Job
Copy the directory "tutorial" from /usr/local/doc/TORCH
cp -r /usr/local/doc/TORCH/tutorial .
Go to the tutorial directory and find the job.slurm file, which uses the rand.lua script. The content of the job file is shown below:
#!/bin/bash
#SBATCH -o TorchJob.o%j
#SBATCH --time=1:00:00
#SBATCH -N 1 -n 1
cp rand.lua $PFSDIR
# cd to temporary directory
cd $PFSDIR
# Load the modules
module load intel/17 openmpi/2.0.1
module load Torch
# Run torch
th rand.lua
# Copy everything back to the working directory
cp -ru * $SLURM_SUBMIT_DIR
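The rand.lua script itself is not reproduced in this guide; a minimal stand-in exercising the same idea (generate and print a random tensor) might look like the following — the file contents here are an assumption, not the distributed script:

```lua
-- rand.lua (hedged sketch; the distributed script may differ)
torch.manualSeed(1234)        -- fix the seed so batch runs are reproducible
local x = torch.rand(4, 4)    -- 4x4 tensor of uniform random numbers in [0, 1)
print(x)
print('mean: ' .. x:mean())   -- should be roughly 0.5
```

Submit with sbatch job.slurm; the printed tensor appears in TorchJob.o<JOBID>.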
GPU Jobs
Interactive
Request a GPU node:
srun -p gpu -C gpuk40 -N 1 -n 1 --gres=gpu:1 --pty /bin/bash
Load the Torch and CUDA modules
module load Torch
module load cuda/8.0
Get the command prompt by typing "th"
In the "th>" prompt type the following:
th> require 'cutorch'
th> require 'cunn'
th> X = torch.rand(10,10)
th> Y = X:cuda()
[0.0002s]
th> Y
0.8546 0.5259 0.6145 0.5444 0.7422 0.8323 0.1137 0.0077 0.9182 0.7745
0.2601 0.4049 0.0529 0.5991 0.1574 0.1480 0.3396 0.7089 0.9551 0.6881
...
th> print( cutorch.getDeviceProperties(cutorch.getDevice()) )
{
pciDeviceID : 0
warpSize : 32
freeGlobalMem : 4232075008
minor : 3
major : 1
maxTexture1DLinear : 134217728
...
}
th> x = torch.rand(10,10)
[0.0002s]
th> y = torch.sigmoid(x)
[0.0001s]
th> y
0.6122 0.6565 0.6894 0.6841 0.6214 0.6074 0.7148 0.6519 0.6954 0.6864
0.5269 0.6540 0.5060 0.6017 0.7216 0.5654 0.5943 0.5064 0.5560 0.6969
...
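To see what the GPU buys you, you can time the same operation on the CPU and on the GPU with torch.Timer — a hedged sketch; the matrix size is arbitrary:

```lua
require 'cutorch'

local n = 2000
local a = torch.rand(n, n)

local timer = torch.Timer()
local c = torch.mm(a, a)                    -- matrix multiply on the CPU
print(('CPU: %.4f s'):format(timer:time().real))

local ag = a:cuda()                         -- copy the tensor to the K40
cutorch.synchronize(); timer:reset()
local cg = torch.mm(ag, ag)                 -- matrix multiply on the GPU
cutorch.synchronize()                       -- wait for the kernel to finish
print(('GPU: %.4f s'):format(timer:time().real))
```

Note the cutorch.synchronize() calls: GPU kernels launch asynchronously, so without them the timer would stop before the work is actually done.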
Batch Job
Copy the directory "tutorial" from /usr/local/doc/TORCH
cp -r /usr/local/doc/TORCH/tutorial .
Go to the tutorial directory and find the cuda-job.slurm file, which uses the simple.lua script. The content of the job file is shown below:
#!/bin/bash
#SBATCH -o GPUTorchJob.o%j
#SBATCH --time=1:00:00
#SBATCH -N 1 -n 1
#SBATCH -p gpu -C gpuk40 --gres=gpu:1
cp simple.lua $PFSDIR
# cd to temporary directory
cd $PFSDIR
# Load the Torch7 module
module load intel/17 openmpi/2.0.1
module load Torch
module load cuda/8.0
# Run torch
th simple.lua
# Copy everything back to the working directory
cp -r * $SLURM_SUBMIT_DIR
Run the job:
sbatch cuda-job.slurm
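The contents of simple.lua are not shown here; a minimal GPU stand-in that moves a tensor to the K40 and applies a nonlinearity might look like this — a hedged sketch, not the distributed script:

```lua
-- simple.lua (hedged sketch; the distributed script may differ)
require 'cutorch'
require 'cunn'

local x = torch.rand(1000, 1000):cuda()  -- allocate the tensor on the GPU
local y = torch.sigmoid(x)               -- elementwise sigmoid, computed on the GPU
cutorch.synchronize()                    -- make sure the kernel has finished
print('mean activation: ' .. y:float():mean())  -- copy back to the host and report
```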
Get the output as GPUTorchJob.o<JOBID>
Torch Implementation of LRCN
The LRCN (Long-term Recurrent Convolutional Networks) model proposed by Jeff Donahue et al. has been implemented as torch-lrcn [7] using the Torch7 framework. The algorithm for sequential motion recognition consists of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. We speed up LRCN by enabling GPU acceleration with CUDA on the Kepler K40 GPUs available in CWRU HPC.
Contributed by: Haotian Jiang from EECS
Copy the archive "torch-lrcn-master.tar.gz" from /usr/local/doc/TORCH to your home directory
cp /usr/local/doc/TORCH/torch-lrcn-master.tar.gz .
Untar the file and change directory to "torch-lrcn-master"
tar xzvf torch-lrcn-master.tar.gz
cd torch-lrcn-master
Copy the job file "job.slurm" from /usr/local/doc/TORCH to your home directory
cp /usr/local/doc/TORCH/job.slurm .
In the torch script "train.lua", find the line "cmd:option('-cuda', 0)". For GPU execution, replace 0 with 1.
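That option is a standard torch.CmdLine flag; in context it is typically parsed as below — a hedged sketch in which only the '-cuda' option name comes from the source, the rest is illustrative:

```lua
local cmd = torch.CmdLine()
cmd:option('-cuda', 1)          -- 1: run on the GPU, 0: run on the CPU
local opt = cmd:parse(arg or {})

if opt.cuda == 1 then
  require 'cutorch'             -- pull in the CUDA backends only when requested
  require 'cunn'
end
```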
Submit the job
sbatch job.slurm
Check the execution time in the log file "TorchJob.o<JobID>"
13:21:22 Epoch 6 validation loss: nan
13:21:23 Saved checkpoint model and opt at checkpoints/checkpoint_6.t7
4 8 12 16 20 24 28 32 36 40 44 .... 500
13:31:46 Epoch 30 training loss: 1.609733
13:31:46 Starting loss testing on the val split
13:31:46 Epoch 30 validation loss: nan
13:31:47 Saved checkpoint model and opt at checkpoints/checkpoint_final.t7
13:31:47 Finished training
Execution time with GPU:
real 16m41.871s
user 10m49.276s
sys 2m12.542s
Execution time without GPU:
real 131m9.290s
user 130m14.878s
sys 0m32.280s
Installing torch with Magma Support
Please follow the instructions at BitBucket, contributed by Jing Chen from the EECS Dept.
Refer to HPC Guide to Deep Learning & HPC Software Guide for more information.
References:
[1] Torch7 Cheatsheet (CUDA): https://github.com/torch/torch7/wiki/Cheatsheet#cuda and https://github.com/facebook/fbcunn/blob/master/INSTALL.md
[2] CUDA Example: http://code.madbits.com/wiki/doku.php?id=tutorial_cuda
[3] MATIO Example: https://github.com/soumith/matio-ffi.torch
[4] MATIO Home: https://github.com/tbeu/matio