Torch
Torch7 (http://www.torch.ch/) provides a MATLAB-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides very efficient implementations.
Important Notes
Important: There is a single Torch module for both regular and GPU jobs on the SLURM cluster (hpctest). To run a GPU job, request a GPU node via the gpu partition: "-p gpu -C gpuk40".
If the Torch module you are looking for is not available in the installed version, you can install it in your home directory by following the HPC Software Installation Guide.
For Torch versions newer than 2016, use the gpu partition (-p gpu -C gpuk40).
Installed Versions
All available versions of Torch can be viewed by issuing the following command (this applies to other applications as well). Also note the dependency modules, intel/17 and openmpi/2.0.1, which must be loaded before loading Torch.
module spider torch
output:
---------------------------------------------------------------------------------------------------------------------
Torch: Torch/7
---------------------------------------------------------------------------------------------------------------------
Description:
Torch is a scientific computing framework with wide support for machine learning algorithms. It is easy to use
and efficient, thanks to an easy and fast scripting language, LuaJIT, and an underlying C/CUDA implementation.
You will need to load all module(s) on any one of the lines below before the "Torch/7" module is available to load.
intel/17 openmpi/2.0.1
Running Torch in HPC
Interactive Job
Request a compute node:
srun --pty /bin/bash
Load the module to setup the environment:
module load Torch
Viewing available Torch Modules
List all the installed Modules:
luarocks list
output:
cunn
   scm-1 (installed) -
sundown
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
sys
   1.1-0 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
threads
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
torch
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
trepl
   scm-1 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
xlua
   1.0-0 (installed) - /home/sxg125/Software/torch/install/lib/luarocks/rocks
....
If you can't find a module, you can install it by following the HPC Software Installation Guide.
Interactive Session:
Load torch module
module load Torch
Run
th
LuaJIT 2.0.2 -- Copyright (C) 2005-2013 Mike Pall. http://luajit.org/
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
> require 'torch'
> = torch.Tensor (5):zero()
0
0
0
0
0
[torch.DoubleTensor of dimension 5]
th> X = torch.rand(10, 10)
[0.0002s]
th> torch.inverse(X)
0.2060 -0.7466 -1.1770 -0.0381 -0.1921 2.6913 0.2922 1.1897 -2.3227 0.4178
1.3563 0.1284 -1.0628 -0.8171 1.0028 1.2230 0.3770 -0.3027 -0.6269 -0.7469
To exit, type os.exit() at the prompt or press Ctrl+C twice.
Example: Save a tensor or a set of tensors to a .mat file [3]
It is based on MATIO [4], an open-source C library for reading and writing binary MATLAB MAT files.
Load torch module
module load Torch
Copy the lua script "mat.lua" from /usr/local/doc/TORCH
cp -r /usr/local/doc/TORCH/tutorial/mat.lua .
Run:
th mat.lua
You will see 4 .mat files (test1.mat ... test4.mat)
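If you want to write a similar script yourself, the core of mat.lua can be sketched roughly as follows — a minimal, hedged example using the matio-ffi binding [3]; the exact contents of the distributed mat.lua may differ:

```lua
-- Hedged sketch (assumes the matio-ffi.torch binding is installed as the 'matio' rock)
local matio = require 'matio'

-- Save a single tensor to a MAT file
local x = torch.rand(5, 5)
matio.save('test1.mat', x)

-- Save several named tensors into one MAT file
matio.save('test2.mat', { a = torch.rand(3), b = torch.eye(4) })
```

The saved files can then be opened directly in MATLAB with load('test1.mat').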
Batch Job
Copy the directory "tutorial" from /usr/local/doc/TORCH
cp -r /usr/local/doc/TORCH/tutorial .
Go to the tutorial directory and find the job.slurm file, which uses the rand.lua script. The content of the job file is shown below:
#!/bin/bash
#SBATCH -o TorchJob.o%j
#SBATCH --time=1:00:00
#SBATCH -N 1 -n 1
cp rand.lua $PFSDIR
# cd to temporary directory
cd $PFSDIR
# Load the modules
module load intel/17 openmpi/2.0.1
module load Torch
# Run torch
th rand.lua
# Copy everything back to the working directory
cp -ru * $SLURM_SUBMIT_DIR
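The rand.lua script itself is not reproduced in this guide; a minimal stand-in exercising the same idea (generate and print a random tensor) might look like the following — the file contents here are an assumption, not the distributed script:

```lua
-- rand.lua (hedged sketch; the distributed script may differ)
torch.manualSeed(1234)        -- fix the seed so batch runs are reproducible
local x = torch.rand(4, 4)    -- 4x4 tensor of uniform random numbers in [0, 1)
print(x)
print('mean: ' .. x:mean())   -- should be roughly 0.5
```

Submit with sbatch job.slurm; the printed tensor appears in TorchJob.o<JOBID>.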
GPU Jobs
Interactive
Request a GPU node:
srun -p gpu -C gpuk40 -N 1 -n 1 --gres=gpu:1 --pty /bin/bash
Load the Torch and CUDA modules
module load Torch
module load cuda/8.0
Get the command prompt by typing "th"
In the "th>" prompt type the following:
th> require 'cutorch'
th> require 'cunn'
th> X = torch.rand(10,10)
th> Y = X:cuda()
[0.0002s]
th> Y
0.8546 0.5259 0.6145 0.5444 0.7422 0.8323 0.1137 0.0077 0.9182 0.7745
0.2601 0.4049 0.0529 0.5991 0.1574 0.1480 0.3396 0.7089 0.9551 0.6881
...
th> print( cutorch.getDeviceProperties(cutorch.getDevice()) )
{
pciDeviceID : 0
warpSize : 32
freeGlobalMem : 4232075008
minor : 3
major : 1
maxTexture1DLinear : 134217728
...
}
th> x = torch.rand(10,10)
[0.0002s]
th> y = torch.sigmoid(x)
[0.0001s]
th> y
0.6122 0.6565 0.6894 0.6841 0.6214 0.6074 0.7148 0.6519 0.6954 0.6864
0.5269 0.6540 0.5060 0.6017 0.7216 0.5654 0.5943 0.5064 0.5560 0.6969
...
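To see what the GPU buys you, you can time the same operation on the CPU and on the GPU with torch.Timer — a hedged sketch; the matrix size is arbitrary:

```lua
require 'cutorch'

local n = 2000
local a = torch.rand(n, n)

local timer = torch.Timer()
local c = torch.mm(a, a)                    -- matrix multiply on the CPU
print(('CPU: %.4f s'):format(timer:time().real))

local ag = a:cuda()                         -- copy the tensor to the K40
cutorch.synchronize(); timer:reset()
local cg = torch.mm(ag, ag)                 -- matrix multiply on the GPU
cutorch.synchronize()                       -- wait for the kernel to finish
print(('GPU: %.4f s'):format(timer:time().real))
```

Note the cutorch.synchronize() calls: GPU kernels launch asynchronously, so without them the timer would stop before the work is actually done.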
Batch Job
Copy the directory "tutorial" from /usr/local/doc/TORCH
cp -r /usr/local/doc/TORCH/tutorial .
Go to the tutorial directory and find the cuda-job.slurm file, which uses the simple.lua script. The content of the job file is shown below:
#!/bin/bash
#SBATCH -o GPUTorchJob.o%j
#SBATCH --time=1:00:00
#SBATCH -N 1 -n 1
#SBATCH -p gpu -C gpuk40 --gres=gpu:1
cp simple.lua $PFSDIR
# cd to temporary directory
cd $PFSDIR
# Load the Torch7 module
module load intel/17 openmpi/2.0.1
module load Torch
module load cuda/8.0
# Run torch
th simple.lua
# Copy everything back to the working directory
cp -r * $SLURM_SUBMIT_DIR
Run the job:
sbatch cuda-job.slurm
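The contents of simple.lua are not shown here; a minimal GPU stand-in that moves a tensor to the K40 and applies a nonlinearity might look like this — a hedged sketch, not the distributed script:

```lua
-- simple.lua (hedged sketch; the distributed script may differ)
require 'cutorch'
require 'cunn'

local x = torch.rand(1000, 1000):cuda()  -- allocate the tensor on the GPU
local y = torch.sigmoid(x)               -- elementwise sigmoid, computed on the GPU
cutorch.synchronize()                    -- make sure the kernel has finished
print('mean activation: ' .. y:float():mean())  -- copy back to the host and report
```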
Get the output as GPUTorchJob.o<JOBID>
Torch Implementation of LRCN
The LRCN (Long-term Recurrent Convolutional Networks) model proposed by Jeff Donahue et al. has been implemented as torch-lrcn [7] using the Torch7 framework. The algorithm for sequential motion recognition consists of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. We speed up LRCN by enabling GPU acceleration with CUDA on the Kepler K40 GPUs available in CWRU HPC.
Contributed by: Haotian Jiang from EECS
Copy the archive "torch-lrcn-master.tar.gz" from /usr/local/doc/TORCH to your home directory
cp /usr/local/doc/TORCH/torch-lrcn-master.tar.gz .
Untar the file and change directory to "torch-lrcn-master"
tar xzvf torch-lrcn-master.tar.gz
cd torch-lrcn-master
Copy the job file "job.slurm" from /usr/local/doc/TORCH to your home directory
cp /usr/local/doc/TORCH/job.slurm .
In the torch script "train.lua", find the line "cmd:option('-cuda', 0)". For GPU execution, replace 0 with 1.
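That option is a standard torch.CmdLine flag; in context it is typically parsed as below — a hedged sketch in which only the '-cuda' option name comes from the source, the rest is illustrative:

```lua
local cmd = torch.CmdLine()
cmd:option('-cuda', 1)          -- 1: run on the GPU, 0: run on the CPU
local opt = cmd:parse(arg or {})

if opt.cuda == 1 then
  require 'cutorch'             -- pull in the CUDA backends only when requested
  require 'cunn'
end
```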
Submit the job
sbatch job.slurm
Check the execution time in the log file "TorchJob.o<JobID>"
13:21:22 Epoch 6 validation loss: nan
13:21:23 Saved checkpoint model and opt at checkpoints/checkpoint_6.t7
4 8 12 16 20 24 28 32 36 40 44 .... 500
13:31:46 Epoch 30 training loss: 1.609733
13:31:46 Starting loss testing on the val split
13:31:46 Epoch 30 validation loss: nan
13:31:47 Saved checkpoint model and opt at checkpoints/checkpoint_final.t7
13:31:47 Finished training
Execution time with GPU:
real 16m41.871s
user 10m49.276s
sys 2m12.542s
Execution time without GPU:
real 131m9.290s
user 130m14.878s
sys 0m32.280s
Installing torch with Magma Support
Please follow the instructions at BitBucket, contributed by Jing Chen from the EECS Dept.
Refer to HPC Guide to Deep Learning & HPC Software Guide for more information.
References:
[1] Torch7 Cheatsheet (CUDA): https://github.com/torch/torch7/wiki/Cheatsheet#cuda and https://github.com/facebook/fbcunn/blob/master/INSTALL.md
[2] CUDA Example: http://code.madbits.com/wiki/doku.php?id=tutorial_cuda
[3] MATIO Example: https://github.com/soumith/matio-ffi.torch
[4] MATIO Home: https://github.com/tbeu/matio