TensorFlow is an open-source software library for machine learning with a Python API. It was originally developed by Google as an in-house replacement for DistBelief. TensorFlow 1.5.0 is currently available on the Coeus High Performance Cluster as a loadable module. Coeus also provides a Phi build of TensorFlow that takes advantage of CPU features on the phi partitions. This documentation covers how to load TensorFlow on Coeus and how to submit a TensorFlow job to the cluster.
To load the module, enter the following command:
> module load MachineLearning/tensorflow/1.5.0/tensorflow-standard-mpi
To load the module on the phi servers, enter the following command:
> module load MachineLearning/tensorflow/1.5.0/tensorflow-phi-mpi
You can then import the library in Python:
> python3
Python 3.6.4 (default, Jan 8 2018, 00:23:22)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-16)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
>>>
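A quick way to confirm which build the loaded module exposes is to print the installed version. The following is a minimal sketch rather than part of the official instructions: the helper name tensorflow_version is hypothetical, and it returns None instead of failing when TensorFlow cannot be imported (for example, when the module has not been loaded yet).

```python
import importlib
import importlib.util


def tensorflow_version():
    """Return the installed TensorFlow version string, or None if unavailable.

    Hypothetical helper for illustration; not part of TensorFlow or Coeus.
    """
    # find_spec returns None when the package cannot be located on sys.path.
    if importlib.util.find_spec("tensorflow") is None:
        return None
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # The package was found but could not be imported (e.g. missing libraries).
        return None
    return getattr(tf, "__version__", None)


if __name__ == "__main__":
    version = tensorflow_version()
    if version is None:
        print("TensorFlow not found; load the module first.")
    else:
        print("TensorFlow version:", version)
```

With the 1.5.0 module loaded, the reported version would be expected to match the module name.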
This section covers the recommended way to run a TensorFlow project on the Coeus cluster. It uses two scripts: the TensorFlow script you want to run (tensorflowHelloWorld.py), and the sbatch script that runs your Python script on the cluster (submit.sh). Once both files are in the same directory, submit the job with: sbatch submit.sh
File: tensorflowHelloWorld.py - Your TensorFlow project
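from __future__ import print_function
import tensorflow as tf
# Build a constant string tensor in the default graph (TensorFlow 1.x API).
hello = tf.constant('Hello, TensorFlow!')
# Evaluate the tensor in a session; the context manager closes it on exit.
with tf.Session() as sess:
    print(sess.run(hello))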
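Under Python 3, sess.run on a string constant returns a bytes object, so the script above prints b'Hello, TensorFlow!' rather than plain text. The sketch below shows the decoding step. It uses a stand-in bytes value instead of a real session result (an assumption made so the snippet runs without TensorFlow), and the helper name to_text is hypothetical.

```python
# Stand-in for the value returned by sess.run(hello) under Python 3.
# Assumption: the result is a bytes object, as TensorFlow 1.x returns
# for string tensors.
result = b'Hello, TensorFlow!'


def to_text(value, encoding="utf-8"):
    """Decode bytes to str; pass str values through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


print(to_text(result))  # prints: Hello, TensorFlow!
```

In the real script, the same helper could wrap the session result, as in print(to_text(sess.run(hello))).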
File: submit.sh - Your sbatch script to run your TensorFlow project
#!/bin/bash
#SBATCH --job-name my_job
#SBATCH --ntasks 10
#SBATCH --nodes 10
#SBATCH --partition allcpu
#SBATCH --output tf_test%j.txt
#SBATCH --error tf_test%j.err
module load MachineLearning/tensorflow/1.5.0/tensorflow-standard-mpi
mpirun -np 10 python3 tensorflowHelloWorld.py
Optional File: submitphi.sh - The same sbatch script as above, with phi support
#!/bin/bash
#SBATCH --job-name my_phi_job
#SBATCH --ntasks 10
#SBATCH --nodes 10
#SBATCH --partition phi
#SBATCH --output tf_test%j.txt
#SBATCH --error tf_test%j.err
module load MachineLearning/tensorflow/1.5.0/tensorflow-phi-mpi
mpirun -np 10 python3 tensorflowHelloWorld.py
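Because mpirun -np 10 starts ten copies of the script, each copy prints the same greeting. The sketch below shows one way a script could tell the copies apart by reading its rank from the environment. The variable names are assumptions about the launcher: Open MPI sets OMPI_COMM_WORLD_RANK, MPICH-style launchers set PMI_RANK, and Slurm's srun sets SLURM_PROCID. Which of these is present on Coeus depends on the MPI build behind the module. The helper name get_rank is hypothetical.

```python
import os

# Environment variables that common launchers use to expose the process rank.
# Assumption: which one is set depends on the MPI implementation in use.
# MPI-specific variables are checked first because SLURM_PROCID may describe
# the Slurm task rather than the MPI rank when mpirun does the launching.
RANK_VARIABLES = ("OMPI_COMM_WORLD_RANK", "PMI_RANK", "SLURM_PROCID")


def get_rank(environ=None, default=0):
    """Return the rank of this process, or `default` if none is advertised."""
    env = os.environ if environ is None else environ
    for name in RANK_VARIABLES:
        value = env.get(name)
        # Ignore missing or non-numeric values and try the next variable.
        if value is not None and value.isdigit():
            return int(value)
    return default


if __name__ == "__main__":
    print("Hello from rank", get_rank())
```

A script could use the returned rank to label its output or to let only rank 0 print, which keeps the job's output file readable.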
TensorFlow Beginner's Guide: https://www.tensorflow.org/get_started/get_started_for_beginners