Intro:
https://drive.google.com/open?id=0B5mjl2eagJoWYjVZZzdneTRKeVE
1. Caffe with GPUs
https://sites.google.com/a/ku.th/gpu/caffe
Caffe with Spark
https://github.com/yahoo/CaffeOnSpark
2. Torch with GPUs
https://sites.google.com/a/ku.th/gpu/torch
3. Theano with GPUs
https://sites.google.com/a/ku.th/gpu/theano
4. TensorFlow with GPUs
Pip Installation
https://www.tensorflow.org/versions/r0.10/get_started/os_setup.html#pip-installation
https://sites.google.com/a/ku.th/gpu/tensorflow
Distributed TensorFlow
https://sites.google.com/a/ku.th/gpu/tensorflow/distributed-tensorflow
https://sites.google.com/a/ku.th/gpu/tensorflow/distributed-tensorflow2
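For orientation, here is a minimal sketch of the classic between-graph setup those pages walk through, using the pre-2.x tf.train.ClusterSpec / tf.train.Server API (host names, ports, and the toy variable are placeholders of mine):

import tensorflow as tf

# Describe the cluster: one parameter server and two workers
# (hypothetical host names and ports).
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Each process starts a server for its own job name and task index.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Pin variables to the parameter server, computation to this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(tf.zeros([10]), name="w")
    update = w.assign_add(tf.ones([10]))

# The ps process must also be running, or this session will block.
with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(update))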
TensorFlow with MPI (build from source with the MPI option)
https://arxiv.org/abs/1603.02339
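The paper describes routing TensorFlow's tensor exchanges through MPI primitives. As a rough stand-in for the core idea, here is gradient averaging via allreduce sketched with mpi4py and NumPy (the MPI build of TensorFlow wires the equivalent into its runtime):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each rank computes its own local gradient (random stand-in here).
local_grad = np.random.randn(1000).astype(np.float32)

# Sum the gradients across all ranks, then divide to average.
avg_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, avg_grad, op=MPI.SUM)
avg_grad /= comm.Get_size()

# Launch with e.g.: mpirun -np 4 python allreduce_demo.py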
Compare with NCCL:
https://www.tensorflow.org/api_docs/python/tf/contrib/nccl
http://docs.nvidia.com/deeplearning/sdk/nccl-install-guide/index.html
https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl/
https://images.nvidia.com/events/sc15/pdfs/NCCL-Woolley.pdf
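A small sketch of the contrib API from the first link, summing one tensor per GPU with a single NCCL all-reduce (TF 1.x; assumes at least two visible GPUs):

import tensorflow as tf
from tensorflow.contrib import nccl

# One tensor per GPU tower.
towers = []
for i in range(2):
    with tf.device("/gpu:%d" % i):
        towers.append(tf.random_normal([1000]))

# all_sum returns one output per input, each holding the elementwise
# sum and living on the corresponding device.
summed = nccl.all_sum(towers)

# NCCL ops must all be executed together, which running the full
# list in one step does.
with tf.Session() as sess:
    out0, out1 = sess.run(summed)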
Baidu allreduce
https://github.com/baidu-research/baidu-allreduce
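The point of Baidu's patch is the bandwidth-optimal ring allreduce. Below is a toy single-process NumPy simulation of its two phases (rank count and chunk sizes are arbitrary choices of mine); sequential in-place updates are safe here because, at each step, the chunk a rank sends is never the chunk it receives.

import numpy as np

P = 4                                    # simulated ranks
N = 8                                    # elements per rank (divisible by P)
data = [np.random.randn(N) for _ in range(P)]
expected = sum(data)

buf = [np.split(d.copy(), P) for d in data]   # P chunks per rank

# Phase 1: reduce-scatter. After P-1 steps, rank r holds the full
# sum in chunk (r+1) % P.
for s in range(P - 1):
    for r in range(P):
        c = (r - s - 1) % P                   # chunk arriving at rank r
        buf[r][c] += buf[(r - 1) % P][c]      # add left neighbour's copy

# Phase 2: allgather. Fully reduced chunks circulate around the ring.
for s in range(P - 1):
    for r in range(P):
        c = (r - s) % P
        buf[r][c] = buf[(r - 1) % P][c].copy()

assert all(np.allclose(np.concatenate(b), expected) for b in buf)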
5. Nervana: 16-bit fixed-point multiplication
https://sites.google.com/a/ku.th/parallel-computing/gpus/nervana
https://github.com/NervanaSystems/nervanagpu
https://github.com/NervanaSystems/neon.git
pip install nervananeon
. .venv/bin/activate
python examples/mnist_mlp.py -b gpu
Then install ngraph:
https://ngraph.nervanasys.com/docs/latest/walk_throughs.html
Try a 32-bit GEMM operation with NumPy int8, uint8, float16, and float32,
and compare with Nervana GPU int8, uint8, fp16, and fp32.
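As a starting point for the NumPy half of that comparison, a timing sketch (matrix sizes and the int32 widening are my choices; NumPy would otherwise multiply int8/uint8 in the input dtype and overflow):

import time
import numpy as np

m, n, k = 1024, 1024, 1024
for dtype in (np.int8, np.uint8, np.float16, np.float32):
    if np.issubdtype(dtype, np.integer):
        # Draw small integers, then widen so the GEMM accumulates in 32 bit.
        a = np.random.randint(0, 100, size=(m, k), dtype=dtype).astype(np.int32)
        b = np.random.randint(0, 100, size=(k, n), dtype=dtype).astype(np.int32)
    else:
        a = np.random.randn(m, k).astype(dtype)
        b = np.random.randn(k, n).astype(dtype)
    t0 = time.time()
    c = a @ b
    print(np.dtype(dtype).name, "->", c.dtype, "%.3f s" % (time.time() - t0))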
Here is a full example of a basic GEMM operation using 16-bit floats.
(Note: newer versions use ngraph instead, so the library calls below need updating.)
import numpy as np
import pycuda.autoinit
from nervanagpu import NervanaGPU

# initialize factory class
ng = NervanaGPU(stochastic_round=False)

m, n, k = 10, 20, 10
dtype = np.float16

# create matrices on host
cpuA = np.random.randn(k, m)
cpuB = np.random.randn(k, n)

# transfer to device
devA = ng.array(cpuA, dtype=dtype)
devB = ng.array(cpuB, dtype=dtype)
devC = ng.empty((m, n), dtype=dtype)

# do GEMM operation: C = A.T @ B
ng.dot(devA.T, devB, devC, relu=False)

# get result back from device
cpuC = devC.get()
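A note on the design choice above: stochastic_round=False uses ordinary round-to-nearest. nervanagpu also offers stochastic rounding, which is meant to reduce the bias that builds up when many small 16-bit updates are rounded the same way; it makes little difference for a one-off GEMM like this, but is worth trying for 16-bit training runs.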
6. NVIDIA Docker & Docker
https://sites.google.com/a/ku.th/gpu/nvidia
7. DIGITS
https://sites.google.com/a/ku.th/gpu/digits
My slides:
https://drive.google.com/open?id=1Sm6xL0ZfPu2D0q8UsmvMYY49FFP-6Nsy