Using TensorFlow in Windows with a GPU

In case you missed it, TensorFlow is now available for Windows, as well as Mac and Linux. This was not always the case. For most of TensorFlow’s first year of existence, the only means of Windows support was virtualization, typically through Docker. Even without GPU support, this is great news for me. I teach a graduate course in deep learning, and dealing with students who only run Windows was always difficult. Previously, I encouraged Windows students to either use Docker or the cloud. Now everyone will be able to run TensorFlow locally.

Using your GPU for deep learning is widely reported as highly effective. Clearly, very high-end GPU clusters can do some amazing things with deep learning. However, I was curious what a high-end GPU that you might find in a laptop could offer for deep learning. In particular, I was curious how my Windows Surface Book (GPU: GeForce GT 940) would perform using the GPU vs. the CPU. Should I be using the GPU for my deep learning research? It turns out that I should be! For a simple example (see my class website), I got the following results:

CPU Version of TensorFlow: 1 hour, 54 minutes.
GPU Version of TensorFlow: 13 minutes.

The newer Surface Books have even more advanced GPUs (GeForce GT 965). The TensorFlow playing field has really changed between Mac and Windows in the last year. When TensorFlow was first released (November 2015) there was no Windows version, and I could get decent performance on my MacBook Pro (GPU: NVIDIA 650M). Now, on the first day of 2017, the new MacBook Pros are sporting a strange LCD touch bar (to replace the function keys) and an AMD GPU, both of which are useless to TensorFlow. At some point TensorFlow will probably add OpenCL support and allow AMD GPUs to run it. But for now, NVIDIA CUDA is where most of the interesting developments in deep learning are being made.

I never thought I would say this a year ago, but the Microsoft Surface Book is one of the best mainstream laptops for deep learning development. Of course, if you are willing to go outside the mainstream, there are more powerful options. Though if you need extreme heavy lifting with GPUs, you should look to the cloud.

Installing

First, you should make sure you have the correct NVIDIA drivers installed.

Installing TensorFlow into Windows Python is a simple pip command. As of the writing of this post, TensorFlow requires Python 2.7, 3.4 or 3.5. In my case I used Anaconda Python 3.5. Read here to see what is currently supported. The first thing that I did was create separate CPU and GPU environments for TensorFlow. This keeps them separate from other non-deep-learning Python environments that I have. To create my CPU TensorFlow environment, I used:

conda create --name tensorflow python=3.5
activate tensorflow
conda install jupyter
conda install scipy
pip install tensorflow

To create my GPU TensorFlow environment, I used:

conda create --name tensorflow-gpu python=3.5
activate tensorflow-gpu
conda install jupyter
conda install scipy
pip install tensorflow-gpu

Your TensorFlow code will not change when using a single GPU. You can simply run the same code by switching environments; TensorFlow will either use the GPU or not, depending on which environment you are in. You can switch between environments with:

activate tensorflow
activate tensorflow-gpu
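
To confirm which devices TensorFlow can actually see in whichever environment is active, a quick check from Python helps; the sketch below uses device_lib, which ships inside the TensorFlow package:

# Sketch: list the devices visible to TensorFlow in the current environment.
# In the tensorflow-gpu environment the output should include a /gpu:0 entry.
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())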

Conclusions

If you are training moderately sized deep learning networks and data sets on your local computer, you should probably be using your GPU, even if you are using a laptop. NVIDIA is the GPU of choice for scientific computing; while AMD hardware might be fully capable, software support for it is much more sparse.

Using GPUs

Supported devices

On a typical system, there are multiple computing devices. In TensorFlow, the supported device types are CPU and GPU. They are represented as strings. For example:

  • "/cpu:0": The CPU of your machine.
  • "/gpu:0": The GPU of your machine, if you have one.
  • "/gpu:1": The second GPU of your machine, etc.

If a TensorFlow operation has both CPU and GPU implementations, the GPU devices will be given priority when the operation is assigned to a device. For example, matmul has both CPU and GPU kernels. On a system with devices cpu:0 and gpu:0, gpu:0 will be selected to run matmul.

Logging Device placement

To find out which devices your operations and tensors are assigned to, create the session with the log_device_placement configuration option set to True.

import tensorflow as tf

# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

You should see the following output:

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K40c, pci bus id: 0000:05:00.0
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
MatMul: /job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]

Manual device placement

If you would like a particular operation to run on a device of your choice instead of what's automatically selected for you, you can use with tf.device to create a device context such that all the operations within that context will have the same device assignment.

# Creates a graph.
with tf.device('/cpu:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

You will see that now a and b are assigned to cpu:0.

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K40c, pci bus id: 0000:05:00.0
b: /job:localhost/replica:0/task:0/cpu:0
a: /job:localhost/replica:0/task:0/cpu:0
MatMul: /job:localhost/replica:0/task:0/gpu:0
[[ 22.  28.]
 [ 49.  64.]]

Allowing GPU memory growth

By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to CUDA_VISIBLE_DEVICES) visible to the process. This is done to use the relatively precious GPU memory resources on the devices more efficiently by reducing memory fragmentation.

In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two Config options on the Session to control this.

The first is the allow_growth option, which attempts to allocate only as much GPU memory as is needed by runtime allocations: it starts out allocating very little memory, and as Sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process. Note that we do not release memory, since that can lead to even worse memory fragmentation. To turn this option on, set the option in the ConfigProto by:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)

The second method is the per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. For example, you can tell TensorFlow to only allocate 40% of the total memory of each GPU by:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)

This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process.
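
The two options are not mutually exclusive. A minimal sketch (my own, not from the official guide) that both caps the process at 40% of each GPU's memory and grows allocations lazily within that cap might look like this:

import tensorflow as tf

# Sketch: bound the process to 40% of each visible GPU's memory, and within
# that bound only allocate as the running graph actually needs it.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=config)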

Using a single GPU on a multi-GPU system

If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default. If you would like to run on a different GPU, you will need to specify the preference explicitly:

# Creates a graph.
with tf.device('/gpu:2'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

If the device you have specified does not exist, you will get InvalidArgumentError:

InvalidArgumentError: Invalid argument: Cannot assign a device to node 'b':
Could not satisfy explicit device specification '/gpu:2'
   [[Node: b = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [3,2]
   values: 1 2 3...>, _device="/gpu:2"]()]]

If you would like TensorFlow to automatically choose an existing and supported device to run the operations in case the specified one doesn't exist, you can set allow_soft_placement to True in the configuration option when creating the session.

# Creates a graph.
with tf.device('/gpu:2'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Creates a session with allow_soft_placement and log_device_placement set
# to True.
sess = tf.Session(config=tf.ConfigProto(
      allow_soft_placement=True, log_device_placement=True))
# Runs the op.
print(sess.run(c))

Using multiple GPUs

If you would like to run TensorFlow on multiple GPUs, you can construct your model in a multi-tower fashion where each tower is assigned to a different GPU. For example:

# Creates a graph.
c = []
for d in ['/gpu:2', '/gpu:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
  sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))

You will see the following output.

Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: Tesla K20m, pci bus id: 0000:02:00.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: Tesla K20m, pci bus id: 0000:03:00.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: Tesla K20m, pci bus id: 0000:83:00.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: Tesla K20m, pci bus id: 0000:84:00.0
Const_3: /job:localhost/replica:0/task:0/gpu:3
Const_2: /job:localhost/replica:0/task:0/gpu:3
MatMul_1: /job:localhost/replica:0/task:0/gpu:3
Const_1: /job:localhost/replica:0/task:0/gpu:2
Const: /job:localhost/replica:0/task:0/gpu:2
MatMul: /job:localhost/replica:0/task:0/gpu:2
AddN: /job:localhost/replica:0/task:0/cpu:0
[[  44.   56.]
 [  98.  128.]]

The cifar10 tutorial is a good example demonstrating how to do training with multiple GPUs.

--------------------------------------------------------------------------------------------------------------------------

http://learningtensorflow.com/lesson10/

Installing GPU-enabled TensorFlow

If you didn’t install the GPU-enabled TensorFlow earlier, then we need to do that first. Our instructions in Lesson 1 don’t say to, so if you didn’t go out of your way to enable GPU support, then you didn’t.

I recommend that you create a new Anaconda environment for this, rather than try to update your previous one.

Before you start

Head to the official TensorFlow installation instructions, and follow the Anaconda installation instructions. The main difference between this and what we did in Lesson 1 is that you need the GPU-enabled version of TensorFlow for your system. However, before you install TensorFlow into this environment, you need to set up your computer to be GPU-enabled with CUDA and cuDNN. The official TensorFlow documentation outlines this step by step, but I recommend this tutorial if you are trying to set up a recent Ubuntu install. The main reason is that, at the time of writing (July 2016), CUDA had not yet been built for the most recent Ubuntu version, which means the process is a lot more manual.

Using your GPU

It’s quite simple really. At least, syntactically. Just change this:

# Setup operations

with tf.Session() as sess:
    # Run your code

To this:

with tf.device("/gpu:0"):
    # Setup operations

with tf.Session() as sess:
    # Run your code

This new line will create a new context manager, telling TensorFlow to perform those actions on the GPU.

Let’s have a look at a concrete example. The code below creates a random matrix with a size given at the command line. We can run the code on either a CPU or a GPU using command-line options:

import sys
import numpy as np
import tensorflow as tf
from datetime import datetime

device_name = sys.argv[1]  # Choose device from cmd line. Options: gpu or cpu
shape = (int(sys.argv[2]), int(sys.argv[2]))
if device_name == "gpu":
    device_name = "/gpu:0"
else:
    device_name = "/cpu:0"

with tf.device(device_name):
    random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)


startTime = datetime.now()
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    result = session.run(sum_operation)
    print(result)

# It can be hard to see the results on the terminal with lots of output -- add some newlines to improve readability.
print("\n" * 5)
print("Shape:", shape, "Device:", device_name)
print("Time taken:", datetime.now() - startTime)

print("\n" * 5)

You can run this at the command line with:

python matmul.py gpu 1500

This will use the GPU with a matrix of size 1500 squared. Use the following to do the same operation on the CPU:

python matmul.py cpu 1500

The first thing you’ll notice when running GPU-enabled code is a large increase in output, compared to a normal TensorFlow script. Here is what my computer prints out, before it prints out any result from the operations.

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 950M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:01:00.0
Total memory: 3.95GiB
Free memory: 3.50GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 950M, pci bus id: 0000:01:00.0)

If your code doesn’t produce output similar in nature to this, you aren’t running the GPU-enabled TensorFlow. Alternatively, if you get an error such as ImportError: libcudart.so.7.5: cannot open shared object file: No such file or directory, then you haven’t installed the CUDA library properly. In this case, you’ll need to go back and follow the instructions for installing CUDA on your system.

Try running the above code on both the CPU and GPU, increasing the number slowly. Start with 1500, then try 3000, then 4500, and so on. You’ll find that the CPU starts taking quite a long time, while the GPU is really, really fast at this operation!
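
If you would rather not launch each run by hand, a small helper script (hypothetical; it assumes the matmul.py file above sits in the current directory) can sweep the sizes for you:

# Sketch: run matmul.py over increasing sizes on both devices and let its
# own timing output show how the CPU and GPU scale.
import subprocess

for size in [1500, 3000, 4500, 6000]:
    for device in ["cpu", "gpu"]:
        print("=== size %d on %s ===" % (size, device))
        subprocess.call(["python", "matmul.py", device, str(size)])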

If you have multiple GPUs, you can use either. GPUs are zero-indexed - the above code accesses the first GPU. Changing the device to gpu:1 uses the second GPU, and so on. You can also send part of your computation to one GPU, and part to another GPU. In addition, you can access the CPUs of your machine in a similar way – just use cpu:0 (or another number).
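
As an illustration, here is a minimal sketch (mine, not from the lesson) that keeps the heavy matrix math on the first GPU and performs the small final reduction on the CPU:

import tensorflow as tf

# Heavy, highly parallel math goes to the GPU (assumes a GPU at /gpu:0).
with tf.device("/gpu:0"):
    matrix = tf.random_uniform((1000, 1000))
    product = tf.matmul(matrix, tf.transpose(matrix))

# The small final reduction can live on the CPU.
with tf.device("/cpu:0"):
    total = tf.reduce_sum(product)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(total))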

What types of operations should I send to the GPU?

In general, if a step of the process can be described as “do this mathematical operation thousands of times”, then send it to the GPU. Examples include matrix multiplication and computing the inverse of a matrix. In fact, many basic matrix operations are prime candidates for GPUs. As an overly broad and simple rule, other operations should be performed on the CPU.

There is also a cost to changing devices and using GPUs. GPUs don’t have direct access to the rest of your computer (except, of course, for the display). Because of this, if you are running a command on a GPU, you need to copy all of the data to the GPU first, then do the operation, then copy the result back to your computer’s main memory. TensorFlow handles this under the hood, so the code is simple, but the work still needs to be performed.
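
To see that cost in practice, here is a small sketch (my own, assuming a GPU-enabled TensorFlow install and a GPU at /gpu:0) that times the same tiny multiplication on each device; for small matrices, the copy to and from GPU memory plus the kernel-launch overhead can cancel out the GPU's raw speed:

import numpy as np
import tensorflow as tf
from datetime import datetime

data = np.random.rand(100, 100).astype(np.float32)  # deliberately small

# Build the same placeholder + matmul pair on each device.
ops = {}
for device in ["/cpu:0", "/gpu:0"]:
    with tf.device(device):
        x = tf.placeholder(tf.float32, shape=(100, 100))
        ops[device] = (x, tf.matmul(x, tf.transpose(x)))

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    for device, (x, product) in ops.items():
        sess.run(product, feed_dict={x: data})       # warm-up (CUDA init, etc.)
        start = datetime.now()
        for _ in range(1000):
            sess.run(product, feed_dict={x: data})   # data copied to the device every run
        print(device, "x1000 took", datetime.now() - start)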

Not all operations can be done on GPUs. If you get the following error, you are trying to do an operation that can’t be done on a GPU:

Cannot assign a device to node 'PyFunc': Could not satisfy explicit device specification '/device:GPU:1' because no devices matching that specification are registered in this process;

If this is the case, you can either manually change the device to a CPU for this operation, or set TensorFlow to automatically change the device in this case. To do this, set allow_soft_placement to True in the configuration, done as part of creating the session. The prototype looks like this:

# Allow TensorFlow to fall back to a supported device automatically.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)):
    # Run your graph here

# The same, but also logging where each operation was placed.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)):
    # Run your graph here

I also recommend logging device placement when using GPUs, as this lets you easily debug issues relating to different device usage. This prints the usage of devices to the log, allowing you to see when devices change and how that affects the graph.

--------------------------------------------------------------------------------------------------------------------------------

Previously, it was possible to run TensorFlow within a Windows environment by using a Docker container. There were many downsides to this method, the most significant of which was lack of GPU support. With GPUs often resulting in more than a 10x performance increase over CPUs, it's no wonder that people were interested in running TensorFlow natively with full GPU support. As of December 2016, this is now possible. And the best part is, it only takes about 5 minutes to set up:

Prerequisites:

GPU+ Machine

TensorFlow relies on a technology called CUDA, which is developed by NVIDIA. The GPU+ machine includes a CUDA-enabled GPU and is a great fit for TensorFlow and machine learning in general. It is possible to run TensorFlow without a GPU (using the CPU), but you'll see the performance benefit of using the GPU below.

CUDA

Download Link
Recommended version: CUDA Toolkit 8.0

The installation will offer to install the NVIDIA driver; this is already installed on the GPU+ machine, so uncheck this box to skip that step.
A restart is required to complete the installation.

cuDNN

Download Link
Recommended version: cuDNN v5.1

On Windows, cuDNN is distributed as a zip archive. Extract it and add its location to the Windows PATH. I'll extract it to C:\tools\cuda\bin and run:

set PATH=%PATH%;C:\tools\cuda\bin  

Python

Download Link

If you don't yet have Python installed, Python 3.5 from Anaconda is easy to set up. This is a pretty large installation, so it will take a few minutes. TensorFlow currently requires Python 2.7, 3.4 or 3.5.

Installing TensorFlow

First, we'll create a virtual environment for our project:

conda create --name tensorflow-gpu python=3.5  

Then activate or switch into this virtual environment:

activate tensorflow-gpu  

And finally, install TensorFlow with GPU support:

pip install tensorflow-gpu  

Test the TensorFlow installation

python  
...
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!  
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print(sess.run(a + b))
42  
>>>

The installation is complete and we're ready to run our first model.


Let's run a model!

Run a TensorFlow demo model

Now for the fun part. TensorFlow ships with a few demo models. We'll navigate to the directory where they're located and run a simple model for classifying handwritten digits from the MNIST dataset:

cd C:\Users\Paperspace\Anaconda3\envs\tensorflow-gpu\Lib\site-packages\tensorflow\models\image\mnist  
python convolutional.py  

If everything is configured correctly, you should see the training output stream by in your window, with each step taking roughly 11-12 ms to run. That's pretty impressive. To see what a huge difference the GPU makes, I'll deactivate it and run the same model in a CPU-only environment:

conda create --name tensorflow python=3.5  
activate tensorflow  
pip install tensorflow  

This time, each step takes roughly 190 ms. Leveraging the GPU results in a 17x performance increase!

It's worth mentioning that we're running this on a powerful 8-core Intel Xeon processor; the GPU speedup will often exceed these results.


Wrapping up:

Monitoring GPU utilization

Finally, here are two ways I can monitor my GPU usage:

NVIDIA-SMI

NVIDIA-SMI is a tool built into the NVIDIA driver that exposes GPU usage directly in the Command Prompt. Navigate to its location and run it:

cd C:\Program Files\NVIDIA Corporation\NVSMI  
nvidia-smi.exe  

GPU-Z

TechPowerUp makes a pretty popular GPU monitoring tool called GPU-Z which is a bit more friendly to use. Download it here.

NVIDIA-SMI and GPU-Z running side-by-side

That's it. Let us know what you think!
