PyTorch's DataLoader supports asynchronous data loading and augmentation in separate worker subprocesses.
The default setting is num_workers=0, which means that data loading is synchronous and done in the main process; as a result, the GPU has to wait for the CPU to make data available.
Setting num_workers to a value greater than zero enables asynchronous data loading and overlap between CPU and GPU computation, which usually accelerates training significantly by reducing the CPU bottleneck (see the sketch below). The optimal value of num_workers needs to be tuned and depends on the type of data augmentation, the number of CPU cores per GPU, the location of the data (network share vs. local storage), and other factors.
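For example, here is a minimal sketch of an asynchronous loader (the toy dataset and the worker count are illustrative placeholders, not recommendations):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a real one.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,  # > 0: batches are prepared in background worker processes
)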
DataLoader also accepts a pin_memory argument, which defaults to False. In almost all settings it is better to set pin_memory=True: this instructs DataLoader to place batches in pinned (page-locked) memory, which enables faster (and asynchronous) memory copies from the host to the device.
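Pinned memory pairs naturally with non_blocking=True on the host-to-device copy; a sketch building on the loader above (assumes a CUDA device is available):

loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

device = torch.device("cuda")
for inputs, targets in loader:
    # From pinned host memory, these copies are asynchronous and can
    # overlap with GPU compute from the previous iteration.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # forward/backward pass goes here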
The NVIDIA DALI library is designed to accelerate the data loading and augmentation pipeline. Try it here:
https://github.com/NVIDIA/DALI
For convolutional networks, enable the cuDNN autotuner by setting:
torch.backends.cudnn.benchmark = True
before launching the training loop.
cuDNN supports many algorithms for computing convolutions. The autotuner runs a short benchmark and selects the kernel with the best performance on the given hardware for a given input size.
Currently torch.backends.cudnn.benchmark = True affects only convolutions.
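Note that the autotuner caches its choice per input configuration, so the flag pays off when input sizes are fixed; with highly variable shapes the repeated benchmarking can even slow training down. A minimal sketch (assumes a CUDA device; the layer and shapes are illustrative):

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # enable cuDNN autotuner

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")

# The first call with a new input shape triggers the benchmark;
# subsequent calls reuse the fastest kernel found.
for _ in range(10):
    y = conv(x)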
AMP (Automatic Mixed Precision) enables Tensor-Core-accelerated training and inference.
Native PyTorch AMP is available starting from PyTorch 1.6; see the official documentation and examples.
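A minimal training-step sketch with native AMP (the model, optimizer, and data are illustrative placeholders; assumes a CUDA device):

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

inputs = torch.randn(64, 128, device="cuda")
targets = torch.randint(0, 10, (64,), device="cuda")

optimizer.zero_grad()
with autocast():  # ops run in fp16 or fp32 as chosen by AMP
    outputs = model(inputs)
    loss = torch.nn.functional.cross_entropy(outputs, targets)

scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()                # adjusts the scale factor for the next iteration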
To fully use Tensor Cores, tensor sizes should be multiples of 8. This includes the batch size and the number of input and output channels or features for both convolutional and fully connected layers.
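For instance, when choosing layer sizes (the specific numbers below are illustrative):

import torch.nn as nn

# Channel and feature counts are multiples of 8 so the underlying
# matmuls and convolutions can map onto Tensor Cores under AMP.
model = nn.Sequential(
    nn.Conv2d(in_channels=8, out_channels=64, kernel_size=3, padding=1),
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 1024),
    nn.Linear(1024, 10),  # a small final classifier is fine; the bulk of compute is upstream
)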