PyTorch's DataLoader supports asynchronous data loading and augmentation in separate worker subprocesses.
The default setting is num_workers=0, which means that data loading is synchronous and done in the main process; as a result, the GPU has to wait for the CPU to make data available.
Setting num_workers to a value greater than zero enables asynchronous data loading and overlap between CPU and GPU computation, which usually accelerates training significantly by reducing the CPU bottleneck (see the sketch below). The optimal value of num_workers needs to be tuned and depends on the type of data augmentation, the number of CPU cores per GPU, the location of the data (network share vs. local storage), and other factors.
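For example, here is a minimal sketch of an asynchronous loader (the toy dataset and the worker count are illustrative placeholders, not recommendations):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for a real one.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32),
                        torch.randint(0, 10, (1024,)))

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,  # > 0: batches are prepared in background worker processes
)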
DataLoader also accepts a pin_memory argument, which defaults to False. In almost all settings it is better to set pin_memory=True: this instructs DataLoader to place batches in pinned (page-locked) memory, which enables faster (and asynchronous) memory copies from the host to the device.
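Pinned memory pairs naturally with non_blocking=True on the host-to-device copy; a sketch building on the loader above (assumes a CUDA device is available):

loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

device = torch.device("cuda")
for inputs, targets in loader:
    # From pinned host memory, these copies are asynchronous and can
    # overlap with GPU compute from the previous iteration.
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # forward/backward pass goes here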
The NVIDIA DALI library is designed to accelerate the data loading and augmentation pipeline. Try it here:
https://github.com/NVIDIA/DALI
For convolutional networks, enable the cuDNN autotuner by setting:
torch.backends.cudnn.benchmark = True
before launching the training loop.
cuDNN supports many algorithms for computing convolutions. The autotuner runs a short benchmark and selects the kernel with the best performance on the given hardware for a given input size.
Currently torch.backends.cudnn.benchmark = True affects only convolutions.
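Note that the autotuner caches its choice per input configuration, so the flag pays off when input sizes are fixed; with highly variable shapes the repeated benchmarking can even slow training down. A minimal sketch (assumes a CUDA device; the layer and shapes are illustrative):

import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # enable cuDNN autotuner

conv = nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")

# The first call with a new input shape triggers the benchmark;
# subsequent calls reuse the fastest kernel found.
for _ in range(10):
    y = conv(x)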
AMP (Automatic Mixed Precision) enables Tensor-Core-accelerated training and inference.
Native PyTorch AMP is available starting from PyTorch 1.6; see the official documentation and examples.
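A minimal training-step sketch with native AMP (the model, optimizer, and data are illustrative placeholders; assumes a CUDA device):

import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

inputs = torch.randn(64, 128, device="cuda")
targets = torch.randint(0, 10, (64,), device="cuda")

optimizer.zero_grad()
with autocast():  # ops run in fp16 or fp32 as chosen by AMP
    outputs = model(inputs)
    loss = torch.nn.functional.cross_entropy(outputs, targets)

scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients, then steps the optimizer
scaler.update()                # adjusts the scale factor for the next iteration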
To fully use Tensor Cores, tensor sizes should be multiples of 8. This includes the batch size and the number of input and output channels or features for both convolutional and fully connected layers.
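For instance, when choosing layer sizes (the specific numbers below are illustrative):

import torch.nn as nn

# Channel and feature counts are multiples of 8 so the underlying
# matmuls and convolutions can map onto Tensor Cores under AMP.
model = nn.Sequential(
    nn.Conv2d(in_channels=8, out_channels=64, kernel_size=3, padding=1),
    nn.Flatten(),
    nn.Linear(64 * 32 * 32, 1024),
    nn.Linear(1024, 10),  # a small final classifier is fine; the bulk of compute is upstream
)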