Parallel processing is not magic and cannot simply be applied in every situation; there are both practical and theoretical algorithmic design issues that must be considered before even thinking about incorporating parallel processing into a project.
However, the trouble associated with parallelism may well be worth it in a given situation, solely because of the potential for dramatic savings in algorithm execution time.
Over the past decade, neural networks have achieved state-of-the-art results in a wide variety of prediction tasks, including image classification, machine translation, and speech recognition. These successes have been driven, at least in part, by hardware and software improvements that have significantly accelerated neural network training. Faster training has directly resulted in dramatic improvements to model quality, both by allowing more training data to be processed and by allowing researchers to try new ideas and configurations more rapidly.
The CUDA parallel programming framework from NVIDIA is a particular implementation of the GPGPU paradigm. CUDA is technically a heterogeneous computing environment, meaning that it facilitates coordinated computing on both CPUs and GPUs.
CUDA uses a SIMD (Single Instruction, Multiple Data) style of architecture, so the same instruction can be executed on multiple data elements in parallel.
For example, recall that the cost/error function used in backpropagation involves a summation over training examples, and that summation can be computed in parallel.
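As a minimal sketch of this idea (assuming PyTorch is installed; it falls back to the CPU if no CUDA GPU is present), the element-wise differences of a squared-error cost are computed by many GPU threads at once, and the final sum is a parallel reduction on the device:

```python
import torch

# Pick the GPU if one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

predictions = torch.rand(1_000_000, device=device)
targets = torch.rand(1_000_000, device=device)

# The element-wise differences run on many GPU threads at once,
# and torch.sum performs a parallel reduction on the device.
cost = torch.sum((predictions - targets) ** 2)
print(cost.item())
```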
Model parallelism is when you split the model among GPUs and feed the same data to each part; each GPU works on a part of the model rather than a part of the data.
While model parallelism makes it possible to train neural networks that are larger than a single processor can support, it usually requires tailoring the model architecture to the available hardware.
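As a rough illustration (a minimal sketch assuming two GPUs, cuda:0 and cuda:1, are available; the class name is hypothetical), the first layers of a network can live on one device and the remaining layers on another, with activations copied between them:

```python
import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    """Hypothetical model split across two GPUs (model parallelism)."""

    def __init__(self):
        super().__init__()
        # First part of the model lives on the first GPU ...
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        # ... and the remaining layers live on the second GPU.
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations are copied between devices here; this transfer is what
        # ties the model architecture to the available hardware.
        return self.part2(x.to("cuda:1"))

model = TwoGPUNet()
output = model(torch.rand(32, 1024))
```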
In data parallelism, the data is split among the GPUs and the results are combined after computation. In contrast to model parallelism, data parallelism is model-agnostic and applicable to any ML architecture.
Data parallelism can improve model accuracy as well; speed is the commonly known benefit.
To see why, consider the following. A shortcoming of the stochastic gradient descent used in neural networks is that a gradient estimate computed from a small batch may not accurately represent the true gradient over the full dataset, so training may take much longer to converge.
A natural way to obtain a more accurate gradient estimate is to use larger batch sizes, or even the full dataset. To make this feasible, the gradients of small batches are calculated on each GPU, and the final gradient estimate is the weighted average of the gradients computed from all the small batches.
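A minimal single-process sketch of this averaging (toy sizes chosen for illustration; in a real setup each chunk would be processed on its own GPU, but the gradient arithmetic is the same):

```python
import torch
import torch.nn as nn

# Toy model and loss standing in for a real network.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# One large batch of 64 examples, split into 4 equal chunks.
x, y = torch.rand(64, 10), torch.rand(64, 1)
chunks = list(zip(x.chunk(4), y.chunk(4)))

avg_grads = [torch.zeros_like(p) for p in model.parameters()]
for xc, yc in chunks:
    model.zero_grad()
    loss_fn(model(xc), yc).backward()
    # Equal-sized chunks get equal weight in the average.
    for g, p in zip(avg_grads, model.parameters()):
        g += p.grad / len(chunks)

# avg_grads now approximates the gradient of the full 64-example batch,
# which is what averaging per-GPU gradients reconstructs in data parallelism.
```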
How much a model benefits from larger batch sizes depends on the model and the dataset (refer to the pictures below).
In the left picture below, a Transformer neural network scales to much larger batch sizes than an LSTM neural network on the LM1B dataset, so the Transformer model benefits more from data parallelism because it keeps converging faster as the batch size increases.
In the right picture below, the Common Crawl dataset does not benefit from larger batch sizes any more than the LM1B dataset does, even though it is 1,000 times the size.
Find real example
Cloud TPU is the custom-designed machine learning ASIC that powers Google products like Translate, Photos, Search, Assistant, and Gmail.
It is a special kind of processor (an ASIC rather than a GPU) designed for machine learning workloads. Refer here for details.
The PyTorch Python library supports parallelism (refer here for details), and TensorFlow supports it as well. The same is true of the NumPy library.
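For instance (a minimal sketch assuming one or more CUDA GPUs are visible), PyTorch's built-in nn.DataParallel wrapper replicates a model on every available GPU and splits each input batch among them:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 10))

if torch.cuda.is_available():
    # nn.DataParallel replicates the model on every visible GPU and splits
    # each input batch among them: data parallelism in a single line.
    model = nn.DataParallel(model).cuda()

x = torch.rand(128, 1024)
if torch.cuda.is_available():
    x = x.cuda()

output = model(x)  # each GPU processes its share of the 128 examples
```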
Speed up computing
Improve accuracy
When training neural networks, the primary ways to achieve these benefits are model parallelism, which involves distributing the neural network across different processors, and data parallelism, which involves distributing training examples across different processors and computing updates to the neural network in parallel.
https://www.kdnuggets.com/2016/11/parallelism-machine-learning-gpu-cuda-threading.html
https://images.app.goo.gl/STw7X4iTUwwVsjnz7
https://timdettmers.com/2014/11/09/model-parallelism-deep-learning/
https://www.linkedin.com/pulse/scaling-deep-learning-highlights-from-startupml-george-williams/
https://leimao.github.io/blog/Data-Parallelism-vs-Model-Paralelism/
https://ai.googleblog.com/2019/03/measuring-limits-of-data-parallel.html
https://cloud.google.com/tpu/
https://images.app.goo.gl/QZvGbRyxBgA8gA4P7
https://images.app.goo.gl/YuiSBMRKFpTUU8aZ6
https://www.telesens.co/2017/12/25/understanding-data-parallelism-in-machine-learning/
https://realpython.com/numpy-tensorflow-performance/
https://towardsdatascience.com/reasons-to-choose-pytorch-for-deep-learning-c087e031eaca
https://www.quora.com/How-does-NumPy-make-parallel-computations-Where-can-I-get-the-algorithms-of-NumPy-functions-and-their-implementations
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/dpu-high-performance-processing-unit-for-machine-learning
https://images.app.goo.gl/ZMfMrE4hTPo9yqsb7
https://www.oreilly.com/content/distributed-tensorflow/