A sample of the framework for running distributed training can be found here.
The figure below gives an overview of the distributed training process.
Let's assume that we are running experiments with a SLURM job system, using two nodes and three GPUs per node. main.py is our job, as illustrated in the figure below. After we submit the job with srun main.py (see the script here for details), main.py is executed separately and independently on each of the two nodes. This means that if a variable is modified or created in main.py on one node, it does not affect the variables of main.py on the other node.
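To make this concrete, here is a minimal sketch, assuming the standard SLURM environment variables SLURM_NNODES and SLURM_NODEID, of how each independent copy of main.py can tell which node it is running on; everything other than those SLURM variables is a placeholder for illustration, not the exact script referenced above.

```python
import os

# Standard SLURM environment variables set for every launched task.
num_nodes = int(os.environ["SLURM_NNODES"])   # 2 in this example
node_rank = int(os.environ["SLURM_NODEID"])   # 0 or 1: which node this copy of main.py is on
ngpus_per_node = 3                            # matches the three GPUs requested per node

print(f"main.py running independently on node {node_rank} of {num_nodes}")
```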
Then, on each node, the multiprocessing package is used to run the core code of main.py, let's call it main_worker(), in parallel on each of the three GPUs available on that node. So main_worker() is actually executed by six GPUs in parallel, considering the two nodes that we are using.
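The per-node fan-out could look like the following sketch, assuming torch.multiprocessing.spawn is used as the multiprocessing entry point; main_worker and args here are placeholders, not the exact code of the referenced framework.

```python
import torch.multiprocessing as mp

def main_worker(local_gpu, ngpus_per_node, args):
    # local_gpu is 0, 1, or 2 on each node; the actual training code goes here.
    pass

if __name__ == "__main__":
    ngpus_per_node = 3
    args = None  # hypothetical container for hyperparameters
    # spawn() starts ngpus_per_node copies of main_worker on this node,
    # passing each copy its local GPU index as the first argument.
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
```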
In each main_worker(), we create a dataset, a learning model, and an optimizer, and train the model for multiple epochs. So we need something that provides a communication tunnel between the six GPUs and synchronizes the tasks running on them. PyTorch's distributed package, used through DistributedDataParallel (DDP), plays this role. It is initialized at the beginning of main_worker() via the function init_process_group(...), which requires the global GPU rank (see the figure below), an IP address, and a free port (see the script here for details of how to obtain them).
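The initialization step might look like this minimal sketch; the backend choice, the rank arithmetic, and the init_ddp helper are assumptions for illustration, matching the 2-node, 3-GPU setup (world size 6) rather than the exact referenced script.

```python
import torch
import torch.distributed as dist

def init_ddp(node_rank, local_gpu, ngpus_per_node, ip, port):
    world_size = 2 * ngpus_per_node                        # total number of GPUs = 6
    global_rank = node_rank * ngpus_per_node + local_gpu   # 0..5, unique per GPU
    dist.init_process_group(
        backend="nccl",                                    # common backend for GPU training
        init_method=f"tcp://{ip}:{port}",                  # IP address and free port of the master node
        world_size=world_size,
        rank=global_rank,
    )
    torch.cuda.set_device(local_gpu)                       # bind this process to its GPU
    return global_rank
```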
Principles when using DDP:
(1) If we set the total number of GPUs (the world size) to 10 while only having six GPUs, init_process_group(...) will keep waiting until the job is automatically killed after a long wait.
(2) Each GPU has its own copy of the learning model, and the six copies are identical. At each epoch, the DistributedSampler assigns each GPU a mini-batch of the same batch size, e.g., 128, so in total we are effectively training with a batch size of 128*6=768. Each model replica then processes its mini-batch and computes gradients. DDP averages the gradients across all GPUs and distributes the result back, so every replica is updated with the same gradients (see the sketch after this list).
(3) The training errors differ across GPUs because each GPU uses a different mini-batch.
(4) The testing errors are identical across GPUs, so we only need to print the best test error and save the corresponding checkpoint on one GPU, e.g., the GPU with a global rank of 0.
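To illustrate points (2)-(4), here is a minimal sketch of one possible training loop inside main_worker(); the dataset, model, optimizer, and hyperparameters are placeholders, not the exact framework referenced in this document.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(model, train_set, local_gpu, global_rank, epochs=10):
    model = model.cuda(local_gpu)
    ddp_model = DDP(model, device_ids=[local_gpu])    # identical replica on each GPU

    sampler = DistributedSampler(train_set)           # gives each GPU its own shard of the data
    loader = DataLoader(train_set, batch_size=128,    # per-GPU batch size -> 768 effective
                        sampler=sampler, num_workers=12, pin_memory=True)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
    criterion = torch.nn.CrossEntropyLoss().cuda(local_gpu)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                      # reshuffle the shards each epoch
        for images, targets in loader:
            images = images.cuda(local_gpu, non_blocking=True)
            targets = targets.cuda(local_gpu, non_blocking=True)
            loss = criterion(ddp_model(images), targets)
            optimizer.zero_grad()
            loss.backward()                           # DDP averages gradients across the six GPUs here
            optimizer.step()                          # every replica applies the same update

        if global_rank == 0:                          # the test error is the same on every GPU,
            torch.save(ddp_model.state_dict(),        # so save the checkpoint on rank 0 only
                       "checkpoint.pth")
```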
Each GPU is controlled by one process, which runs on a CPU core (often called a worker). Usually, we request a number of workers when executing srun main.py, e.g., 42 workers per GPU. So if we use 12 workers per GPU for the dataloader and one for controlling the GPU, the remaining 29 workers will be idle.
You can try out DDP yourself with the framework available here.