https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html
- Create a DataParallel block
import torch
import torch.nn as nn

class DataParallelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(10, 20)
        # wrap block2 in DataParallel so its input batch is split across the available GPUs
        self.block2 = nn.Linear(20, 20)
        self.block2 = nn.DataParallel(self.block2)
        self.block3 = nn.Linear(20, 20)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        return x
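As a quick check of how a wrapped block behaves, the sketch below wraps a single layer in nn.DataParallel and runs one batch through it. The names wrapped, batch, and inner are illustrative and not from the tutorial. On a machine without a GPU, nn.DataParallel simply calls the wrapped module, so the sketch should run either way.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for block2: one linear layer wrapped in DataParallel
wrapped = nn.DataParallel(nn.Linear(20, 20))
if torch.cuda.is_available():
    # DataParallel expects the module's parameters on the first visible GPU
    wrapped = wrapped.cuda()

batch = torch.randn(8, 20)  # 8 samples; dim 0 is the dimension that gets split
out = wrapped(batch)

# The original layer is still reachable through the .module attribute
inner = wrapped.module
print(out.shape, type(inner).__name__)
```

The output keeps the full batch dimension: the per-device results are gathered back into one tensor before the wrapper returns.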
PyTorch's nn.parallel primitives can also be used on their own. They work like simple MPI-style collectives:
replicate: replicate a Module on multiple devices
scatter: distribute the input along the first dimension
gather: gather and concatenate the input along the first dimension
parallel_apply: apply a set of already-distributed inputs to a set of already-distributed models.
For clarity, here is a data_parallel function composed from these collectives:
def data_parallel(module, input, device_ids, output_device=None):
    # With no devices given, run the module directly
    if not device_ids:
        return module(input)

    if output_device is None:
        output_device = device_ids[0]

    # Copy the module to every device and split the input along dim 0
    replicas = nn.parallel.replicate(module, device_ids)
    inputs = nn.parallel.scatter(input, device_ids)
    # A small batch may yield fewer chunks than devices; drop the unused replicas
    replicas = replicas[:len(inputs)]
    outputs = nn.parallel.parallel_apply(replicas, inputs)
    return nn.parallel.gather(outputs, output_device)
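The scatter, apply, gather flow above can be imitated on a single CPU to see what the data movement amounts to. The sketch below is only an analogy: it uses torch.chunk and torch.cat in place of the real nn.parallel primitives, runs the chunks one after another rather than concurrently, and num_workers is a hypothetical stand-in for the number of devices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
module = nn.Linear(10, 5)
batch = torch.randn(6, 10)
num_workers = 3  # hypothetical stand-in for the number of devices

# "scatter": split the batch along the first dimension
chunks = torch.chunk(batch, num_workers, dim=0)

# "replicate" + "parallel_apply": here the same module is applied to each
# chunk in turn; the real primitives use per-device copies running concurrently
outputs = [module(chunk) for chunk in chunks]

# "gather": concatenate the partial results along the first dimension
gathered = torch.cat(outputs, dim=0)

# Splitting and re-joining the batch should match one full forward pass
same = torch.allclose(gathered, module(batch), atol=1e-5)
print(gathered.shape, same)
```

This equivalence is what makes data parallelism work for layers that treat every sample independently; layers that compute statistics across the batch would not match a full pass chunk by chunk.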
Part of a model can live on the CPU and part on the GPU: compute one part on the CPU, transfer the intermediate result, and compute the rest on the GPU.
Let’s look at a small example of a network where part of it is on the CPU and part on the GPU:
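device = torch.device("cuda:0")

class DistributedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Module.__init__ takes no submodules, so register them as attributes.
        # The embedding stays on the CPU; the linear layer is moved to the GPU.
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10).to(device)

    def forward(self, x):
        # Compute embedding on CPU
        x = self.embedding(x)
        # Transfer to GPU
        x = x.to(device)
        # Compute RNN on GPU (a linear layer stands in for the RNN here)
        x = self.rnn(x)
        return x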
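To try the mixed-device pattern on any machine, here is a minimal sketch that falls back to the CPU when no GPU is present. The fallback, the class name MixedDeviceModel, and the input shape are assumptions made for illustration, not part of the tutorial.

```python
import torch
import torch.nn as nn

# Assumption: fall back to the CPU when no GPU is present
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class MixedDeviceModel(nn.Module):  # hypothetical name
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)      # stays on the CPU
        self.linear = nn.Linear(10, 10).to(device)   # lives on `device`

    def forward(self, x):
        x = self.embedding(x)   # computed on the CPU
        x = x.to(device)        # moved to `device`
        return self.linear(x)

model = MixedDeviceModel()
tokens = torch.randint(0, 1000, (4, 7))  # 4 sequences of 7 token ids, on the CPU
out = model(tokens)
print(out.shape, out.device)
```

Note that calling model.to(device) on the whole model would move the embedding as well, which defeats the split; only the submodule meant for the GPU is moved.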