PyTorch data parallel

PyTorch distributed training

https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html

- Create a DataParallel block

import torch.nn as nn

class DataParallelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(10, 20)
        # Wrap block2 in DataParallel so its input batch is split across the available GPUs
        self.block2 = nn.Linear(20, 20)
        self.block2 = nn.DataParallel(self.block2)
        self.block3 = nn.Linear(20, 20)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)  # the only block that runs data-parallel
        x = self.block3(x)
        return x
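A minimal usage sketch for the model above, with the class repeated so the snippet runs on its own. The batch size of 8 and the fallback to CPU are illustrative assumptions, not part of the tutorial. nn.DataParallel expects the wrapped module's parameters to be on the first GPU, so the whole model is moved there when CUDA is available; with no GPU, DataParallel just calls the wrapped module directly.

```python
import torch
import torch.nn as nn

class DataParallelModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(10, 20)
        # Only block2 is wrapped, so only block2 is split across GPUs
        self.block2 = nn.DataParallel(nn.Linear(20, 20))
        self.block3 = nn.Linear(20, 20)

    def forward(self, x):
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x)
        return x

# Assumption: use the first GPU if there is one, otherwise stay on the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = DataParallelModel().to(device)

x = torch.randn(8, 10, device=device)  # batch of 8 samples, 10 features each
out = model(x)
print(out.shape)
```

The caller does not need to know which block is parallelized: the input and output of the model are ordinary tensors on a single device.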

DataParallel is built from a few MPI-like primitives in nn.parallel, which can also be used independently:

replicate: replicate a Module on multiple devices
scatter: distribute the input along the first dimension
gather: gather and concatenate the input along the first dimension
parallel_apply: apply a set of already-distributed inputs to a set of already-distributed models.
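The real primitives move tensors between GPUs, so they need CUDA devices to run. As a rough single-device analogy (not how nn.parallel is implemented), the same split, apply, and concatenate pattern can be sketched with torch.chunk and torch.cat. The worker count of 2 and the layer sizes are hypothetical values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
module = nn.Linear(10, 5)
batch = torch.randn(8, 10)

# "scatter": split the batch along the first dimension, one chunk per pretend device
num_workers = 2  # hypothetical number of devices
chunks = torch.chunk(batch, num_workers, dim=0)

# "replicate" + "parallel_apply": replicas share the same weights, so on a
# single device we can stand in for them by applying the one module to each chunk
outputs = [module(chunk) for chunk in chunks]

# "gather": concatenate the per-chunk outputs along the first dimension
gathered = torch.cat(outputs, dim=0)

# The result should match running the whole batch at once
full = module(batch)
print(gathered.shape, torch.allclose(gathered, full, atol=1e-5))
```

This is why data parallelism leaves the forward result unchanged: each sample in the batch is processed independently by identical weights.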

For clarity, here is a data_parallel function composed from these primitives:

import torch.nn as nn

def data_parallel(module, input, device_ids, output_device=None):
    # No devices given: run the module directly
    if not device_ids:
        return module(input)

    if output_device is None:
        output_device = device_ids[0]

    # Copy the module to each device and split the input along the first dimension
    replicas = nn.parallel.replicate(module, device_ids)
    inputs = nn.parallel.scatter(input, device_ids)
    # A small batch may yield fewer chunks than devices
    replicas = replicas[:len(inputs)]
    outputs = nn.parallel.parallel_apply(replicas, inputs)
    return nn.parallel.gather(outputs, output_device)
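A short sketch of how this function might be called, with the function repeated so the snippet is self-contained. The layer sizes and batch size are illustrative assumptions. The device list is taken from torch.cuda.device_count(), so on a machine without GPUs it is empty and the function falls back to calling the module directly.

```python
import torch
import torch.nn as nn

def data_parallel(module, input, device_ids, output_device=None):
    if not device_ids:
        return module(input)
    if output_device is None:
        output_device = device_ids[0]
    replicas = nn.parallel.replicate(module, device_ids)
    inputs = nn.parallel.scatter(input, device_ids)
    replicas = replicas[:len(inputs)]
    outputs = nn.parallel.parallel_apply(replicas, inputs)
    return nn.parallel.gather(outputs, output_device)

# Use every visible GPU; the list is empty on a CPU-only machine
device_ids = list(range(torch.cuda.device_count()))

module = nn.Linear(10, 5)
batch = torch.randn(8, 10)
if device_ids:
    # The source module and the input start out on the first GPU
    first = torch.device(f"cuda:{device_ids[0]}")
    module = module.to(first)
    batch = batch.to(first)

out = data_parallel(module, batch, device_ids)
print(out.shape)
```

The output is gathered onto output_device, which defaults to the first entry of device_ids.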

A model does not have to live on a single device: some layers can run on the CPU and others on the GPU, with the intermediate data transferred between them.

Let’s look at a small example of a network where part of it is on the CPU and part on the GPU:

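import torch
import torch.nn as nn

device = torch.device("cuda:0")  # assumes a CUDA GPU is available

class DistributedModel(nn.Module):
    def __init__(self):
        super().__init__()
        # The embedding stays on the CPU; the linear layer is moved to the GPU
        self.embedding = nn.Embedding(1000, 10)
        self.rnn = nn.Linear(10, 10).to(device)

    def forward(self, x):
        # Compute embedding on CPU
        x = self.embedding(x)
        # Transfer to GPU
        x = x.to(device)
        # Compute RNN on GPU (a Linear layer stands in for the RNN here)
        x = self.rnn(x)
        return x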
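A usage sketch of the same idea, written with the sub-modules assigned as attributes in __init__ and repeated in full so it runs on its own. Two things here are assumptions added for illustration: the device falls back to the CPU when no GPU is present, and the input is a 4 x 7 batch of random token indices.

```python
import torch
import torch.nn as nn

# Assumption: fall back to the CPU so the sketch also runs without a GPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class DistributedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(1000, 10)   # stays on the CPU
        self.rnn = nn.Linear(10, 10).to(device)   # lives on `device`

    def forward(self, x):
        x = self.embedding(x)   # computed on the CPU
        x = x.to(device)        # transfer the activations
        x = self.rnn(x)         # computed on `device`
        return x

model = DistributedModel()

# Integer token indices in [0, 1000), created on the CPU for the embedding
tokens = torch.randint(0, 1000, (4, 7))
out = model(tokens)
print(out.shape, out.device)
```

The input must stay on the CPU because the embedding table is there; only the embedded activations are moved to the second device.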
