The per-pixel loss is computed by averaging the differences between corresponding pixels of the ground-truth image and the generated image. Either the L1 or the L2 norm can be used for this. The literature notes that an L2 per-pixel loss tends to produce bland images, as the network minimizes the expected loss by making dull, averaged predictions; L1 is generally considered the better choice.
We experimented with using both for the per-pixel loss.
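A minimal sketch of the two per-pixel variants, assuming images are given as floating-point numpy arrays of identical shape (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def per_pixel_loss(generated, target, norm="l1"):
    """Mean per-pixel difference between generated and ground-truth images."""
    diff = generated.astype(np.float64) - target.astype(np.float64)
    if norm == "l1":
        return np.mean(np.abs(diff))   # L1 (Manhattan) distance
    return np.mean(diff ** 2)          # L2 (squared-error) distance
```

Averaging over all pixels (and channels) makes the loss independent of image resolution.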
Johnson et al. introduce the concept of a perceptual loss. High-level features can be extracted from a pre-trained network (VGG-16) for both the generated image and the ground-truth image, and the perceptual loss is then computed as the L2 distance between these feature maps.
Because the features come from a pre-trained network, they capture the structural and stylistic similarity between the two images rather than raw pixel agreement. Image-to-image translation problems are multi-modal and have no unique solution, so comparing style and structure in feature space gives a better measure of the model's performance.
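The idea can be sketched as follows. To stay self-contained, a simple 4x4 average-pooling stands in for the pre-trained VGG-16 activations (in practice the features would come from the network's relu layers); only the structure of the computation is the point here:

```python
import numpy as np

def placeholder_features(img):
    # Stand-in for VGG-16 activations: average-pool 4x4 patches.
    # In practice these would be feature maps from a pre-trained network.
    h, w = img.shape[0] // 4 * 4, img.shape[1] // 4 * 4
    return img[:h, :w].reshape(h // 4, 4, w // 4, 4, -1).mean(axis=(1, 3))

def perceptual_loss(generated, target, extract=placeholder_features):
    """L2 distance between feature representations of the two images."""
    fg, ft = extract(generated), extract(target)
    return np.mean((fg - ft) ** 2)
```

Because the comparison happens in feature space, two images that differ pixel-wise but share structure can still score a low perceptual loss.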
We experimented with training a network using:
- Only per-pixel L1 loss
- Only per-pixel L2 loss
- Per-pixel L1 loss and perceptual loss
- Per-pixel L2 loss and perceptual loss
- Per-pixel L1 loss and perceptual loss using Manhattan distance
- Only perceptual loss
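All of these variants can be expressed as one weighted objective. The sketch below uses a 2x2 average-pooling placeholder for the pre-trained features, and the weight hyperparameter is hypothetical, not a value from our experiments:

```python
import numpy as np

def pooled_features(img):
    # Placeholder for pre-trained VGG-16 features: 2x2 average pooling.
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2, -1).mean(axis=(1, 3))

def combined_loss(generated, target, pixel_norm="l1",
                  feature_norm="l2", weight=1.0):
    """One objective covering the experimental variants: pass pixel_norm=None
    for a purely perceptual loss, or weight=0.0 for a purely per-pixel one."""
    loss = 0.0
    if pixel_norm == "l1":
        loss += np.mean(np.abs(generated - target))
    elif pixel_norm == "l2":
        loss += np.mean((generated - target) ** 2)
    if weight > 0.0:
        fg, ft = pooled_features(generated), pooled_features(target)
        if feature_norm == "l1":  # perceptual loss via Manhattan distance
            loss += weight * np.mean(np.abs(fg - ft))
        else:
            loss += weight * np.mean((fg - ft) ** 2)
    return loss
```

Switching the two norm arguments and the weight reproduces each of the six configurations listed above.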