The proposed inpainting system is trained in an end-to-end manner. Given an input image with a hole I_in and its corresponding binary mask M (where 0 marks known pixels and 1 marks unknown ones), the network predicts an output I_out, which is compared against the ground-truth image I_gt.
The architecture of the proposed dense multi-scale fusion block (DMFB). Here, “Conv-3-8” indicates a 3 × 3 convolution layer with a dilation rate of 8, and ⊕ is element-wise summation. The instance normalization (IN) and ReLU activation layers following the first convolution, the second-column convolutions, and the concatenation layer are omitted for brevity. The last convolutional layer is followed only by an IN layer. The number of output channels for each convolution in DMFB is set to 64, except for the last 1 × 1 convolution (256 channels).
As depicted in the image below, the proposed framework consists of two branches: a generator branch, which produces a plausible inpainted result, and a discriminator branch, which conducts adversarial training. For image inpainting, the receptive fields should be sufficiently large. To enlarge the receptive fields while keeping the convolution kernels dense, we propose the dense multi-scale fusion block (DMFB, depicted in the image above). Specifically, the first convolution on the left of the DMFB reduces the input features to 64 channels to decrease the number of parameters, and these reduced features are then sent to four branches to extract multi-scale features. In the combination step, the output of each dilated branch is added cumulatively to the features accumulated from the preceding branches, so that dense multi-scale features are obtained from the combination of the various sparse multi-scale features (a code sketch of the block follows the next paragraph).
The following step simply fuses the concatenated features using a 1 × 1 convolution. In short, this basic block enhances ordinary dilated convolution while using fewer parameters than a large-kernel alternative.
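A minimal PyTorch sketch of a DMFB-style block, consistent with the description and the figure caption above. The dilation rates (1, 2, 4, 8), the internal width of 64 channels, the 256-channel 1 × 1 fusion, and the cumulative-addition combination follow the text; the exact placement of the post-addition 3 × 3 convolutions and the final residual connection are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class DMFB(nn.Module):
    def __init__(self, in_channels=256):
        super().__init__()
        # First 3x3 convolution reduces the input features to 64 channels.
        self.reduce = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, padding=1),
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True))
        # Four dilated 3x3 branches extract sparse multi-scale features.
        self.branches = nn.ModuleList([
            nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4, 8)])
        # 3x3 convolutions applied after each cumulative addition
        # ("second column" convolutions in the figure caption; placement assumed).
        self.combine = nn.ModuleList([
            nn.Sequential(nn.Conv2d(64, 64, kernel_size=3, padding=1),
                          nn.InstanceNorm2d(64), nn.ReLU(inplace=True))
            for _ in range(3)])
        # 1x1 convolution fuses the concatenated features back to 256 channels;
        # per the caption, it is followed only by an IN layer.
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * 64, in_channels, kernel_size=1),
            nn.InstanceNorm2d(in_channels))

    def forward(self, x):
        feat = self.reduce(x)
        sparse = [branch(feat) for branch in self.branches]
        # Cumulative addition: each branch output is added to the accumulated
        # result of the previous branches, yielding dense multi-scale features.
        dense = [sparse[0]]
        for i in range(1, 4):
            dense.append(self.combine[i - 1](sparse[i] + dense[i - 1]))
        out = self.fuse(torch.cat(dense, dim=1))
        return out + x  # residual connection (assumption)
```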
The framework of the proposed method. The activation layer following each “convolution + norm” or convolution layer in the generator is omitted for conciseness. All activations in the generator are ReLU except for the last convolution, which uses Tanh. The blue dotted box indicates the upsampler module (“TConv-4” is a 4 × 4 transposed convolution) and “s2” denotes a stride of 2.
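A minimal sketch of the upsampler module marked by the blue dotted box, assuming that “TConv-4, s2” is a 4 × 4 transposed convolution with stride 2 followed by the same norm + ReLU pattern used elsewhere in the generator (the choice of instance normalization here is an assumption).

```python
import torch.nn as nn


def upsampler(in_channels, out_channels):
    # 4x4 transposed convolution with stride 2 doubles the spatial resolution.
    return nn.Sequential(
        nn.ConvTranspose2d(in_channels, out_channels,
                           kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_channels),
        nn.ReLU(inplace=True))
```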
Self-guided regression loss
This loss function addresses the preservation of semantic structure. The proposed approach applies a self-guided regression constraint to correct the estimation at the semantic level. In short, the discrepancy map between the generated content and the corresponding ground truth is computed and used to guide the similarity measure over the hierarchy of feature maps extracted from a pre-trained VGG19 network.
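A hedged sketch of the self-guided idea described above: the pixel-level discrepancy map between the prediction and the ground truth weights an L1 distance between VGG19 feature maps. The choice of VGG layers, the normalization of the guidance map, and the bilinear downsampling are assumptions, not details taken from the text.

```python
import torch
import torch.nn.functional as F


def self_guided_regression_loss(vgg_feats_out, vgg_feats_gt, I_out, I_gt):
    """vgg_feats_out / vgg_feats_gt: lists of feature maps from the same layers
    of a pre-trained VGG19; I_out / I_gt: predicted and ground-truth images."""
    # Per-pixel discrepancy map, averaged over color channels.
    guidance = torch.mean(torch.abs(I_out - I_gt), dim=1, keepdim=True)
    guidance = guidance / (guidance.max() + 1e-8)  # assumed normalization
    loss = 0.0
    for f_out, f_gt in zip(vgg_feats_out, vgg_feats_gt):
        # Resize the guidance map to the spatial size of this feature level.
        g = F.interpolate(guidance, size=f_out.shape[2:], mode='bilinear',
                          align_corners=False)
        # Discrepancy-weighted L1 distance between the feature maps.
        loss = loss + torch.mean(g * torch.abs(f_out - f_gt))
    return loss
```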
Geometrical alignment loss
In typical solutions, the metric in the higher-level feature space is evaluated only with a pixel-based loss, e.g., L1 or L2, which does not take into account the alignment of the semantic centers of the high-level feature maps. To better measure the distance between the high-level features of the prediction and those of the ground truth, a geometrical alignment constraint is imposed. This term helps the generator create a plausible image that is spatially aligned with the target image.
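One plausible realization of such a constraint, assuming the semantic center of a high-level feature map is its activation-weighted centroid: the per-channel centroid of the predicted features is pulled toward that of the ground-truth features. The centroid definition and the squared-distance penalty below are illustrative assumptions.

```python
import torch


def geometrical_alignment_loss(feat_out, feat_gt, eps=1e-8):
    """feat_out / feat_gt: high-level feature maps of shape (B, C, H, W)."""
    B, C, H, W = feat_out.shape
    ys = torch.arange(H, dtype=feat_out.dtype, device=feat_out.device)
    xs = torch.arange(W, dtype=feat_out.dtype, device=feat_out.device)

    def centroid(f):
        # Use non-negative activations as spatial weights.
        w = f.clamp(min=0)
        total = w.sum(dim=(2, 3)) + eps                  # (B, C)
        cy = (w.sum(dim=3) * ys).sum(dim=2) / total      # weighted mean row index
        cx = (w.sum(dim=2) * xs).sum(dim=2) / total      # weighted mean column index
        return torch.stack([cy, cx], dim=-1)             # (B, C, 2)

    # Penalize the squared distance between corresponding channel centroids.
    return torch.mean((centroid(feat_out) - centroid(feat_gt)) ** 2)
```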
Feature matching losses
The VGG feature matching loss compares the activation maps from the intermediate layers of a well-trained VGG19 model. A discriminator feature matching loss is additionally introduced on the local branch, under the reasonable assumption that the output images should be consistent with the ground-truth images under any measurement (i.e., in any high-dimensional space).
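A minimal sketch of a feature matching term of the kind described above: an L1 distance between corresponding intermediate activations, whether they come from VGG19 or from the local discriminator branch. The layer selection and per-layer weighting are assumptions.

```python
import torch


def feature_matching_loss(feats_out, feats_gt, weights=None):
    """feats_out / feats_gt: lists of activation maps for the prediction and the
    ground truth, taken from the same layers of the same network."""
    if weights is None:
        weights = [1.0] * len(feats_out)
    loss = 0.0
    for w, f_out, f_gt in zip(weights, feats_out, feats_gt):
        # L1 distance between corresponding intermediate activations.
        loss = loss + w * torch.mean(torch.abs(f_out - f_gt))
    return loss
```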
Adversarial loss
To improve the visual quality of the inpainted results, a relativistic average discriminator is used, as in ESRGAN, a recent state-of-the-art perceptual image super-resolution method.
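A sketch of relativistic average (RaGAN) losses of the kind used in ESRGAN, where the discriminator estimates how much more realistic a real image is than the average fake one, and vice versa. Here `real_logits` and `fake_logits` are the raw discriminator outputs; binary cross-entropy with logits realizes the log-sigmoid terms.

```python
import torch
import torch.nn.functional as F


def ragan_d_loss(real_logits, fake_logits):
    # Real images should look more realistic than the average fake, fakes less so.
    ones = torch.ones_like(real_logits)
    zeros = torch.zeros_like(fake_logits)
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), ones)
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), zeros)
    return (loss_real + loss_fake) / 2


def ragan_g_loss(real_logits, fake_logits):
    # The generator tries to reverse the relativistic judgment.
    ones = torch.ones_like(fake_logits)
    zeros = torch.zeros_like(real_logits)
    loss_real = F.binary_cross_entropy_with_logits(
        real_logits - fake_logits.mean(), zeros)
    loss_fake = F.binary_cross_entropy_with_logits(
        fake_logits - real_logits.mean(), ones)
    return (loss_real + loss_fake) / 2
```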
Visualization of average VGG feature maps.