Portrait Photo Retouching using Deep Guided Filters

Hardik Chauhan, Shubham Agarwal, Patrick Chickey

Presentation Link: shorturl.at/bqsMX



Abstract

We are trying to solve the problem of portrait photo retouching (PPR), which aims at enhancing the visual quality of a collection of flat-looking portrait photos. There is a large demand for portrait photo retouching: many people like to take portrait photos, but simply taking a picture with a nice camera does not produce an image of the quality an expert photographer could achieve by editing that picture. Of course, hiring a professional to perform such work can be expensive and time-consuming. If there were an automated system that transformed high-quality pictures into the desired final image, as if they had been edited by a professional, then such images would become much more accessible.

There do exist some machine learning models for image retouching, but for portrait photo retouching specifically, the only example we are aware of is Liang et al. [3], which we intend to replicate. This is largely because PPR has special and practical requirements such as human-region priority (HRP) and group-level consistency (GLC). These are difficult to achieve with machine learning without a portrait-specific dataset, which Liang et al. have developed and demonstrated with their model [3]. Importantly, this model is also better than other state-of-the-art techniques when applied to this dataset. As this is relatively new territory and the dataset they used is now publicly available, we are interested in replicating their findings, as well as improving on their model where possible.

Figure 1: Two examples of groups of photos from the PPR10K dataset. The top half are raw; the bottom half retouched by an expert for better visual quality and group-level consistency.

Dataset and Data Pre-Processing

For this task, we are using the PPR10K dataset developed by Liang et al. [3]. While there exist many general-purpose datasets for retouching or enhancing photos, as well as models to automate this process, they do not fulfill the PPR-specific requirements mentioned previously, HRP and GLC. HRP requires that the retouching pay attention to the human regions of the photo, and GLC requires that the tone of photos be consistent within a group that shares the same subjects and setting. To meet these requirements, Liang et al. created a PPR-specific dataset with thousands of 4k and 8k photos, including high-resolution segmentation masks for the human regions, all manually retouched by expert retouchers to provide high-quality ground truth.


The dataset consists of 11,161 4k and 8k images divided into 1,681 groups, with each group containing 3-18 photos of the same subjects in the same setting. The photos were taken with various high-quality DSLR cameras and are very diverse in terms of scene, subject, lighting, and camera settings. As the name "portrait photo retouching" suggests, the main focus of the images in the dataset is people. This means the diversity comes through changing the people and the location, including subjects both old and young, locations both indoor and outdoor, and lighting conditions day and night, as well as at different times of the year. In order to obtain high-quality ground truth, 3 different experts were hired to retouch each of the raw photos, so every photo in the dataset has 3 possible ground truths that can be used in training the model to achieve the goals of HRP and GLC. Before they are used for either training or validation, the raw and retouched photos are scaled down from 4k/8k to 360p for faster training times. Figure 1 shows what a group of photos looks like and how the raw photos compare to those retouched by experts, and Figure 2 showcases the diversity of the dataset.


The dataset is also bolstered via data augmentation, in which transformations and modifications are applied to existing images in the dataset to add more diversity and increase the number of training and validation points. Here, augmentation is performed on 6 main visual attributes: temperature, tint, exposure, highlights, contrast, and saturation. Ranges are set for all of these values based on advice from the expert retouchers to ensure the results remain realistic; each image is given random modifications within these ranges on each attribute to create the additional images. Modifying these attributes gives the dataset more diverse and representative lighting and color distributions. After data augmentation, the dataset consists of 53,250 training pairs.
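As an illustration, the sketch below samples random adjustments within fixed ranges and applies simple approximations of two of these attributes (exposure and saturation) to an RGB image tensor. The ranges, the function name, and the choice of modelling exposure as a multiplicative gain are our own assumptions for demonstration, not the exact augmentation procedure used to build PPR10K.

```python
import torch

def augment_attributes(img: torch.Tensor) -> torch.Tensor:
    """Randomly jitter a [3, H, W] RGB image with values in [0, 1].

    Only exposure and saturation are shown; temperature, tint, highlights,
    and contrast would be handled analogously. The ranges are illustrative.
    """
    # Sample adjustment strengths within hypothetical "realistic" ranges.
    exposure_ev = torch.empty(1).uniform_(-0.5, 0.5).item()  # in EV stops
    sat_scale = torch.empty(1).uniform_(0.8, 1.2).item()

    # Exposure approximated as a multiplicative gain of 2**EV.
    out = img * (2.0 ** exposure_ev)

    # Saturation approximated by blending each pixel with its luminance.
    luma = (0.299 * out[0] + 0.587 * out[1] + 0.114 * out[2]).unsqueeze(0)
    out = luma + sat_scale * (out - luma)

    return out.clamp(0.0, 1.0)
```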

Figure 2: Examples to show the diversity of the dataset

Baseline Model

We learn image-adaptive 3-dimensional lookup tables (3D LUTs) to achieve fast photo enhancement. 3D LUTs are widely used for manipulating the color and tone of photos, but they are usually manually tuned and fixed in the camera imaging pipeline or in photo editing tools. We instead learn multiple 3D LUTs, together with the importance of each 3D LUT, using a Convolutional Neural Network (CNN). An intuitive idea is to learn a classifier to perform scene classification and then use different 3D LUTs to enhance different images. The CNN weight predictor aims to understand the global context of the image, such as brightness, color, and tone, to output content-dependent weights. Therefore, it only needs to work on a down-sampled version of the input image, which largely saves computational cost. We used a human-region weighted mean squared loss on the predicted color images to optimize the model. For the human-centered measures, we set w = 1 for backgrounds and w = 5 for human regions for faster convergence.
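The human-region weighted loss described above can be written as a small function. The sketch below is a minimal version assuming the prediction, expert-retouched target, and binary human-region mask are already aligned tensors; the weights w = 1 (background) and w = 5 (human region) follow the text, while the shapes and names are our assumptions.

```python
import torch

def human_region_weighted_mse(pred: torch.Tensor,
                              target: torch.Tensor,
                              human_mask: torch.Tensor,
                              w_bg: float = 1.0,
                              w_human: float = 5.0) -> torch.Tensor:
    """Weighted MSE that emphasises human regions.

    pred, target: [B, 3, H, W] predicted / expert-retouched images.
    human_mask:   [B, 1, H, W] binary mask (1 = human region, 0 = background).
    """
    # Per-pixel weight map: w_bg on background, w_human on human pixels.
    weights = w_bg + (w_human - w_bg) * human_mask

    # Squared error per pixel, weighted and averaged.
    sq_err = (pred - target) ** 2
    return (weights * sq_err).mean()
```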

Figure 3: Baseline 3D LUT Model

Deep Guided Filter

We found that the previous baseline model was not good at preserving edge information, so we explored a guided filter-based network in this project. We perform photo retouching in two steps. First, we downsample the high-resolution images to generate low-resolution images and apply a deep convolution-based LR-Net model to generate the low-resolution output. In the second step, we feed the high-resolution input, together with the low-resolution input and output, to the guided filter. The guided filter computes a pixel-wise linear model from the low-resolution pair and upsamples it to generate the high-resolution output. The proposed model runs in real time and contains very few parameters, since most of the operations are performed on the low-resolution input and the guided filter is only used for joint upsampling, which is nothing but a linear model.
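For context, a minimal (non-learned) version of this joint upsampling step, in the spirit of the guided filter of He et al. [2], might look like the sketch below: per-pixel linear coefficients are estimated from the low-resolution input/output pair with box filtering, upsampled, and applied to the high-resolution input. The radius, epsilon, and bilinear upsampling are our assumptions; the Deep Guided Filter used in our model learns parts of this pipeline end to end.

```python
import torch
import torch.nn.functional as F

def box_mean(x: torch.Tensor, r: int) -> torch.Tensor:
    """Local mean over a (2r+1)x(2r+1) box window."""
    return F.avg_pool2d(x, kernel_size=2 * r + 1, stride=1,
                        padding=r, count_include_pad=False)

def guided_filter_upsample(lr_in: torch.Tensor, lr_out: torch.Tensor,
                           hr_in: torch.Tensor, r: int = 1,
                           eps: float = 1e-4) -> torch.Tensor:
    """Joint upsampling with a per-pixel linear model O = A * I + b.

    lr_in, lr_out: [B, C, h, w] low-resolution input and retouched output.
    hr_in:         [B, C, H, W] high-resolution input (the guide).
    """
    mean_i = box_mean(lr_in, r)
    mean_o = box_mean(lr_out, r)
    cov_io = box_mean(lr_in * lr_out, r) - mean_i * mean_o
    var_i = box_mean(lr_in * lr_in, r) - mean_i * mean_i

    # Low-resolution linear coefficients of the local model.
    a = cov_io / (var_i + eps)
    b = mean_o - a * mean_i

    # Upsample the coefficients, then apply them to the high-resolution input.
    size = hr_in.shape[-2:]
    a_hr = F.interpolate(a, size=size, mode="bilinear", align_corners=False)
    b_hr = F.interpolate(b, size=size, mode="bilinear", align_corners=False)
    return a_hr * hr_in + b_hr
```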

Figure 4: Deep Guided Filter

Evaluation

In this section, we discuss our evaluation setting for this project. We follow an evaluation protocol similar to that of [3] for a fair comparison. The metrics are listed below; a brief code sketch of two of them follows the list.

  1. The peak signal-to-noise ratio (PSNR) and the CIELAB colour difference (∆Eab, the L2 norm of the difference between prediction and ground truth in CIELAB space). These are general-purpose photo enhancement metrics.

  2. Human Region Centered: PSNR^(HC) and ∆Eab^(HC) (the baseline metrics with the human-region mask applied to the images). These assign higher priority to human regions in portrait photos, enabling better enhancement of the human-localised regions.

  3. Group Level Consistency: GLC (the sum of the variance of a group of images across colour channels). This metric measures the variation in tone and colour among a group of photos, ensuring that the global tone and colour appearance remain consistent among similar photos despite changes in image content.
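A minimal sketch of two of these metrics is given below: standard PSNR, and a GLC computed, as described above, as the sum over colour channels of the variance of each photo's mean colour within a group. The exact normalisation and colour space used in [3] may differ, so treat this only as an illustration.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor,
         max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = ((pred - target) ** 2).mean()
    return 10.0 * torch.log10(max_val ** 2 / mse)

def group_level_consistency(group: torch.Tensor) -> torch.Tensor:
    """GLC for one group of retouched photos, shape [N, 3, H, W].

    Computed here as the sum over colour channels of the variance of
    each photo's mean colour across the group (lower = more consistent).
    """
    mean_colour = group.mean(dim=(2, 3))   # [N, 3] per-image mean colour
    return mean_colour.var(dim=0).sum()
```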

Results

Once we implemented the guided filter with the PPR10K dataset, we compared our results against the basic 3D LUT model used in the Liang et al. paper. We knew going into this that the 3D LUT struggled somewhat with edge preservation, so we hoped that the guided filter output would perform better here. At this resolution it may be hard to tell, but the image on the right, the guided filter output, does preserve edges a little better than the 3D LUT result, and it also has better local tones. However, you can also tell that the color mapping isn't quite as nice in the guided filter's output as it is for the 3D LUT, and it's not quite as close to the ground truth.

Figure 5: Guided Filter Example

Here we see some actual statistics for our results, comparing against the results of the basic 3D LUT from Liang et al. With respect to closeness to the ground truth, measured by the appropriate metrics such as peak signal-to-noise ratio, color difference, human-centered PSNR, and group-level consistency, the guided filter result does not quite measure up to the highest-performing 3D LUT results. However, it still performs better than HDRNet and CSRNet, two other networks, despite their larger numbers of parameters. This is partly because they work with high-resolution data, while our guided filter downscales the images for training and prediction before upscaling back for the final result. Overall, although it does not replace the state of the art, the guided filter implementation works relatively well on this dataset.

Figure 6: Guided Filter Results

References

[1] Durand, Frédo, and Julie Dorsey. "Fast bilateral filtering for the display of high-dynamic-range images." Proceedings of the 29th annual conference on Computer graphics and interactive techniques. 2002.

[2] He, Kaiming, Jian Sun, and Xiaoou Tang. "Guided image filtering." IEEE transactions on pattern analysis and machine intelligence 35.6 (2012): 1397-1409.

[3] Liang, Jie, et al. "PPR10K: A Large-Scale Portrait Photo Retouching Dataset with Human-Region Mask and Group-Level Consistency." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

[4] Zeng, Hui, et al. "Learning image-adaptive 3D lookup tables for high performance photo enhancement in real-time." IEEE Transactions on Pattern Analysis and Machine Intelligence (2020).