FLAME can perform realistic, high-fidelity editing on a diverse set of images
Indian Institute of Science, Bangalore
Unconstrained image generation with high realism is now possible using recent Generative Adversarial Networks (GANs). However, it is still quite challenging to generate images with a given set of attributes. Recent methods use style-based GAN models to perform image editing by leveraging the semantic hierarchy present in the layers of the generator. We present Few-shot Latent-based Attribute Manipulation and Editing (FLAME), a simple yet effective framework for highly controlled image editing through latent space manipulation. Specifically, we estimate linear directions in the latent space of a pre-trained StyleGAN that control semantic attributes in the generated image. In contrast to previous methods that rely on large-scale attribute-labeled datasets or attribute classifiers, FLAME uses minimal supervision from a few curated image pairs to estimate disentangled edit directions. FLAME can perform both individual and sequential edits with high precision on a diverse set of images while preserving identity. Further, we propose the novel task of Attribute Style Manipulation to generate diverse styles for attributes such as eyeglasses and hair. We first encode a set of synthetic images of the same identity but with different attribute styles into the latent space to estimate an attribute style manifold; sampling a new latent from this manifold yields a new attribute style in the generated image. We also propose a novel sampling method that generates a diverse set of attribute styles beyond those present in the training set, and FLAME produces these styles in a disentangled manner. We demonstrate the superior performance of FLAME over previous image editing methods through extensive qualitative and quantitative comparisons. FLAME generalizes well to out-of-distribution images from the art domain as well as to other datasets such as cars and churches.
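A minimal sketch of how such a linear edit direction could be estimated from a few curated image pairs, assuming each pair has already been encoded into the StyleGAN latent space; the function names and the simple averaging-of-differences scheme are illustrative, not FLAME's exact procedure:

    import numpy as np

    def estimate_edit_direction(latents_without, latents_with):
        """Estimate a linear edit direction from a few curated latent pairs.

        latents_without, latents_with: arrays of shape (k, d) holding latent
        codes of k image pairs that differ only in the target attribute.
        """
        deltas = latents_with - latents_without        # per-pair differences
        direction = deltas.mean(axis=0)                # average difference vector
        return direction / np.linalg.norm(direction)   # unit-norm edit direction

    def apply_edit(latent, direction, strength=3.0):
        """Move a latent code along the edit direction to add the attribute."""
        return latent + strength * direction

Varying the (hypothetical) strength parameter controls how strongly the attribute appears in the regenerated image.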
Accepted in ACM Multimedia 2022
Samsung Research Institute Bangalore
Ever-increasing smartphone-generated video content demands intelligent techniques to edit and enhance videos on power-constrained devices. Most of the best-performing algorithms for video understanding tasks (action recognition, localization) rely heavily on rich spatio-temporal representations to make accurate predictions. For effective learning of these spatio-temporal representations, it is crucial to understand the underlying object motion patterns present in the video. In this paper, we propose a novel approach to understanding object motion via motion type classification. The motion type classifier predicts an object's motion type from the directional patterns of its trajectory, such as linear, projectile, oscillatory, local, and random. We show that the object motion features learned from motion classification generalize well to multiple video analysis tasks such as action recognition and video retrieval. Further, we present a recommendation system for video playback style based on the motion classifier's predictions. For action recognition, our learned representations achieved an accuracy improvement of 0.55% over Kinetics pre-trained representations on a subset of the HMDB51 dataset. To deploy on low-power computing devices like smartphones, our solution is optimized to an inference time of 200 ms on Samsung Galaxy S20 devices running the Adreno 650 GPU of the Qualcomm Snapdragon chipset.
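As a rough illustration of trajectory-based motion type classification (the paper's classifier learns features; the hand-crafted directional features, class names, and random-forest baseline below are illustrative assumptions only), assuming object trajectories are already extracted as (x, y) point sequences:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    MOTION_CLASSES = ["linear", "projectile", "oscillatory", "local", "random"]

    def trajectory_features(points):
        """Directional features for one (N, 2) object trajectory."""
        points = np.asarray(points, dtype=float)
        steps = np.diff(points, axis=0)
        step_len = np.linalg.norm(steps, axis=1) + 1e-8
        net_disp = np.linalg.norm(points[-1] - points[0])
        angles = np.arctan2(steps[:, 1], steps[:, 0])
        return np.array([
            net_disp / step_len.sum(),   # straightness: ~1 for linear motion
            np.std(np.diff(angles)),     # turning variability
            step_len.mean(),             # average speed proxy
            step_len.std(),              # speed variation
        ])

    def train_motion_classifier(trajectories, labels):
        """Fit a simple classifier on pre-labelled trajectories."""
        X = np.stack([trajectory_features(t) for t in trajectories])
        return RandomForestClassifier(n_estimators=100).fit(X, labels)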
Accepted at SRVU Workshop ICCV-2021
https://arxiv.org/abs/2110.01015
Samsung Research Institute Bangalore
Faces are the most captured objects in mobile phone cameras, and face beautification applications are among the most popular with users. We developed a face-beautification solution for Samsung mobile phones. Existing face beautification models apply heavy smoothing to remove non-uniformity in the skin region, but in the process the fine-grained skin texture is also lost. They usually apply synthetic texture after smoothing to make the output look natural, but the result often looks unrealistic. In our proposed approach, we transform the image into the wavelet domain and perform selective filtering of the low-frequency bands. This removes underlying irregularities while retaining fine skin texture. It also gives us control over the amount of beautification and texture retention based on the subject's attributes such as age, gender, and skin type. Our proposed solution outperforms the existing solutions on competitor mobile devices, and our work has been commercialized in a range of Samsung mobile phones worldwide.
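A minimal sketch of the wavelet-domain idea using PyWavelets, assuming a single-channel skin crop; the wavelet choice, decomposition level, and blur strength are illustrative placeholders, not the commercial pipeline:

    import numpy as np
    import pywt
    import cv2

    def beautify_wavelet(skin_patch, level=2, smooth_sigma=3.0):
        """Smooth only the low-frequency wavelet band of a grayscale skin
        patch; high-frequency detail bands (fine skin texture) are kept."""
        coeffs = pywt.wavedec2(skin_patch.astype(np.float32), "db4", level=level)
        approx, details = coeffs[0], coeffs[1:]
        # Filter the coarse approximation band to remove blemish-scale
        # irregularities; detail bands carrying texture are left untouched.
        approx = cv2.GaussianBlur(approx, (0, 0), smooth_sigma)
        out = pywt.waverec2([approx] + list(details), "db4")
        return np.clip(out, 0, 255).astype(np.uint8)

In this sketch, smooth_sigma (and which bands are filtered) would be the knobs that adapt beautification strength to the subject's age, gender, or skin type.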
Accepted in CVPRW 2020
BTech Thesis: IIT Delhi
Digital makeup transfer: given a source and a target image, transfer the makeup from the source image to the target image.
The transferred makeup should blend into the scene to provide a natural look. To this end, we developed a complete framework that first relights the subject image to match the illumination of the target image. We generate 3D face models from a single image and use them for realistic relighting. Next, a layer-wise decomposition is performed on the relit source and target images, and blending is done between corresponding layers to transfer the makeup. Finally, the framework includes additional modules to add facial accessories: since we have 3D models of the source and target faces, we can place accessories directly on the 3D models, which results in natural-looking output.
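A minimal sketch of layer-wise blending, assuming both faces are already aligned and relit; splitting each image into a smooth base layer plus a detail layer with a bilateral filter, and the blending weights, are illustrative choices rather than the exact decomposition used in the thesis:

    import cv2
    import numpy as np

    def decompose(face):
        """Split an aligned face (float32 in [0, 1]) into a smooth base
        layer and a detail layer using an edge-preserving bilateral filter."""
        base = cv2.bilateralFilter(face, 9, 0.1, 7)
        detail = face - base
        return base, detail

    def transfer_makeup(source, target, base_weight=0.6, detail_weight=0.8):
        """Blend corresponding layers of the (relit) source and the target."""
        src_base, src_detail = decompose(source)
        tgt_base, tgt_detail = decompose(target)
        base = base_weight * src_base + (1 - base_weight) * tgt_base
        detail = detail_weight * src_detail + (1 - detail_weight) * tgt_detail
        return np.clip(base + detail, 0.0, 1.0)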
Accepted in ICVGIP 2018
Samsung Research Institute Bangalore
Facial feature segmentation is a well-studied research problem due to its wide use in face editing applications. Traditional face parsing algorithms perform pixel-wise segmentation, which is accurate but does not produce smooth boundaries for the predicted segments; using these segmentation masks for face beautification and makeup transfer leads to unnatural results. We proposed a shape-aware segmentation algorithm that creates segmentation masks adhering to the underlying template shape of each facial feature. We built an encoder-decoder architecture in which the encoder predicts facial landmark points that the decoder then uses to generate template segments. Training uses a multi-objective loss function, so the encoder is also trained under the supervision of the segmentation mask, which led to state-of-the-art results in facial landmark prediction.
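A minimal PyTorch sketch of such a multi-objective loss, assuming the encoder regresses landmarks and the decoder outputs segmentation logits; the particular loss terms and weights are assumptions for illustration:

    import torch.nn.functional as F

    def multi_objective_loss(pred_landmarks, gt_landmarks,
                             pred_mask_logits, gt_mask,
                             landmark_weight=1.0, seg_weight=1.0):
        """Joint supervision: landmark regression from the encoder is trained
        together with the segmentation output of the decoder, so gradients
        from the mask also flow back into the landmark branch."""
        landmark_loss = F.mse_loss(pred_landmarks, gt_landmarks)
        seg_loss = F.cross_entropy(pred_mask_logits, gt_mask)
        return landmark_weight * landmark_loss + seg_weight * seg_loss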
Samsung Research Institute Bangalore
Faces are the most captured objects in mobile phone cameras, and due to the size limitations of these camera sensors, facial details are often lost during capture. A diverse set of approaches exists for single-image super-resolution in the wild, but run-time and memory constraints make them difficult to port to mobile devices. By exploiting the facial structure information present in every face image, it is possible to design an architecture with fewer parameters for face super-resolution. I developed a super-resolution architecture based on a conditional GAN framework that super-resolves a face image from 32x32 to 128x128. The architecture is inspired by SRGAN, and the CelebA dataset was used for training and validation. We trained the network with a perceptual loss, computed with a VGG feature extractor between the generated and real images, together with an adversarial loss.
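A minimal PyTorch sketch of this training objective in the style of SRGAN; the VGG layer cutoff, loss weighting, and weight-loading string are illustrative assumptions, not the exact configuration used:

    import torch
    import torch.nn as nn
    from torchvision.models import vgg19

    class PerceptualLoss(nn.Module):
        """MSE between VGG-19 features of the super-resolved and real images."""
        def __init__(self, layer_idx=35):
            super().__init__()
            vgg = vgg19(weights="IMAGENET1K_V1").features[:layer_idx].eval()
            for p in vgg.parameters():
                p.requires_grad_(False)
            self.vgg = vgg

        def forward(self, sr, hr):
            return nn.functional.mse_loss(self.vgg(sr), self.vgg(hr))

    def generator_loss(perceptual, discriminator, sr, hr, adv_weight=1e-3):
        """Generator objective: perceptual loss plus adversarial loss."""
        logits = discriminator(sr)
        adversarial = nn.functional.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
        return perceptual(sr, hr) + adv_weight * adversarial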
IIT Delhi
Active speaker detection for videos in the wild is a very difficult task: multiple people may be speaking simultaneously in a scene, the speaker might not be visible, and so on. To simplify the problem, we targeted a dataset of videos in which artists perform on stage. We started with 7 stage artists and downloaded 5-10 videos for each. The goal was to develop a system that, given any video, classifies it into one of 8 artist classes (7 artists + unknown). We relied on a simple heuristic to curate our training and testing data: given a video, we run a face detector on each frame; frames with 1-2 detected face boxes are annotated with the artist present in the video, and frames with more than 2 face boxes are annotated as the unknown class. At training time, we first use an off-the-shelf object detector to detect the person in the frame, which becomes our region of interest, and then train a CNN classifier that takes this ROI as input and classifies it into one of the artist categories. This achieved a significant performance improvement over a single-shot version in which the video frames were passed directly to the classifier.
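A minimal sketch of that frame-labelling heuristic, assuming a face detector callable that returns a list of bounding boxes; the decision to discard frames with no detected face is my assumption, since the original description does not specify it:

    def label_frames(frames, artist_label, face_detector, unknown_label="unknown"):
        """Curate training labels using the face-count heuristic: frames with
        1-2 detected faces are assumed to show the video's artist, and frames
        with more than 2 faces are labelled as the unknown class."""
        labelled = []
        for frame in frames:
            boxes = face_detector(frame)        # list of face bounding boxes
            if 1 <= len(boxes) <= 2:
                labelled.append((frame, artist_label))
            elif len(boxes) > 2:
                labelled.append((frame, unknown_label))
            # frames with no detected face are discarded (assumption)
        return labelled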
IIT Delhi
Implemented a climbing heuristic to render climbers on selected objects in a scene, enhancing content creation in computer graphics applications
Built a graph that serves as a minimal abstract representation of a plant for producing leaves and branches
Traversed the graph nodes to generate geometry with materials and textures that can finally be rendered (sketched below)
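A minimal sketch of what such a plant-graph traversal could look like; the node structure, field names, and primitive tuples are purely illustrative assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class PlantNode:
        """Abstract node of the climber graph: a branch tip with optional
        leaves and child branches."""
        position: tuple
        leaves: int = 0
        children: list = field(default_factory=list)

    def emit_geometry(node, parent_position=None, primitives=None):
        """Depth-first traversal turning the plant graph into a list of
        renderable primitives (branch segments and leaf placements)."""
        if primitives is None:
            primitives = []
        if parent_position is not None:
            primitives.append(("branch", parent_position, node.position))
        for _ in range(node.leaves):
            primitives.append(("leaf", node.position))
        for child in node.children:
            emit_geometry(child, node.position, primitives)
        return primitives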
IIT Delhi
Modeled a frog as a hierarchical model with an articulated structure
Created an animation module by interpolating key frames of joint poses and diffuse, specular, and ambient material components (sketched below)
Made an interactive game in which multiple frogs chase user-controlled insects
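A minimal sketch of key-frame interpolation for one animation channel (a joint angle or a material component); the channel values in the usage example are made up:

    import numpy as np

    def interpolate_keyframes(key_times, key_values, t):
        """Linearly interpolate an animation channel (e.g. a joint angle or a
        diffuse/specular/ambient component) between its key frames."""
        return np.interp(t, key_times, key_values)

    # Example: a frog leg joint swinging between key poses over one second.
    times = [0.0, 0.5, 1.0]
    angles = [0.0, 45.0, 0.0]
    frame_angles = [interpolate_keyframes(times, angles, t)
                    for t in np.linspace(0.0, 1.0, 24)]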
IIT Delhi
Implemented recursive ray tracing to render an image of a virtually generated 3D model by tracing the path of light through each pixel (sketched below)
Implemented a global illumination model with reflection, refraction, and shadows, and a local illumination model with diffuse, specular, and ambient components
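A minimal sketch of the recursive tracing loop, written against an assumed scene interface (scene.intersect, scene.shade_local, hit.normal, hit.material are placeholders, and refraction is omitted for brevity):

    import numpy as np

    MAX_DEPTH = 3

    def trace(ray_origin, ray_dir, scene, depth=0):
        """Recursively trace a ray: local (diffuse + specular + ambient)
        shading with shadows at the nearest hit, plus a reflected ray
        up to MAX_DEPTH bounces."""
        hit = scene.intersect(ray_origin, ray_dir)   # nearest intersection or None
        if hit is None or depth >= MAX_DEPTH:
            return scene.background
        color = scene.shade_local(hit)               # local illumination + shadows
        refl_dir = ray_dir - 2.0 * np.dot(ray_dir, hit.normal) * hit.normal
        color += hit.material.reflectivity * trace(
            hit.point + 1e-4 * hit.normal, refl_dir, scene, depth + 1)
        return color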
IIT Delhi
Implemented a 2D feature detection and description algorithm in nonlinear scale spaces using nonlinear diffusion filtering
The nonlinear scale space is built using Additive Operator Splitting (AOS) and variable-conductance diffusion (sketched below)
Feature detection is done using the Hessian at multiple scale levels, and feature description uses SURF descriptors
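A simplified sketch of one variable-conductance (Perona-Malik style) diffusion step; the actual scale space used an AOS scheme, which remains stable for much larger time steps, so this explicit step and its parameters only illustrate the idea:

    import numpy as np

    def diffusion_step(image, tau=0.2, k=0.05):
        """One explicit step of variable-conductance diffusion on a
        grayscale image in [0, 1]: smoothing is suppressed where the
        gradient magnitude is large, so edges are preserved."""
        gy, gx = np.gradient(image)
        conductance = 1.0 / (1.0 + (gx**2 + gy**2) / k**2)   # small near edges
        flux_y, flux_x = conductance * gy, conductance * gx
        divergence = np.gradient(flux_y, axis=0) + np.gradient(flux_x, axis=1)
        return image + tau * divergence

Repeating such steps with increasing diffusion time produces the nonlinear scale space in which the multi-scale Hessian detector is then evaluated.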