FLAME can perform realistic, high-fidelity editing on a diverse set of images
Indian Institute of Science, Bangalore
Unconstrained image generation with high realism is now possible using recent Generative Adversarial Networks (GANs). However, it is still quite challenging to generate images with a given set of attributes. Recent methods use style-based GAN models to perform image editing by leveraging the semantic hierarchy present in the layers of the generator. We present Few-shot Latent-based Attribute Manipulation and Editing (FLAME), a simple yet effective framework for highly controlled image editing through latent space manipulation. Specifically, we estimate linear directions in the latent space of a pre-trained StyleGAN that control semantic attributes in the generated image. In contrast to previous methods that rely on large-scale attribute-labeled datasets or attribute classifiers, FLAME uses minimal supervision from a few curated image pairs to estimate disentangled edit directions. FLAME can perform both individual and sequential edits with high precision on a diverse set of images while preserving identity. Further, we propose the novel task of Attribute Style Manipulation to generate diverse styles for attributes such as eyeglasses and hair. We first encode a set of synthetic images of the same identity but with different attribute styles into the latent space to estimate an attribute style manifold; sampling a new latent from this manifold yields a new attribute style in the generated image. We also propose a novel sampling method that generates a diverse set of attribute styles beyond those present in the training set, and FLAME produces these styles in a disentangled manner. We demonstrate the superior performance of FLAME over previous image editing methods through extensive qualitative and quantitative comparisons. FLAME generalizes well to out-of-distribution images from the art domain as well as to other datasets such as cars and churches.
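A minimal sketch of how such a linear edit direction could be estimated from a few curated image pairs, assuming each pair has already been encoded into the StyleGAN latent space; the function names and the simple averaging-of-differences scheme are illustrative, not FLAME's exact procedure:

    import numpy as np

    def estimate_edit_direction(latents_without, latents_with):
        """Estimate a linear edit direction from a few curated latent pairs.

        latents_without, latents_with: arrays of shape (k, d) holding latent
        codes of k image pairs that differ only in the target attribute.
        """
        deltas = latents_with - latents_without        # per-pair differences
        direction = deltas.mean(axis=0)                # average difference vector
        return direction / np.linalg.norm(direction)   # unit-norm edit direction

    def apply_edit(latent, direction, strength=3.0):
        """Move a latent code along the edit direction to add the attribute."""
        return latent + strength * direction

Varying the (hypothetical) strength parameter controls how strongly the attribute appears in the regenerated image.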
Accepted in ACM Multimedia 2022
Samsung Research Institute Bangalore
Ever-increasing smartphone-generated video content demands intelligent techniques to edit and enhance videos on power-constrained devices. Most of the best-performing algorithms for video understanding tasks (action recognition, localization) rely heavily on rich spatio-temporal representations to make accurate predictions. For effective learning of these spatio-temporal representations, it is crucial to understand the underlying object motion patterns present in the video. In this paper, we propose a novel approach to understanding object motion via motion type classification. The motion type classifier predicts an object's motion type from the directional patterns of its trajectory, such as linear, projectile, oscillatory, local, and random. We show that the object motion features learned from motion classification generalize well to multiple video analysis tasks such as action recognition and video retrieval. Further, we present a recommendation system for video playback style based on the motion classifier's predictions. For action recognition, our learned representations achieved an accuracy improvement of 0.55% over Kinetics pre-trained representations on a subset of the HMDB51 dataset. To deploy on low-power computing devices like smartphones, our solution is optimized to an inference time of 200 ms on Samsung Galaxy S20 devices running the Adreno 650 GPU of the Qualcomm Snapdragon chipset.
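As a rough illustration of trajectory-based motion type classification (the paper's classifier learns features; the hand-crafted directional features, class names, and random-forest baseline below are illustrative assumptions only), assuming object trajectories are already extracted as (x, y) point sequences:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    MOTION_CLASSES = ["linear", "projectile", "oscillatory", "local", "random"]

    def trajectory_features(points):
        """Directional features for one (N, 2) object trajectory."""
        points = np.asarray(points, dtype=float)
        steps = np.diff(points, axis=0)
        step_len = np.linalg.norm(steps, axis=1) + 1e-8
        net_disp = np.linalg.norm(points[-1] - points[0])
        angles = np.arctan2(steps[:, 1], steps[:, 0])
        return np.array([
            net_disp / step_len.sum(),   # straightness: ~1 for linear motion
            np.std(np.diff(angles)),     # turning variability
            step_len.mean(),             # average speed proxy
            step_len.std(),              # speed variation
        ])

    def train_motion_classifier(trajectories, labels):
        """Fit a simple classifier on pre-labelled trajectories."""
        X = np.stack([trajectory_features(t) for t in trajectories])
        return RandomForestClassifier(n_estimators=100).fit(X, labels)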
Accepted at SRVU Workshop ICCV-2021
https://arxiv.org/abs/2110.01015
Samsung Research Institute Bangalore
Faces are the most captured objects in mobile phone cameras, and face beautification applications are among the most popular with users. We developed a face-beautification solution for Samsung mobile phones. Existing face beautification models apply heavy smoothing to remove non-uniformity in the skin region, but in the process the fine-grained skin texture is also lost. They usually apply synthetic texture after smoothing to make the output look natural, but the result often looks unrealistic. In our proposed approach, we transform the image into the wavelet domain and perform selective filtering of the low-frequency bands. This removes underlying irregularities while retaining fine skin texture. It also gives us control over the amount of beautification and texture retention based on the subject's attributes such as age, gender, and skin type. Our proposed solution outperforms the existing solutions on competitor mobile devices, and our work has been commercialized in a range of Samsung mobile phones worldwide.
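A minimal sketch of the wavelet-domain idea using PyWavelets, assuming a single-channel skin crop; the wavelet choice, decomposition level, and blur strength are illustrative placeholders, not the commercial pipeline:

    import numpy as np
    import pywt
    import cv2

    def beautify_wavelet(skin_patch, level=2, smooth_sigma=3.0):
        """Smooth only the low-frequency wavelet band of a grayscale skin
        patch; high-frequency detail bands (fine skin texture) are kept."""
        coeffs = pywt.wavedec2(skin_patch.astype(np.float32), "db4", level=level)
        approx, details = coeffs[0], coeffs[1:]
        # Filter the coarse approximation band to remove blemish-scale
        # irregularities; detail bands carrying texture are left untouched.
        approx = cv2.GaussianBlur(approx, (0, 0), smooth_sigma)
        out = pywt.waverec2([approx] + list(details), "db4")
        return np.clip(out, 0, 255).astype(np.uint8)

In this sketch, smooth_sigma (and which bands are filtered) would be the knobs that adapt beautification strength to the subject's age, gender, or skin type.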
Accepted in CVPRW 2020
BTech Thesis: IIT Delhi
Digital makeup transfer: given a source and a target image, transfer the makeup from the source image to the target image.
The transferred makeup should blend into the scene to provide a natural look. To this end, we developed a complete framework that first relights the subject image to match the illumination of the target image. We generate 3D face models from a single image and use them for realistic relighting. Next, a layer-wise decomposition is performed on the relit source and target images, and blending is done between corresponding layers to transfer the makeup. Finally, the framework includes additional modules to add facial accessories: since we have 3D models of the source and target faces, we can place accessories directly on the 3D models, which results in natural-looking output.
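A minimal sketch of layer-wise blending, assuming both faces are already aligned and relit; splitting each image into a smooth base layer plus a detail layer with a bilateral filter, and the blending weights, are illustrative choices rather than the exact decomposition used in the thesis:

    import cv2
    import numpy as np

    def decompose(face):
        """Split an aligned face (float32 in [0, 1]) into a smooth base
        layer and a detail layer using an edge-preserving bilateral filter."""
        base = cv2.bilateralFilter(face, 9, 0.1, 7)
        detail = face - base
        return base, detail

    def transfer_makeup(source, target, base_weight=0.6, detail_weight=0.8):
        """Blend corresponding layers of the (relit) source and the target."""
        src_base, src_detail = decompose(source)
        tgt_base, tgt_detail = decompose(target)
        base = base_weight * src_base + (1 - base_weight) * tgt_base
        detail = detail_weight * src_detail + (1 - detail_weight) * tgt_detail
        return np.clip(base + detail, 0.0, 1.0)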
Accepted in ICVGIP 2018
Samsung Research Institute Bangalore
Facial feature segmentation is a well-studied research problem due to its wide use in face editing applications. Traditional face parsing algorithms perform pixel-wise segmentation, which is accurate but does not produce smooth boundaries for the predicted segments; using these segmentation masks for face beautification and makeup transfer leads to unnatural results. We proposed a shape-aware segmentation algorithm that creates segmentation masks adhering to the underlying template shape of each facial feature. We built an encoder-decoder architecture in which the encoder predicts facial landmark points that the decoder then uses to generate template segments. Training uses a multi-objective loss function, so the encoder is also trained under the supervision of the segmentation mask, which led to state-of-the-art results in facial landmark prediction.
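A minimal PyTorch sketch of such a multi-objective loss, assuming the encoder regresses landmarks and the decoder outputs segmentation logits; the particular loss terms and weights are assumptions for illustration:

    import torch.nn.functional as F

    def multi_objective_loss(pred_landmarks, gt_landmarks,
                             pred_mask_logits, gt_mask,
                             landmark_weight=1.0, seg_weight=1.0):
        """Joint supervision: landmark regression from the encoder is trained
        together with the segmentation output of the decoder, so gradients
        from the mask also flow back into the landmark branch."""
        landmark_loss = F.mse_loss(pred_landmarks, gt_landmarks)
        seg_loss = F.cross_entropy(pred_mask_logits, gt_mask)
        return landmark_weight * landmark_loss + seg_weight * seg_loss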
Samsung Research Institute Bangalore
Faces are the most captured objects in mobile phone cameras, and due to the size limitations of these camera sensors, facial details are often lost during capture. A diverse set of approaches exists for single-image super-resolution in the wild, but run-time and memory constraints make them difficult to port to mobile devices. By exploiting the facial structure information present in every face image, it is possible to design an architecture with fewer parameters for face super-resolution. I developed a super-resolution architecture based on a conditional GAN framework that super-resolves a face image from 32x32 to 128x128. The architecture is inspired by SRGAN, and the CelebA dataset was used for training and validation. We trained the network with a perceptual loss, computed with a VGG feature extractor between the generated and real images, together with an adversarial loss.
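A minimal PyTorch sketch of this training objective in the style of SRGAN; the VGG layer cutoff, loss weighting, and weight-loading string are illustrative assumptions, not the exact configuration used:

    import torch
    import torch.nn as nn
    from torchvision.models import vgg19

    class PerceptualLoss(nn.Module):
        """MSE between VGG-19 features of the super-resolved and real images."""
        def __init__(self, layer_idx=35):
            super().__init__()
            vgg = vgg19(weights="IMAGENET1K_V1").features[:layer_idx].eval()
            for p in vgg.parameters():
                p.requires_grad_(False)
            self.vgg = vgg

        def forward(self, sr, hr):
            return nn.functional.mse_loss(self.vgg(sr), self.vgg(hr))

    def generator_loss(perceptual, discriminator, sr, hr, adv_weight=1e-3):
        """Generator objective: perceptual loss plus adversarial loss."""
        logits = discriminator(sr)
        adversarial = nn.functional.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
        return perceptual(sr, hr) + adv_weight * adversarial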
IIT Delhi
Active speaker detection for videos in the wild is a very difficult task: multiple people may be speaking simultaneously in a scene, the speaker might not be visible, and so on. To simplify the problem, we targeted a dataset of videos in which artists perform on stage. We started with 7 stage artists and downloaded 5-10 videos for each. The goal was to develop a system that, given any video, classifies it into one of 8 artist classes (7 artists + unknown). We relied on a simple heuristic to curate our training and testing data: given a video, we run a face detector on each frame; frames with 1-2 detected face boxes are annotated with the artist present in the video, and frames with more than 2 face boxes are annotated as the unknown class. At training time, we first use an off-the-shelf object detector to detect the person in the frame, which becomes our region of interest, and then train a CNN classifier that takes this ROI as input and classifies it into one of the artist categories. This achieved a significant performance improvement over a single-shot version in which the video frames were passed directly to the classifier.
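A minimal sketch of that frame-labelling heuristic, assuming a face detector callable that returns a list of bounding boxes; the decision to discard frames with no detected face is my assumption, since the original description does not specify it:

    def label_frames(frames, artist_label, face_detector, unknown_label="unknown"):
        """Curate training labels using the face-count heuristic: frames with
        1-2 detected faces are assumed to show the video's artist, and frames
        with more than 2 faces are labelled as the unknown class."""
        labelled = []
        for frame in frames:
            boxes = face_detector(frame)        # list of face bounding boxes
            if 1 <= len(boxes) <= 2:
                labelled.append((frame, artist_label))
            elif len(boxes) > 2:
                labelled.append((frame, unknown_label))
            # frames with no detected face are discarded (assumption)
        return labelled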
IIT Delhi
Implemented a climbing heuristic to render climbers on selected objects in a scene, enhancing content creation in computer graphics applications
Built a graph that serves as a minimal abstract representation of a plant for producing leaves and branches
Traversed the graph nodes to generate geometry with materials and textures that can finally be rendered (sketched below)
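A minimal sketch of what such a plant-graph traversal could look like; the node structure, field names, and primitive tuples are purely illustrative assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class PlantNode:
        """Abstract node of the climber graph: a branch tip with optional
        leaves and child branches."""
        position: tuple
        leaves: int = 0
        children: list = field(default_factory=list)

    def emit_geometry(node, parent_position=None, primitives=None):
        """Depth-first traversal turning the plant graph into a list of
        renderable primitives (branch segments and leaf placements)."""
        if primitives is None:
            primitives = []
        if parent_position is not None:
            primitives.append(("branch", parent_position, node.position))
        for _ in range(node.leaves):
            primitives.append(("leaf", node.position))
        for child in node.children:
            emit_geometry(child, node.position, primitives)
        return primitives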
IIT Delhi
Modeled a frog as a hierarchical model with an articulated structure
Created an animation module by interpolating key frames of joint poses and diffuse, specular, and ambient material components (sketched below)
Made an interactive game in which multiple frogs chase user-controlled insects
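A minimal sketch of key-frame interpolation for one animation channel (a joint angle or a material component); the channel values in the usage example are made up:

    import numpy as np

    def interpolate_keyframes(key_times, key_values, t):
        """Linearly interpolate an animation channel (e.g. a joint angle or a
        diffuse/specular/ambient component) between its key frames."""
        return np.interp(t, key_times, key_values)

    # Example: a frog leg joint swinging between key poses over one second.
    times = [0.0, 0.5, 1.0]
    angles = [0.0, 45.0, 0.0]
    frame_angles = [interpolate_keyframes(times, angles, t)
                    for t in np.linspace(0.0, 1.0, 24)]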
IIT Delhi
Implemented recursive ray tracing to render an image of a virtually generated 3D model by tracing the path of light through each pixel (sketched below)
Implemented a global illumination model with reflection, refraction, and shadows, and a local illumination model with diffuse, specular, and ambient components
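A minimal sketch of the recursive tracing loop, written against an assumed scene interface (scene.intersect, scene.shade_local, hit.normal, hit.material are placeholders, and refraction is omitted for brevity):

    import numpy as np

    MAX_DEPTH = 3

    def trace(ray_origin, ray_dir, scene, depth=0):
        """Recursively trace a ray: local (diffuse + specular + ambient)
        shading with shadows at the nearest hit, plus a reflected ray
        up to MAX_DEPTH bounces."""
        hit = scene.intersect(ray_origin, ray_dir)   # nearest intersection or None
        if hit is None or depth >= MAX_DEPTH:
            return scene.background
        color = scene.shade_local(hit)               # local illumination + shadows
        refl_dir = ray_dir - 2.0 * np.dot(ray_dir, hit.normal) * hit.normal
        color += hit.material.reflectivity * trace(
            hit.point + 1e-4 * hit.normal, refl_dir, scene, depth + 1)
        return color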
IIT Delhi
Implemented a 2D feature detection and description algorithm in nonlinear scale spaces using nonlinear diffusion filtering
The nonlinear scale space is built using Additive Operator Splitting (AOS) and variable-conductance diffusion (sketched below)
Feature detection is done using the Hessian at multiple scale levels, and feature description uses SURF descriptors
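A simplified sketch of one variable-conductance (Perona-Malik style) diffusion step; the actual scale space used an AOS scheme, which remains stable for much larger time steps, so this explicit step and its parameters only illustrate the idea:

    import numpy as np

    def diffusion_step(image, tau=0.2, k=0.05):
        """One explicit step of variable-conductance diffusion on a
        grayscale image in [0, 1]: smoothing is suppressed where the
        gradient magnitude is large, so edges are preserved."""
        gy, gx = np.gradient(image)
        conductance = 1.0 / (1.0 + (gx**2 + gy**2) / k**2)   # small near edges
        flux_y, flux_x = conductance * gy, conductance * gx
        divergence = np.gradient(flux_y, axis=0) + np.gradient(flux_x, axis=1)
        return image + tau * divergence

Repeating such steps with increasing diffusion time produces the nonlinear scale space in which the multi-scale Hessian detector is then evaluated.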