Hi, I'm Soubhik, a Research Scientist @GenAI at Meta AI.

I am a Research Scientist in GenAI at Meta AI. Previously, I completed my PhD under the supervision of Michael J. Black, where I focused on building generative models for digital humans in 2D and 3D. Throughout my research journey, I have had the opportunity to collaborate closely with Timo Bolkart and Justus Thies. During my PhD, I also did two research internships (roughly two years in total): one with Google AR hosted by Thabo Beeler, and the other with Amazon Research mentored by Javier Romero.

Updates:

Selected Projects 

Generative 3D Neural Avatars

SCULPT is a novel 3D generative model for clothed and textured human meshes that uses deep neural networks to represent the distribution of geometry and appearance. Since large datasets of textured 3D meshes are scarce, SCULPT combines a medium-sized 3D scan dataset with a large-scale 2D image dataset in an unpaired learning procedure, and it leverages large-scale language models for better disentanglement. The method is validated on the SCULPT dataset and compared to state-of-the-art 3D generative models for clothed human bodies. CVPR 2024

Code and paper link: SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart
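
As a hedged illustration of the unpaired training loop, the sketch below generates clothing geometry, conditions a texture generator on it, renders the textured body, and scores the rendering with a 2D image discriminator. All names (G_geo, G_tex, D_2d, render) are hypothetical placeholders based on my reading of the summary above, not the released SCULPT code.

```python
import torch

def unpaired_step(G_geo, G_tex, D_2d, render, z_geo, z_tex, opt):
    """One generator update in a SCULPT-style unpaired setup (illustrative).

    G_geo:  z -> clothing geometry (e.g. vertex displacements), learned from 3D scans
    G_tex:  (z, geometry) -> texture, supervised only through 2D images
    D_2d:   image discriminator trained on a large 2D photo collection
    render: differentiable renderer from (geometry, texture) to an image
    """
    disp = G_geo(z_geo)                  # sample clothing geometry
    tex = G_tex(z_tex, disp.detach())    # texture conditioned on the geometry
    fake = render(disp, tex)             # render the textured, clothed body
    loss_g = -D_2d(fake).mean()          # adversarial term from the 2D discriminator
    opt.zero_grad()
    loss_g.backward()
    opt.step()
    return loss_g
```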

Unconditional Video Generation

We present an efficient video generative model that captures long-term dependencies using a hybrid tri-plane representation and a single latent code, reducing computational complexity by 50%. Enhanced with an optical-flow-based GAN module, the approach generates high-fidelity videos at 256 × 256 resolution and 30 fps. The model's efficacy is validated across multiple datasets. arXiv 2023

Paper link: RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

Partha Ghosh*, Soubhik Sanyal*, Cordelia Schmid, Bernhard Schölkopf (* joint first authors)
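
The tri-plane idea can be sketched as follows: the (x, y, t) video volume is factorized into three 2D feature planes, and a space-time point is decoded from a bilinear sample of each plane. This is a minimal sketch of the representation only; the plane naming and the sum aggregation are my assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, coords):
    """Query a tri-plane video representation at continuous (x, y, t) points.

    planes: dict with 'xy', 'xt', 'yt' feature maps, each of shape (1, C, H, W)
    coords: (N, 3) points in [-1, 1]^3, ordered as (x, y, t)
    """
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    pairs = {'xy': (x, y), 'xt': (x, t), 'yt': (y, t)}
    feats = []
    for name, (u, v) in pairs.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)       # (1, N, 1, 2)
        f = F.grid_sample(planes[name], grid, align_corners=True)  # (1, C, N, 1)
        feats.append(f.squeeze(-1).squeeze(0).t())                 # (N, C)
    return feats[0] + feats[1] + feats[2]  # simple sum; other fusions are possible
```

A small decoder network would then map the fused per-point feature to a pixel value.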

Synthesizing Human Images

SPICE is a self-supervised framework that synthesizes images of a person in novel poses from a single image, addressing the challenges and costs of obtaining paired training data. It leverages 3D information about the human body to maintain realism and consistency in generated images. SPICE outperforms previous unsupervised methods, achieving state-of-the-art performance on the DeepFashion dataset, and can generate temporally coherent videos from static images and pose sequences. ICCV 2021 (Oral)


Soubhik Sanyal, Alex Vorobiov, Timo Bolkart, Matt Loper, Betty Mohler, Larry Davis, Javier Romero, Michael J. Black
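
One way to see how training works without paired data is a cycle-consistency sketch: re-pose the person, map the result back to the source pose, and require the round trip to reproduce the input. Here G is a hypothetical generator interface; the 3D body reasoning that SPICE uses to keep results realistic is not shown.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G, image, pose_src, pose_tgt):
    """Self-supervised round-trip loss in the spirit of unpaired pose transfer.

    G(image, target_pose) -> image is an assumed generator signature.
    """
    fake_tgt = G(image, pose_tgt)    # person re-rendered in the target pose
    cycled = G(fake_tgt, pose_src)   # mapped back to the original pose
    return F.l1_loss(cycled, image)  # the round trip should recover the input
```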

Reconstructing 3D Pose & Shape

RingNet is a neural network that estimates 3D face shape, pose, and expression from a single image without 2D-to-3D supervision. It leverages multiple images of an individual and automatically detected 2D face features, using a novel loss function that encourages a consistent face shape across images of the same identity. The model achieves expression invariance using the FLAME face representation. RingNet outperforms methods trained with 3D supervision and is evaluated on a new "not quite in-the-wild" (NoW) database with 3D head scans and high-resolution images. CVPR 2019


Soubhik Sanyal, Timo Bolkart, Haiwen Feng, Michael J. Black
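
The shape consistency ("ring") idea can be sketched as a margin loss: shape codes predicted from images of the same subject should be closer to one another than to the code from a different subject. Below is a minimal PyTorch sketch; the tensor shapes and margin value are illustrative, not the paper's exact formulation.

```python
import torch

def ring_loss(same_shapes, diff_shape, margin=0.5):
    """Margin loss over a 'ring' of images (illustrative RingNet-style sketch).

    same_shapes: (K, D) shape codes from K images of one subject
    diff_shape:  (D,)   shape code from an image of a different subject
    """
    loss, count = 0.0, 0
    K = same_shapes.shape[0]
    for i in range(K):
        for j in range(i + 1, K):
            pos = (same_shapes[i] - same_shapes[j]).pow(2).sum()  # same identity
            neg = (same_shapes[i] - diff_shape).pow(2).sum()      # different identity
            loss = loss + torch.relu(pos - neg + margin)          # hinge on the gap
            count += 1
    return loss / max(count, 1)
```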

Deep 3D Geometry Learning

The CoMA model is a versatile 3D face representation method that uses spectral convolutions on mesh surfaces for computer vision and graphics applications. It overcomes the limitations of traditional linear models by employing mesh sampling operations for a hierarchical representation. Trained on 20,466 meshes from 12 subjects, CoMA outperforms state-of-the-art models with 50% lower reconstruction error and 75% fewer parameters, demonstrating its effectiveness in capturing non-linear facial variations. ECCV 2018


Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, Michael J. Black
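
The building block is a spectral convolution on the mesh graph, computed with the Chebyshev recurrence T_0(L)x = x, T_1(L)x = Lx, T_k(L)x = 2L T_{k-1}(L)x - T_{k-2}(L)x. Below is a minimal PyTorch sketch of such a layer; the mesh down- and up-sampling operations that give CoMA its hierarchy are omitted, and L is assumed to be the rescaled mesh Laplacian.

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    """Chebyshev spectral graph convolution on mesh vertices (sketch)."""

    def __init__(self, in_ch, out_ch, K):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(K, in_ch, out_ch))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.K = K

    def forward(self, x, L):
        # x: (V, in_ch) per-vertex features; L: (V, V) rescaled mesh Laplacian
        Tx_prev, out = x, x @ self.weight[0]
        if self.K > 1:
            Tx_curr = L @ x
            out = out + Tx_curr @ self.weight[1]
            for k in range(2, self.K):
                Tx_next = 2 * (L @ Tx_curr) - Tx_prev   # Chebyshev recurrence
                out = out + Tx_next @ self.weight[k]
                Tx_prev, Tx_curr = Tx_curr, Tx_next
        return out + self.bias
```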

Face & Object Recognition

The Discriminative Pose-Free Descriptor (DPFD) tackles pose-invariant matching in applications such as face recognition and object matching. Using training examples at a few representative poses, virtual intermediate pose subspaces are generated. Images are then projected onto these subspaces, and a discriminative transform is applied to create a single feature vector (the DPFD) for classification. The effectiveness of this approach is demonstrated through extensive experiments on the Multi-PIE and Surveillance Cameras Face datasets, and its generalizability beyond faces is shown through experiments on matching objects across viewpoints. ICCV 2015

Soubhik Sanyal, Sivaram Prasad Mudunuri, Soma Biswas
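
The matching pipeline can be sketched in a few lines of NumPy: project a raw feature onto each pose-specific subspace, stack the projection coefficients, and apply a learned discriminative transform to get one pose-free descriptor. Variable names and the final normalization are my illustrative choices; the subspace bases and the transform would come from training.

```python
import numpy as np

def dpfd(feature, subspaces, W):
    """Compute a pose-free descriptor from one image feature (sketch).

    feature:   (D,)  raw image feature
    subspaces: list of K orthonormal bases, each (D, d), one per virtual pose
    W:         (K*d, m) learned discriminative transform
    """
    coeffs = [U.T @ feature for U in subspaces]  # coordinates in each pose subspace
    stacked = np.concatenate(coeffs)             # (K*d,) stacked projections
    desc = W.T @ stacked                         # (m,) discriminative descriptor
    return desc / (np.linalg.norm(desc) + 1e-8)  # unit-normalize for matching
```

Matching between a probe and a gallery image is then a nearest-neighbor comparison (e.g. cosine distance) between their descriptors.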


Please check my Google Scholar for a full and updated list of my publications.

Please follow me on Twitter and GitHub.