Hi, I'm Soubhik, a Research Scientist @GenAI at Meta AI.

I am a Research Scientist in GenAI at Meta AI. Previously, I completed my PhD under the supervision of Michael J. Black, where I focused on building generative models for digital humans in 2D and 3D. Throughout my research journey, I have had the opportunity to collaborate closely with Timo Bolkart and Justus Thies. During my PhD, I also did two research internships (roughly two years in total): one with Google AR hosted by Thabo Beeler, and the other with Amazon Research mentored by Javier Romero.

Updates:

Selected Projects 

Generative 3D Neural Avatars

SCULPT is a novel 3D generative model for clothed and textured human meshes that uses deep neural networks to represent the distribution of geometry and appearance. Since large datasets of textured 3D meshes are scarce, SCULPT combines a medium-sized 3D scan dataset with a large-scale 2D image dataset in an unpaired learning procedure, and it leverages large-scale language models for better disentanglement. The method is validated on the SCULPT dataset and compared to state-of-the-art 3D generative models for clothed human bodies. CVPR 2024

Code and paper link: SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart
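
As a hedged illustration of the unpaired training loop, the sketch below generates clothing geometry, conditions a texture generator on it, renders the textured body, and scores the rendering with a 2D image discriminator. All names (G_geo, G_tex, D_2d, render) are hypothetical placeholders based on my reading of the summary above, not the released SCULPT code.

```python
import torch

def unpaired_step(G_geo, G_tex, D_2d, render, z_geo, z_tex, opt):
    """One generator update in a SCULPT-style unpaired setup (illustrative).

    G_geo:  z -> clothing geometry (e.g. vertex displacements), learned from 3D scans
    G_tex:  (z, geometry) -> texture, supervised only through 2D images
    D_2d:   image discriminator trained on a large 2D photo collection
    render: differentiable renderer from (geometry, texture) to an image
    """
    disp = G_geo(z_geo)                  # sample clothing geometry
    tex = G_tex(z_tex, disp.detach())    # texture conditioned on the geometry
    fake = render(disp, tex)             # render the textured, clothed body
    loss_g = -D_2d(fake).mean()          # adversarial term from the 2D discriminator
    opt.zero_grad()
    loss_g.backward()
    opt.step()
    return loss_g
```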

Unconditional Video Generation

We present an efficient video generative model that captures long-term dependencies using a hybrid tri-plane representation and a single latent code, reducing computational complexity by 50%. Enhanced with an optical-flow-based GAN module, the approach generates high-fidelity videos at 256 × 256 resolution and 30 fps. The model's efficacy is validated across multiple datasets. arXiv 2023

Paper link: RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

Partha Ghosh*, Soubhik Sanyal*, Cordelia Schmid, Bernhard Schölkopf (* joint first authors)
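
The tri-plane idea can be sketched as follows: the (x, y, t) video volume is factorized into three 2D feature planes, and a space-time point is decoded from a bilinear sample of each plane. This is a minimal sketch of the representation only; the plane naming and the sum aggregation are my assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, coords):
    """Query a tri-plane video representation at continuous (x, y, t) points.

    planes: dict with 'xy', 'xt', 'yt' feature maps, each of shape (1, C, H, W)
    coords: (N, 3) points in [-1, 1]^3, ordered as (x, y, t)
    """
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    pairs = {'xy': (x, y), 'xt': (x, t), 'yt': (y, t)}
    feats = []
    for name, (u, v) in pairs.items():
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)       # (1, N, 1, 2)
        f = F.grid_sample(planes[name], grid, align_corners=True)  # (1, C, N, 1)
        feats.append(f.squeeze(-1).squeeze(0).t())                 # (N, C)
    return feats[0] + feats[1] + feats[2]  # simple sum; other fusions are possible
```

A small decoder network would then map the fused per-point feature to a pixel value.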

Synthesizing Human Images

SPICE is a self-supervised framework that synthesizes images of a person in novel poses from a single image, addressing the challenges and costs of obtaining paired training data. It leverages 3D information about the human body to maintain realism and consistency in generated images. SPICE outperforms previous unsupervised methods, achieving state-of-the-art performance on the DeepFashion dataset, and can generate temporally coherent videos from static images and pose sequences. ICCV 2021 (Oral)


Soubhik Sanyal, Alex Vorobiov, Timo Bolkart, Matt Loper, Betty Mohler, Larry Davis, Javier Romero, Michael J. Black
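
One way to see how training works without paired data is a cycle-consistency sketch: re-pose the person, map the result back to the source pose, and require the round trip to reproduce the input. Here G is a hypothetical generator interface; the 3D body reasoning that SPICE uses to keep results realistic is not shown.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G, image, pose_src, pose_tgt):
    """Self-supervised round-trip loss in the spirit of unpaired pose transfer.

    G(image, target_pose) -> image is an assumed generator signature.
    """
    fake_tgt = G(image, pose_tgt)    # person re-rendered in the target pose
    cycled = G(fake_tgt, pose_src)   # mapped back to the original pose
    return F.l1_loss(cycled, image)  # the round trip should recover the input
```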

Reconstructing 3D Pose & Shape

RingNet is a neural network that estimates 3D face shape, pose, and expression from a single image without 2D-to-3D supervision. It leverages multiple images of an individual and automatically detected 2D face features, using a novel loss function that encourages a consistent face shape across images of the same identity. The model achieves expression invariance using the FLAME face representation. RingNet outperforms methods trained with 3D supervision and is evaluated on a new "not quite in-the-wild" (NoW) database with 3D head scans and high-resolution images. CVPR 2019


Soubhik Sanyal, Timo Bolkart, Haiwen Feng, Michael J. Black
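
The shape consistency ("ring") idea can be sketched as a margin loss: shape codes predicted from images of the same subject should be closer to one another than to the code from a different subject. Below is a minimal PyTorch sketch; the tensor shapes and margin value are illustrative, not the paper's exact formulation.

```python
import torch

def ring_loss(same_shapes, diff_shape, margin=0.5):
    """Margin loss over a 'ring' of images (illustrative RingNet-style sketch).

    same_shapes: (K, D) shape codes from K images of one subject
    diff_shape:  (D,)   shape code from an image of a different subject
    """
    loss, count = 0.0, 0
    K = same_shapes.shape[0]
    for i in range(K):
        for j in range(i + 1, K):
            pos = (same_shapes[i] - same_shapes[j]).pow(2).sum()  # same identity
            neg = (same_shapes[i] - diff_shape).pow(2).sum()      # different identity
            loss = loss + torch.relu(pos - neg + margin)          # hinge on the gap
            count += 1
    return loss / max(count, 1)
```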

Deep 3D Geometry Learning

The CoMA model is a versatile 3D face representation method that uses spectral convolutions on mesh surfaces for computer vision and graphics applications. It overcomes the limitations of traditional linear models by employing mesh sampling operations for a hierarchical representation. Trained on 20,466 meshes from 12 subjects, CoMA outperforms state-of-the-art models with 50% lower reconstruction error and 75% fewer parameters, demonstrating its effectiveness in capturing non-linear facial variations. ECCV 2018


Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, Michael J. Black
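
The building block is a spectral convolution on the mesh graph, computed with the Chebyshev recurrence T_0(L)x = x, T_1(L)x = Lx, T_k(L)x = 2L T_{k-1}(L)x - T_{k-2}(L)x. Below is a minimal PyTorch sketch of such a layer; the mesh down- and up-sampling operations that give CoMA its hierarchy are omitted, and L is assumed to be the rescaled mesh Laplacian.

```python
import torch
import torch.nn as nn

class ChebConv(nn.Module):
    """Chebyshev spectral graph convolution on mesh vertices (sketch)."""

    def __init__(self, in_ch, out_ch, K):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(K, in_ch, out_ch))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        self.K = K

    def forward(self, x, L):
        # x: (V, in_ch) per-vertex features; L: (V, V) rescaled mesh Laplacian
        Tx_prev, out = x, x @ self.weight[0]
        if self.K > 1:
            Tx_curr = L @ x
            out = out + Tx_curr @ self.weight[1]
            for k in range(2, self.K):
                Tx_next = 2 * (L @ Tx_curr) - Tx_prev   # Chebyshev recurrence
                out = out + Tx_next @ self.weight[k]
                Tx_prev, Tx_curr = Tx_curr, Tx_next
        return out + self.bias
```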

Face & Object Recognition

The Discriminative Pose-Free Descriptor (DPFD) tackles pose-invariant matching in applications such as face recognition and object matching. Using training examples at a few representative poses, virtual intermediate pose subspaces are generated. Images are then projected onto these subspaces, and a discriminative transform is applied to create a single feature vector (the DPFD) for classification. The effectiveness of this approach is demonstrated through extensive experiments on the Multi-PIE and Surveillance Cameras Face datasets, and its generalizability beyond faces is shown through experiments on matching objects across viewpoints. ICCV 2015

Soubhik Sanyal, Sivaram Prasad Mudunuri, Soma Biswas
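
The matching pipeline can be sketched in a few lines of NumPy: project a raw feature onto each pose-specific subspace, stack the projection coefficients, and apply a learned discriminative transform to get one pose-free descriptor. Variable names and the final normalization are my illustrative choices; the subspace bases and the transform would come from training.

```python
import numpy as np

def dpfd(feature, subspaces, W):
    """Compute a pose-free descriptor from one image feature (sketch).

    feature:   (D,)  raw image feature
    subspaces: list of K orthonormal bases, each (D, d), one per virtual pose
    W:         (K*d, m) learned discriminative transform
    """
    coeffs = [U.T @ feature for U in subspaces]  # coordinates in each pose subspace
    stacked = np.concatenate(coeffs)             # (K*d,) stacked projections
    desc = W.T @ stacked                         # (m,) discriminative descriptor
    return desc / (np.linalg.norm(desc) + 1e-8)  # unit-normalize for matching
```

Matching between a probe and a gallery image is then a nearest-neighbor comparison (e.g. cosine distance) between their descriptors.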


Please check my Google Scholar for a full and updated list of my publications.

Please follow me on Twitter and GitHub.