Nimrod Berman*²¹, Omkar Joglekar*³¹, Eitan Kosman¹, Dotan Di Castro¹, Omri Azencot²
¹ Bosch AI Center Haifa, ² Ben Gurion University, ³ Technical University of Munich
*equal contribution
Diffusion models have demonstrated remarkable success in generative tasks across images, audio, and text. However, applying them to modality translation, i.e., converting data from one modality to another (e.g., images to 3D shapes, low-resolution to high-resolution images), remains limited by assumptions such as matching dimensionality or architecture-specific designs.
We introduce LDDBM, a general-purpose framework for modality translation using a latent extension of Denoising Diffusion Bridge Models (DDBMs). Our method operates in a shared latent space, avoids restrictive assumptions, and introduces two key innovations:
A contrastive loss that enforces semantic alignment across modalities
A predictive loss that directly improves translation quality
Our model performs well across diverse tasks, including multi-view to 3D generation, zero-shot image super-resolution, and scene occupancy prediction, establishing a new state of the art for general modality translation.
Method
We extend DDBMs into a latent diffusion bridge, enabling translation between modalities of different shapes and semantics. The architecture consists of the following components (a minimal sketch follows the list):
Modality-specific encoders and decoders
A shared latent space where translation occurs via a learned diffusion bridge
An encoder-decoder Transformer designed specifically for bridging heterogeneous representations
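Below is a minimal PyTorch sketch of these components. The module names, hidden sizes, token counts, and the way the bridge is conditioned on the source latent and diffusion time are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class MLPEncoder(nn.Module):
    """Maps a flattened modality-specific input into the shared latent space."""
    def __init__(self, in_dim, latent_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x):
        return self.net(x)


class MLPDecoder(nn.Module):
    """Maps a shared latent back to the target modality's output space."""
    def __init__(self, latent_dim, out_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, z):
        return self.net(z)


class LatentBridge(nn.Module):
    """Encoder-decoder Transformer that denoises a latent along the bridge,
    conditioned on the source latent and the diffusion time t (illustrative)."""
    def __init__(self, latent_dim, n_tokens=16, n_layers=4, n_heads=8):
        super().__init__()
        self.n_tokens, self.latent_dim = n_tokens, latent_dim
        self.to_tokens = nn.Linear(latent_dim, n_tokens * latent_dim)
        self.time_embed = nn.Linear(1, latent_dim)
        self.transformer = nn.Transformer(
            d_model=latent_dim, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.from_tokens = nn.Linear(n_tokens * latent_dim, latent_dim)

    def forward(self, z_t, z_src, t):
        b = z_t.shape[0]
        src = self.to_tokens(z_src).view(b, self.n_tokens, self.latent_dim)
        src = src + self.time_embed(t.view(b, 1)).unsqueeze(1)  # inject time
        tgt = self.to_tokens(z_t).view(b, self.n_tokens, self.latent_dim)
        out = self.transformer(src, tgt)  # noisy latent cross-attends to source
        return self.from_tokens(out.reshape(b, -1))
```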
Our training loss combines three terms (a sketch of the combined objective follows the list):
A bridge loss (score matching)
A predictive loss to reconstruct the target
A contrastive loss (InfoNCE) to align latent pairs
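A simplified sketch of how these terms could be combined is shown below. The exact DDBM bridge parameterization (noise schedule, score target, and weighting) is abbreviated here to a noisy interpolation with an endpoint-prediction MSE, and the loss weights `lam_pred` and `lam_nce` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce(z_src, z_tgt, temperature=0.1):
    """InfoNCE over a batch: each source latent should match its paired target."""
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(z_src.shape[0], device=z_src.device)
    return F.cross_entropy(logits, labels)


def training_loss(enc_src, enc_tgt, dec_tgt, bridge, x_src, x_tgt,
                  lam_pred=1.0, lam_nce=0.1):
    z_src, z_tgt = enc_src(x_src), enc_tgt(x_tgt)

    # Sample a point on the bridge between the paired latents. A noisy linear
    # interpolation stands in for the DDBM forward bridge in this sketch.
    t = torch.rand(z_src.shape[0], 1, device=z_src.device)
    noise = torch.randn_like(z_src)
    z_t = (1 - t) * z_tgt + t * z_src + torch.sqrt(t * (1 - t)) * noise

    # Bridge loss (score matching, abbreviated here to endpoint prediction).
    z_pred = bridge(z_t, z_src, t)
    loss_bridge = F.mse_loss(z_pred, z_tgt)

    # Predictive loss: decode the prediction and reconstruct the target directly.
    loss_pred = F.mse_loss(dec_tgt(z_pred), x_tgt)

    # Contrastive loss: semantically align paired latents across modalities.
    loss_nce = info_nce(z_src, z_tgt)

    return loss_bridge + lam_pred * loss_pred + lam_nce * loss_nce
```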
Translate between any modalities, effortlessly
With our framework, general modality translation is as easy as 1-2-3 (a usage sketch follows the steps):
1️⃣ Gather paired data from your source and target domains
2️⃣ Plug in simple MLP-based encoders and decoders; no fancy architectures needed
3️⃣ Run our method and watch the bridge form. It's that simple!
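A hypothetical end-to-end usage sketch, reusing the modules and loss function from the sketches above; the dimensions, optimizer settings, and dummy data loader are placeholders, and no released `lddbm` API is implied.

```python
import torch

latent_dim = 256
enc_src = MLPEncoder(in_dim=4 * 32 * 32, latent_dim=latent_dim)  # e.g. flattened multi-view images
enc_tgt = MLPEncoder(in_dim=32 ** 3, latent_dim=latent_dim)      # e.g. flattened voxel grids
dec_tgt = MLPDecoder(latent_dim=latent_dim, out_dim=32 ** 3)
bridge = LatentBridge(latent_dim=latent_dim)

params = (list(enc_src.parameters()) + list(enc_tgt.parameters()) +
          list(dec_tgt.parameters()) + list(bridge.parameters()))
opt = torch.optim.AdamW(params, lr=1e-4)

# Dummy paired loader for illustration; replace with a DataLoader of real
# (source, target) pairs (step 1 above).
paired_loader = [(torch.randn(8, 4 * 32 * 32), torch.randn(8, 32 ** 3))
                 for _ in range(10)]

for x_src, x_tgt in paired_loader:
    # Steps 2-3: plug the MLP encoders/decoders into the latent bridge and train.
    loss = training_loss(enc_src, enc_tgt, dec_tgt, bridge, x_src, x_tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```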
Dataset: ShapeNet
Metrics: 1-NNA ↓, IoU ↑
LDDBM outperformed EDM, DiT, and SiT baselines in both generative fidelity and shape accuracy.
Dataset: FFHQ → CelebA-HQ
Metrics: PSNR ↑, SSIM ↑, LPIPS ↓
LDDBM produced sharper, more realistic face reconstructions than DiWa and other baselines.