Nimrod Berman*²¹, Omkar Joglekar*³¹, Eitan Kosman¹, Dotan Di Castro¹, Omri Azencot²
¹ Bosch AI Center Haifa, ² Ben Gurion University, ³ Technical University of Munich
*equal contribution
Diffusion models have demonstrated remarkable success in generative tasks across images, audio, and text. However, applying them to modality translation, i.e., converting data from one modality to another (e.g., images to 3D shapes, low-resolution to high-resolution images), remains limited by assumptions such as matching dimensionality or architecture-specific designs.
We introduce LDDBM, a general-purpose framework for modality translation using a latent extension of Denoising Diffusion Bridge Models (DDBMs). Our method operates in a shared latent space, avoids restrictive assumptions, and introduces two key innovations:
A contrastive loss that enforces semantic alignment across modalities
A predictive loss that directly improves translation quality
Our model performs well across diverse tasks, including multi-view to 3D generation, zero-shot image super-resolution, and scene occupancy prediction, establishing a new state of the art for general modality translation.
Method
We extend DDBMs into a latent diffusion bridge, enabling translation between modalities of different shapes and semantics. The architecture consists of the following components (a minimal sketch follows the list):
Modality-specific encoders and decoders
A shared latent space where translation occurs via a learned diffusion bridge
An encoder-decoder Transformer designed specifically for bridging heterogeneous representations
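Below is a minimal PyTorch sketch of these components. The module names, hidden sizes, token counts, and the way the bridge is conditioned on the source latent and diffusion time are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class MLPEncoder(nn.Module):
    """Maps a flattened modality-specific input into the shared latent space."""
    def __init__(self, in_dim, latent_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, x):
        return self.net(x)


class MLPDecoder(nn.Module):
    """Maps a shared latent back to the target modality's output space."""
    def __init__(self, latent_dim, out_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, z):
        return self.net(z)


class LatentBridge(nn.Module):
    """Encoder-decoder Transformer that denoises a latent along the bridge,
    conditioned on the source latent and the diffusion time t (illustrative)."""
    def __init__(self, latent_dim, n_tokens=16, n_layers=4, n_heads=8):
        super().__init__()
        self.n_tokens, self.latent_dim = n_tokens, latent_dim
        self.to_tokens = nn.Linear(latent_dim, n_tokens * latent_dim)
        self.time_embed = nn.Linear(1, latent_dim)
        self.transformer = nn.Transformer(
            d_model=latent_dim, nhead=n_heads,
            num_encoder_layers=n_layers, num_decoder_layers=n_layers,
            batch_first=True,
        )
        self.from_tokens = nn.Linear(n_tokens * latent_dim, latent_dim)

    def forward(self, z_t, z_src, t):
        b = z_t.shape[0]
        src = self.to_tokens(z_src).view(b, self.n_tokens, self.latent_dim)
        src = src + self.time_embed(t.view(b, 1)).unsqueeze(1)  # inject time
        tgt = self.to_tokens(z_t).view(b, self.n_tokens, self.latent_dim)
        out = self.transformer(src, tgt)  # noisy latent cross-attends to source
        return self.from_tokens(out.reshape(b, -1))
```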
Our training loss combines three terms (a sketch of the combined objective follows the list):
A bridge loss (score matching)
A predictive loss to reconstruct the target
A contrastive loss (InfoNCE) to align latent pairs
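A simplified sketch of how these terms could be combined is shown below. The exact DDBM bridge parameterization (noise schedule, score target, and weighting) is abbreviated here to a noisy interpolation with an endpoint-prediction MSE, and the loss weights `lam_pred` and `lam_nce` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def info_nce(z_src, z_tgt, temperature=0.1):
    """InfoNCE over a batch: each source latent should match its paired target."""
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature  # (B, B) similarity matrix
    labels = torch.arange(z_src.shape[0], device=z_src.device)
    return F.cross_entropy(logits, labels)


def training_loss(enc_src, enc_tgt, dec_tgt, bridge, x_src, x_tgt,
                  lam_pred=1.0, lam_nce=0.1):
    z_src, z_tgt = enc_src(x_src), enc_tgt(x_tgt)

    # Sample a point on the bridge between the paired latents. A noisy linear
    # interpolation stands in for the DDBM forward bridge in this sketch.
    t = torch.rand(z_src.shape[0], 1, device=z_src.device)
    noise = torch.randn_like(z_src)
    z_t = (1 - t) * z_tgt + t * z_src + torch.sqrt(t * (1 - t)) * noise

    # Bridge loss (score matching, abbreviated here to endpoint prediction).
    z_pred = bridge(z_t, z_src, t)
    loss_bridge = F.mse_loss(z_pred, z_tgt)

    # Predictive loss: decode the prediction and reconstruct the target directly.
    loss_pred = F.mse_loss(dec_tgt(z_pred), x_tgt)

    # Contrastive loss: semantically align paired latents across modalities.
    loss_nce = info_nce(z_src, z_tgt)

    return loss_bridge + lam_pred * loss_pred + lam_nce * loss_nce
```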
Translate between any modalities, effortlessly
With our framework, general modality translation is as easy as 1-2-3 (a usage sketch follows the steps):
1️⃣ Gather paired data from your source and target domains
2️⃣ Plug in simple MLP-based encoders and decoders; no fancy architectures needed
3️⃣ Run our method and watch the bridge form. It's that simple!
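A hypothetical end-to-end usage sketch, reusing the modules and loss function from the sketches above; the dimensions, optimizer settings, and dummy data loader are placeholders, and no released `lddbm` API is implied.

```python
import torch

latent_dim = 256
enc_src = MLPEncoder(in_dim=4 * 32 * 32, latent_dim=latent_dim)  # e.g. flattened multi-view images
enc_tgt = MLPEncoder(in_dim=32 ** 3, latent_dim=latent_dim)      # e.g. flattened voxel grids
dec_tgt = MLPDecoder(latent_dim=latent_dim, out_dim=32 ** 3)
bridge = LatentBridge(latent_dim=latent_dim)

params = (list(enc_src.parameters()) + list(enc_tgt.parameters()) +
          list(dec_tgt.parameters()) + list(bridge.parameters()))
opt = torch.optim.AdamW(params, lr=1e-4)

# Dummy paired loader for illustration; replace with a DataLoader of real
# (source, target) pairs (step 1 above).
paired_loader = [(torch.randn(8, 4 * 32 * 32), torch.randn(8, 32 ** 3))
                 for _ in range(10)]

for x_src, x_tgt in paired_loader:
    # Steps 2-3: plug the MLP encoders/decoders into the latent bridge and train.
    loss = training_loss(enc_src, enc_tgt, dec_tgt, bridge, x_src, x_tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```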
Dataset: ShapeNet
Metrics: 1-NNA ↓, IoU ↑
LDDBM outperformed EDM, DiT, and SiT baselines in both generative fidelity and shape accuracy.
Dataset: FFHQ → CelebA-HQ
Metrics: PSNR ↑, SSIM ↑, LPIPS ↓
LDDBM produced sharper, more realistic face reconstructions than DiWa and other baselines.