Splatent: Splatting Diffusion Latents for Novel View Synthesis

Amazon Prime Video     Tel-Aviv University

TLDR: We show that the VAE latent space used by current diffusion models is not 3D-consistent. To restore the details lost during 3D reconstruction, we propose a 2D diffusion-based method that recovers the missing information.

Splatent Teaser

Abstract

Radiance field representations have recently been explored in the latent space of the VAEs commonly used by diffusion models. This direction offers efficient rendering and seamless integration with diffusion-based pipelines. However, these methods face a fundamental limitation: the VAE latent space lacks multi-view consistency, leading to blurred textures and missing details during 3D reconstruction. Existing approaches attempt to address this by fine-tuning the VAE, at the cost of reconstruction quality, or by relying on pretrained diffusion models to recover fine-grained details, at the risk of hallucinations. We present Splatent, a diffusion-based enhancement framework designed to operate on top of 3D Gaussian Splatting (3DGS) in the latent space of VAEs. Our key insight departs from the conventional 3D-centric view: rather than reconstructing fine-grained details in 3D space, we recover them in 2D from the input views through multi-view attention mechanisms. This approach preserves the reconstruction quality of pretrained VAEs while achieving faithful detail recovery. Evaluated across multiple benchmarks, Splatent establishes a new state of the art for VAE latent radiance field reconstruction. We further demonstrate that integrating our method with existing feed-forward frameworks consistently improves detail preservation, opening new possibilities for high-quality sparse-view 3D reconstruction. We will release our code upon publication.

360° Novel View Synthesis

Qualitative Comparisons

Multi-View Inconsistencies in VAE Latents

Our work is driven by the observation that existing VAE models, such as the Stable Diffusion VAE, produce latent representations that lack 3D consistency, limiting their direct use for 3D scene reconstruction and novel view synthesis. This limitation arises from two related spectral deficiencies. First, the latent space fails to maintain equivariance under basic spatial transformations such as scaling and rotation. Second, and more importantly, the view-dependent high-frequency components that are essential for accurate decoding exhibit the most severe 3D inconsistencies across viewpoints, unlike in RGB space. This spectral inconsistency becomes particularly problematic when optimizing 3D Gaussian splatting in latent space: the inconsistent high frequencies average out across views, leaving only low-frequency content.
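The lack of rotation equivariance is easy to verify directly: encode a rotated image and compare the result against rotating the latent of the original image. Below is a minimal sketch using the diffusers AutoencoderKL; the checkpoint, input file, and 90° rotation are illustrative choices, not the paper's exact protocol.

import torch
from diffusers import AutoencoderKL
from diffusers.utils import load_image
from torchvision import transforms

# Illustrative checkpoint; any SD-style VAE shows the same behavior.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

prep = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),  # square crop so a 90-degree rotation preserves shape
    transforms.ToTensor(),
])
img = load_image("example.png")    # hypothetical input image
x = prep(img).unsqueeze(0) * 2 - 1  # (1, 3, 512, 512) in [-1, 1]

with torch.no_grad():
    z = vae.encode(x).latent_dist.mean                                   # encode, then rotate
    z_rot = torch.rot90(z, 1, (-2, -1))
    z_of_rot = vae.encode(torch.rot90(x, 1, (-2, -1))).latent_dist.mean  # rotate, then encode

# For a rotation-equivariant latent space, the two latents would match.
print((z_rot - z_of_rot).abs().mean().item())  # noticeably non-zero in practice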

Spectral Analysis

VAE latent spectral analysis: magnitude spectra of latent images produced by different VAE models. During 3DGS optimization, the inconsistent high frequencies average out, leaving only low-frequency components and causing blurry decoded images.
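Spectra like those in the figure can be reproduced with a straightforward FFT over the latent channels. A short sketch follows; averaging over channels and the log scaling are our assumptions about a standard magnitude-spectrum visualization:

import torch

def log_magnitude_spectrum(latent: torch.Tensor) -> torch.Tensor:
    # latent: (C, h, w) VAE latent image -> (h, w) log-magnitude spectrum.
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    return torch.log1p(spec.abs()).mean(dim=0)  # average over latent channels

After fftshift, high frequencies sit at the borders of the spectrum; when the per-view latents disagree there, 3DGS optimization effectively low-pass filters them.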

Method

We optimize 3D Gaussian splatting in VAE latent space, then refine rendered latents using a diffusion-based module. Our key insight: rather than reconstructing details in 3D, we recover them in 2D from input views through multi-view attention mechanisms. We condition a diffusion model on reference views arranged in a spatial grid. During denoising, multi-view attention propagates high-frequency details from references to the rendered latent, recovering lost information while keeping the VAE frozen.
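To make the multi-view attention step concrete, the sketch below shows one way a rendered latent can attend to reference-view latents; it is a simplified stand-in, not the paper's exact layer. The paper arranges references in a spatial grid, which, up to positional encoding, is equivalent to concatenating their tokens into one attention sequence as done here; all names and sizes are illustrative.

import torch
import torch.nn as nn

def multi_view_attention(rendered, references, attn):
    # Hedged sketch of one multi-view attention step (not the paper's exact layer).
    # rendered:   (B, C, h, w)    noisy rendered latent being denoised
    # references: (B, N, C, h, w) latents of the N input views
    # attn:       nn.MultiheadAttention with batch_first=True, embed_dim=C
    B, C, h, w = rendered.shape
    # Flatten everything into one token sequence so self-attention lets the
    # rendered view's tokens attend to (and copy detail from) reference tokens.
    tgt = rendered.permute(0, 2, 3, 1).reshape(B, h * w, C)
    refs = references.permute(0, 1, 3, 4, 2).reshape(B, -1, C)
    tokens = torch.cat([tgt, refs], dim=1)
    out, _ = attn(tokens, tokens, tokens)
    # Keep only the target-view tokens; the references serve as conditioning.
    return out[:, : h * w].reshape(B, h, w, C).permute(0, 3, 1, 2)

# Example with SD-style 4-channel latents (sizes are illustrative):
attn = nn.MultiheadAttention(embed_dim=4, num_heads=2, batch_first=True)
x = multi_view_attention(torch.randn(1, 4, 32, 32), torch.randn(1, 3, 4, 32, 32), attn)
print(x.shape)  # torch.Size([1, 4, 32, 32])

Because the references stay fixed as clean input-view latents and only the rendered view is denoised, attention of this form can propagate high-frequency detail from the references while the VAE remains frozen.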

Splatent Architecture

Splatent Architecture: Our two-stage pipeline first optimizes latent 3DGS, then refines rendered latents using diffusion with multi-view attention to recover high-frequency details.

BibTeX

@article{splatent2025,
      title={Splatent: Splatting Diffusion Latents for Novel View Synthesis}, 
      author={Or Hirschorn and Omer Sela and Inbar Huberman-Spiegelglas and Netalee Efrat and Eli Alshan and Ianir Ideses and Frederic Devernay and Yochai Zvik and Lior Fritz},
      year={2025},
      eprint={2512.09923},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.09923}, 
}