Dense 3D Reconstruction from Monocular Video via Learning-based SLAM and Gaussian Splatting

Jiacan Li* | Yuzhen Song* | Yiming Ma*

Southern University of Science and Technology

🎥 Demo

💻 Dataset

📝 Paper

test7_3dgs_video_iter7000_11frames.mp4

test10_3dgs_video_iter7000_SOR_opReset2000.mp4

Abstract

We present SplatSLAM, an end-to-end indoor scene reconstruction pipeline that bridges the gap between learning-based dense SLAM and photo-realistic novel view synthesis. Our method leverages MASt3R-SLAM to recover accurate camera trajectories and dense geometric point clouds from monocular RGB video captured by consumer smartphones.

To address the noise and outliers in raw SLAM output, we introduce a standardized point cloud post-processing workflow, comprising Statistical Outlier Removal (SOR) and voxel downsampling, as the transition module between the two stages. The cleaned point cloud then initializes 3D Gaussian Splatting (3DGS) training, enabling real-time, high-fidelity rendering of complex indoor environments.

Evaluations on the 7-Scenes benchmark and self-collected SUSTech campus datasets demonstrate that our pipeline significantly improves localization accuracy compared to traditional feature-based SLAM, while delivering superior visual reconstruction quality.

🚀 End-to-End Workflow: Seamlessly bridging monocular mobile video input to interactive 3D digital twins.

📍 High Precision: Achieving sub-10 cm trajectory accuracy (ATE RMSE of 0.071 m to 0.098 m) on public benchmarks without camera calibration.

🧹 Noise Robustness: Integrated SOR Denoising to eliminate floaters and SLAM artifacts for cleaner reconstruction.

✨ Photo-realistic: Superior rendering quality with PSNR up to 49.46 dB for complex indoor novel view synthesis.

Methodology

Our pipeline integrates advanced transformer-based SLAM with explicit radiance field rendering to achieve high-fidelity reconstruction. The workflow is divided into four core modules:

1. Data Preprocessing & Standardization

We capture 4K monocular videos using smartphones and extract frames at a fixed rate of 10 FPS using FFmpeg. This standardized preprocessing unifies image resolution and file naming conventions, ensuring compatibility across both public benchmarks (7-Scenes) and self-collected SUSTech datasets.
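The frame extraction step can be sketched as follows. This is a minimal illustration, not the project's actual script: the function names, the `frame_%06d.png` naming pattern, and the directory layout are assumptions. Only the 10 FPS rate and the use of FFmpeg come from the text above.

```python
from pathlib import Path
import subprocess


def build_ffmpeg_command(video_path, output_dir, fps=10):
    """Build an FFmpeg command that samples a video at a fixed frame rate.

    The zero-padded file name pattern is a hypothetical convention chosen so
    that frames sort in capture order.
    """
    pattern = str(Path(output_dir) / "frame_%06d.png")
    return ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}", pattern]


def extract_frames(video_path, output_dir, fps=10):
    """Create the output directory and run FFmpeg (requires ffmpeg on PATH)."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ffmpeg_command(video_path, output_dir, fps), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without a video.
    print(" ".join(build_ffmpeg_command("room.mp4", "frames", fps=10)))
```

Keeping the command builder separate from the call to FFmpeg makes it easy to apply the same settings to the public benchmark sequences and the self-collected videos.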

2. Dense Tracking and Mapping (MASt3R-SLAM)

The system employs MASt3R (Matching And Stereo 3D Reconstruction) as the SLAM front-end. Using pixel-level transformer matching and global bundle adjustment, it jointly estimates accurate camera trajectories (poses.txt) and a dense, colored raw point cloud (raw_map.ply) without requiring depth sensors.

3. Standardized Point Cloud Refinement

To bridge the gap between noisy SLAM outputs and high-quality 3DGS training, we implement a self-designed post-processing workflow in CloudCompare:

Statistical Outlier Removal (SOR): Eliminating floating noise caused by motion blur or depth jitter.

Voxel Downsampling: Uniformly reducing point density to optimize training efficiency.

Manual Masking: Filtering irrelevant background outliers to focus on the target scene.
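The voxel downsampling step above is performed in CloudCompare; the sketch below only illustrates the underlying idea in NumPy. The function name and the choice to replace each occupied voxel with the centroid of its points are assumptions, and CloudCompare's own subsampling may select points differently.

```python
import numpy as np


def voxel_downsample(points, colors, voxel_size):
    """Replace all points falling in the same voxel by their centroid.

    points: (N, 3) float array, colors: (N, 3) float array.
    Returns the averaged positions and colors, one row per occupied voxel.
    """
    # Integer voxel coordinates of every point.
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel; `inverse` maps each point to its voxel row.
    _, inverse, counts = np.unique(
        voxel_idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)

    n_voxels = counts.shape[0]
    sum_points = np.zeros((n_voxels, 3))
    sum_colors = np.zeros((n_voxels, 3))
    # Accumulate per-voxel sums, then divide by the number of points per voxel.
    np.add.at(sum_points, inverse, points)
    np.add.at(sum_colors, inverse, colors)
    return sum_points / counts[:, None], sum_colors / counts[:, None]


if __name__ == "__main__":
    pts = np.array([[0.01, 0.01, 0.01], [0.02, 0.02, 0.02], [0.55, 0.55, 0.55]])
    cols = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    down_pts, down_cols = voxel_downsample(pts, cols, voxel_size=0.1)
    print(down_pts.shape)  # two occupied voxels remain
```

A larger voxel size gives fewer initial Gaussians and faster training, at the cost of a coarser geometric prior.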

4. 3D Gaussian Splatting & NVS

The refined point cloud serves as a geometric prior to initialize Gaussian ellipsoids. Using the splatfacto framework, the system iteratively optimizes the position, covariance, and color of each Gaussian through photometric loss minimization. The final model supports real-time interactive rendering and smooth novel view synthesis.

Experiments

Experimental Setup

Dataset:

7-Scenes Dataset: Public benchmark for indoor SLAM evaluation.

SUSTech Scene: Self-collected video from campus classrooms and kitchens.

Hardware:

GPU: NVIDIA RTX 4090 (24GB VRAM)

Software: PyTorch, Nerfstudio, CloudCompare.

Phase I: Quantitative SLAM Benchmarking

We evaluate MASt3R-SLAM on the TUM RGB-D benchmark. The system achieves a lower ATE on the 360° rotation sequence (0.071 m) than on the larger-scale Room sequence (0.098 m), which we attribute to dense feature matching.

ATE (Absolute Trajectory Error): Measures the Euclidean distance between estimated and ground-truth camera positions after alignment. A lower ATE indicates higher localization precision.

RMSE (Root Mean Square Error): The square root of the mean squared per-frame error. It summarizes the error across all frames and penalizes large deviations more heavily than a plain average.

Sim(3) Umeyama Alignment: Since monocular SLAM cannot recover the absolute scale of the world, we perform an optimal alignment (adjusting scale, rotation, and translation) before calculating the error to ensure a fair comparison.
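The three definitions above combine into a single evaluation routine. The sketch below is a generic NumPy implementation of Umeyama's closed-form similarity alignment followed by the RMSE of the residual position errors. It is not the evaluation script used for the reported numbers, and it assumes the estimated and ground-truth positions are already associated frame by frame.

```python
import numpy as np


def umeyama_sim3(src, dst):
    """Least-squares scale, rotation, translation mapping src onto dst.

    src, dst: (N, 3) arrays of corresponding camera positions.
    Returns (scale, R, t) such that dst ≈ scale * R @ src_i + t.
    """
    n = src.shape[0]
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    var_src = (src_c ** 2).sum() / n      # variance of the source points
    cov = dst_c.T @ src_c / n             # 3x3 cross-covariance matrix

    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - scale * R @ mu_src
    return scale, R, t


def ate_rmse(estimated, ground_truth):
    """ATE RMSE after Sim(3) alignment of the estimated trajectory."""
    scale, R, t = umeyama_sim3(estimated, ground_truth)
    aligned = scale * estimated @ R.T + t
    errors = np.linalg.norm(aligned - ground_truth, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    est = rng.normal(size=(50, 3))
    theta = 0.3
    Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
    gt = 2.0 * est @ Rz.T + np.array([1.0, 2.0, 3.0])
    print(ate_rmse(est, gt))  # close to zero: the trajectories differ only by a Sim(3)
```

Because the alignment absorbs any global scale, a trajectory that is correct up to scale scores an ATE near zero, which is the behavior wanted for monocular SLAM.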

Phase II: Geometric Robustness & Scene Saliency

By comparing three self-collected scenes, we observe that reconstruction quality is highly dependent on Visual Saliency. The pantry yields the densest point cloud due to rich textures, while the blank corner results in extreme sparsity and floaters.

1. Meeting room 360

2. Meeting room

3. Tea room

4. Corner

Phase III: Rendering Fidelity & Convergence

We analyze the 3DGS training evolution from 7k to 50k iterations. The visual quality reaches a "sweet spot" at 30k, while 50k iterations achieve photorealistic specular details (PSNR > 44 dB). The temporal consistency is verified across different frames of the TUM dataset.

Texture Sensitivity: From Pantry to Blank Wall

We evaluate our pipeline across environments with varying texture densities. The Pantry (high texture) yields the most complete geometry due to rich visual features. In contrast, the Blank Wall (texture-less) is a failure case for monocular SLAM: the lack of salient features leads to sparse points and floating artifacts.

Training Convergence & The "Sweet Spot"

We identify 7,000 iterations as the visual "Sweet Spot" for small-scale indoor scenes. While PSNR continues to climb up to 44.29 dB at 50k iterations, the model begins to overfit, leading to high-frequency artifacts. The 7k-iteration model maintains a more natural structural integrity.
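For reference, the PSNR values quoted here follow the standard definition PSNR = 10 · log10(MAX² / MSE). The sketch below is a minimal NumPy version, assuming images stored as floating-point arrays in [0, 1]; the function name is illustrative and is not taken from the training framework.

```python
import numpy as np


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


if __name__ == "__main__":
    reference = np.zeros((4, 4, 3))
    rendered = np.full((4, 4, 3), 0.1)  # uniform error of 0.1, so MSE = 0.01
    print(psnr(rendered, reference))
```

Because the scale is logarithmic, 44.29 dB corresponds to a root-mean-square pixel error of roughly 0.6% of the intensity range, so small PSNR gains at this level say little about perceived quality.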

Ablation Study: Impact of SOR Denoising

To bridge the gap between noisy SLAM outputs and high-fidelity rendering, we integrated Statistical Outlier Removal (SOR) as a refinement step. By pruning 7.3% of the raw points (from 1,000k to 927k) with a neighborhood of 20 points and a threshold of 1.25 standard deviations, we removed most "floaters", the unphysical artifacts that commonly plague 3DGS training. The quantitative impact on PSNR is minimal (44.37 vs. 44.29 dB), but the qualitative improvement is substantial: the SOR-refined pipeline yields a cleaner reconstruction and better visual stability, which matters because the human eye is highly sensitive to spatial artifacts.
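The SOR filter with these parameters can be sketched as below. This is an illustrative brute-force NumPy version, not the CloudCompare implementation: each point's mean distance to its 20 nearest neighbors is computed, and points whose mean distance exceeds the global mean plus 1.25 standard deviations are dropped. The pairwise distance matrix makes it suitable only for small clouds; a cloud of about one million points would need a spatial index such as a KD-tree.

```python
import numpy as np


def statistical_outlier_removal(points, k=20, n_sigma=1.25):
    """Drop points whose mean k-nearest-neighbor distance is unusually large.

    points: (N, 3) array with N > k.
    Returns (filtered_points, keep_mask).
    """
    # Full pairwise distance matrix: O(N^2) memory, fine for a small example.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=2)

    # After sorting each row, column 0 is the point's zero distance to itself.
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1]
    mean_knn = knn_dist.mean(axis=1)

    # Global threshold: mean + n_sigma * std of the per-point mean distances.
    threshold = mean_knn.mean() + n_sigma * mean_knn.std()
    keep = mean_knn <= threshold
    return points[keep], keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cluster = rng.normal(scale=0.1, size=(200, 3))
    floater = np.array([[10.0, 10.0, 10.0]])
    cloud = np.vstack([cluster, floater])
    filtered, keep = statistical_outlier_removal(cloud, k=20, n_sigma=1.25)
    print(cloud.shape[0] - filtered.shape[0], "point(s) removed")
```

A smaller multiplier prunes more aggressively. The reported setting removes (1,000k − 927k) / 1,000k = 7.3% of the points.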

The PSNR Trap — Overfitting vs. Generalization

We conducted an ablation study comparing two different capture strategies in the same meeting room to understand how data diversity affects 3DGS performance.

The Overfitting Phenomenon (Test 7): Despite achieving a remarkably high PSNR of 49.5 dB, Test 7 (stationary rotation, 11 frames) is a classic case of overfitting. With a very limited field of view, the model simply "memorizes" the training images rather than learning the scene's 3D geometry. The result is a high score but poor rendering quality from unseen viewpoints.

Robust Reconstruction (Test 9): In contrast, Test 9 (full walkthrough, 40 frames) yielded a lower PSNR of 42.2 dB. Because it covers the entire room from multiple angles, however, it generalizes far better and reconstructs a complete, navigable digital twin.

Conclusion: For practical Novel View Synthesis (NVS), a comprehensive walkthrough is far more valuable than a high-PSNR stationary capture. Evaluation should prioritize scene coverage and visual integrity over raw numerical metrics.
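One way to make PSNR reflect generalization rather than memorization is to score only frames that were withheld from training. The page does not state how its PSNR values were measured, so the sketch below is an assumption: it holds out every eighth frame, a commonly used convention in novel view synthesis benchmarks, and the function name and frame naming are hypothetical.

```python
def split_train_eval(frame_names, eval_every=8):
    """Hold out every `eval_every`-th frame (in sorted order) for evaluation."""
    ordered = sorted(frame_names)
    eval_frames = ordered[::eval_every]
    train_frames = [name for i, name in enumerate(ordered) if i % eval_every != 0]
    return train_frames, eval_frames


if __name__ == "__main__":
    # Hypothetical frame names for a 40-frame walkthrough such as Test 9.
    frames = [f"frame_{i:06d}.png" for i in range(40)]
    train, held_out = split_train_eval(frames)
    print(len(train), "training frames,", len(held_out), "held-out frames")
```

With an 11-frame stationary capture, this split leaves only two held-out views that lie very close to the training views, which is one reason such a capture can score well without reconstructing the room.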

Real-time Execution & Hardware Constraints

Interactive Demo
