VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians

ICML 2025

Machine Perception Lab, Wayne State University, Detroit, USA

VTGaussian-SLAM can reconstruct a 3D scene from an RGBD sequence.

(The colored points are Gaussian centers tied to the GT depth images.)

Abstract

Jointly estimating camera poses and mapping scenes from RGBD images is a fundamental task in simultaneous localization and mapping (SLAM). State-of-the-art methods employ 3D Gaussians to represent a scene and render these Gaussians through splatting for higher efficiency and better rendering. However, these methods cannot scale up to extremely large scenes, due to inefficient tracking and mapping strategies that need to keep all 3D Gaussians optimizable in limited GPU memory throughout training to maintain geometry and color consistency with previous RGBD observations. To resolve this issue, we propose novel tracking and mapping strategies that work with a novel 3D representation, dubbed view-tied 3D Gaussians, for RGBD SLAM systems. View-tied 3D Gaussians are simplified Gaussians tied to depth pixels, which removes the need to learn locations, rotations, and multi-dimensional variances. Tying Gaussians to views not only significantly saves storage but also allows us to employ many more Gaussians to represent local details within limited GPU memory. Moreover, our strategies remove the need to keep all Gaussians learnable throughout training, while improving rendering quality and tracking accuracy. We justify the effectiveness of these designs and report better rendering quality, tracking accuracy, and scalability than the latest methods on widely used benchmarks.
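
To make the representation concrete, the following is a minimal sketch of how view-tied Gaussians could be instantiated from one RGBD frame; it is an illustration under stated assumptions, not the released code. Each center is the back-projection of a valid depth pixel, so only color and opacity remain learnable, and the per-Gaussian scale here is a simple heuristic from depth and focal length. Function names and the attribute layout are assumptions.

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a depth image (H, W) to camera-space 3D points (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=-1)

def init_view_tied_gaussians(depth, rgb, K, cam_to_world):
    """Tie one simplified Gaussian to every pixel with a valid depth value.

    Centers come directly from back-projected depth, so locations, rotations,
    and multi-dimensional variances are not learned; only color and opacity
    stay learnable (the scale below is a heuristic, not the paper's rule).
    """
    valid = depth.reshape(-1) > 0
    pts_cam = backproject_depth(depth, K)[valid]
    pts_world = pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return {
        "centers": pts_world,                           # fixed, tied to depth pixels
        "colors": rgb.reshape(-1, 3)[valid] / 255.0,    # learnable
        "opacities": np.full(pts_world.shape[0], 0.5),  # learnable
        "scales": pts_cam[:, 2] / K[0, 0],              # heuristic per-pixel footprint
    }
```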

Method

Overview of our method. We organize view-tied 3D Gaussians from several consecutive frames into a section, allowing us to retain as many Gaussians as GPU memory permits to represent a local area. This lets us access these Gaussians more efficiently and, more importantly, enables more robust completion of missing depth by using depth information from neighboring frames. In each section, we mark the first frame as a head to distinguish it from regular frames, since they use different Gaussian initialization strategies in mapping. For tracking the latest frame, we select Gaussians in a section, render them from the camera pose initialized under the constant-speed assumption, and optimize the pose by minimizing rendering errors with respect to the latest frame. If the latest frame is the head of a new section, as shown in (a), we select the Gaussians from a previous section according to visibility. If the latest frame is not a head but a regular frame in the current section, as shown in (c), we select the Gaussians in this section, which render with higher quality. For mapping the scene with the latest frame, if the latest frame is the head of a new section, as shown in (b), we initialize Gaussians centered at all pixels with valid depth values. If the latest frame is instead a regular frame in an existing section, as shown in (d), we only initialize Gaussians as a complement in areas where pixels have valid depth values that the existing Gaussians in the current section cannot cover.
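
The sketch below illustrates two ingredients of this per-frame loop under simplifying assumptions: the constant-speed pose initialization (repeating the most recent relative motion) and a naive coverage test standing in for the criterion that decides where a regular frame needs complementary Gaussians. Camera poses are assumed to be 4x4 matrices, K is a pinhole intrinsic matrix, and the pixel-radius coverage heuristic is illustrative rather than the exact rule.

```python
import numpy as np

def constant_speed_init(pose_prev, pose_last):
    """Predict the next camera-to-world pose by applying the last relative motion once more."""
    return pose_last @ (np.linalg.inv(pose_prev) @ pose_last)

def uncovered_valid_depth_mask(centers_world, depth, K, world_to_cam, radius=1):
    """Mask of pixels with valid depth that existing Gaussians do not cover.

    Existing Gaussian centers are projected into the latest frame; a pixel counts
    as covered if a projected center falls within `radius` pixels of it. New
    complementary Gaussians would then be initialized only at masked pixels.
    """
    H, W = depth.shape
    pts_cam = centers_world @ world_to_cam[:3, :3].T + world_to_cam[:3, 3]
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]          # keep points in front of the camera
    u = np.round(pts_cam[:, 0] / pts_cam[:, 2] * K[0, 0] + K[0, 2]).astype(int)
    v = np.round(pts_cam[:, 1] / pts_cam[:, 2] * K[1, 1] + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    covered = np.zeros((H, W), dtype=bool)
    for du in range(-radius, radius + 1):            # dilate coverage by a small pixel radius
        for dv in range(-radius, radius + 1):
            uu = np.clip(u[inside] + du, 0, W - 1)
            vv = np.clip(v[inside] + dv, 0, H - 1)
            covered[vv, uu] = True
    return (depth > 0) & ~covered
```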

Video

Visual Comparisons

Ours
NICE-SLAM
Ours
Gaussian-SLAM
Ours
SplaTAM
Ours
Ground Truth

Acknowledgement

This project was partially supported by an NVIDIA academic award and a Richard Barber research award.

BibTeX

@InProceedings{Hu2025VTGSSLAM,
  title     = {VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians},
  author    = {Hu, Pengchong and Han, Zhizhong},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  year      = {2025}
}