Visualization of reconstruction in the context of SLAM.
(Blue in the attention visualization indicates that the attention mechanism pays more attention to the depth fusion prior.)
Learning neural implicit representations has achieved remarkable performance in 3D reconstruction from multi-view images. Current methods use volume rendering to render implicit representations into either RGB or depth images that are supervised by the multi-view ground truth. However, rendering one view at a time leaves the depth supervision incomplete at holes and unaware of occluded structures, which severely affects the accuracy of geometry inference via volume rendering. To resolve this issue, we propose to learn neural implicit representations from multi-view RGBD images through volume rendering with an attentive depth fusion prior. Our prior allows neural networks to sense coarse 3D structures from a Truncated Signed Distance Function (TSDF) fused from all available depth images during rendering. The TSDF gives access to depth missing at holes in a single depth image and to occluded parts that are invisible from the current view. Through a novel attention mechanism, neural networks directly combine the depth fusion prior with the inferred occupancy to form the learned implicit function. Our attention mechanism works with either a one-time fused TSDF that represents a whole scene or an incrementally fused TSDF that represents a partial scene in the context of Simultaneous Localization and Mapping (SLAM). Our evaluations on widely used benchmarks, including synthetic and real-world scans, show our superiority over the latest neural implicit methods.
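As a rough illustration of the attentive combination described above, the following PyTorch-style sketch blends a coarse occupancy derived from interpolated TSDF values with an occupancy inferred from grid features, using attention weights predicted by a small network. All module names, dimensions, and the sigmoid mapping from signed distance to occupancy are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed, not the authors' code) of an attentive depth fusion prior:
# a coarse occupancy from a fused TSDF grid is combined with a learned occupancy
# via attention weights predicted from the query's features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDepthFusionPrior(nn.Module):
    def __init__(self, feat_dim: int = 32):
        super().__init__()
        # Predicts two attention weights per query: one for the TSDF-based
        # coarse occupancy, one for the occupancy inferred from grid features.
        self.attn = nn.Sequential(
            nn.Linear(feat_dim + 2, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )
        # Occupancy decoder operating on interpolated geometry features.
        self.occ_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, feats: torch.Tensor, tsdf_vals: torch.Tensor) -> torch.Tensor:
        """feats: (N, feat_dim) interpolated geometry features at query points.
        tsdf_vals: (N,) TSDF values interpolated from the fused grid."""
        occ_learned = torch.sigmoid(self.occ_head(feats)).squeeze(-1)   # (N,)
        # Coarse occupancy prior: map signed distance to an occupancy-like value
        # (negative distance = inside the surface = high occupancy). Assumed mapping.
        occ_prior = torch.sigmoid(-tsdf_vals)                           # (N,)
        attn_in = torch.cat([feats, occ_prior[:, None], occ_learned[:, None]], dim=-1)
        w = F.softmax(self.attn(attn_in), dim=-1)                       # (N, 2)
        # Attention-weighted combination of the prior and the learned occupancy.
        return w[:, 0] * occ_prior + w[:, 1] * occ_learned
```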
Overview of our method. We learn the occupancy function through volume rendering using RGBD images as supervision in the context of SLAM. For each query sampled along the rays shot from the current view, we employ learnable feature grids covering the scene to interpolate its hierarchical geometry features and its color feature. For queries inside the bandwidth of a TSDF grid fused from the available depth images, we use the value interpolated from the TSDF grid as a prior for coarse occupancy estimation. The prior is attended by a neural function that determines the occupancy by combining the coarse estimation from the currently fused geometry with the learned occupancy, using learned attention weights. Finally, we use the occupancy function and the color function to render color and depth images through volume rendering. With the learned occupancy function, we run marching cubes to reconstruct the surface.
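To make the rendering step concrete, here is a minimal occupancy-based volume rendering sketch for a single ray: per-sample occupancies are converted into termination weights, which then integrate per-sample colors and depths so both can be supervised by the input RGBD images. Function and variable names are assumptions for illustration, not the paper's code.

```python
# Minimal sketch of rendering color and depth from per-sample occupancy along one ray.
import torch

def render_ray(occ: torch.Tensor, colors: torch.Tensor, z_vals: torch.Tensor):
    """occ: (S,) occupancy in [0, 1] for S samples ordered near-to-far.
    colors: (S, 3) per-sample colors, z_vals: (S,) per-sample depths."""
    # Transmittance: probability that all samples before the current one are free.
    trans = torch.cumprod(
        torch.cat([torch.ones(1, device=occ.device), 1.0 - occ + 1e-10])[:-1], dim=0)
    # Weight of each sample: the ray terminates there (occupied, and not blocked earlier).
    weights = occ * trans
    rgb = (weights[:, None] * colors).sum(dim=0)   # rendered color
    depth = (weights * z_vals).sum(dim=0)          # rendered depth
    return rgb, depth, weights
```

The rendered color and depth can be compared against the input RGBD frame as supervision; once training converges, the learned occupancy field is meshed with marching cubes.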
(Red in error maps indicates large errors.)
@inproceedings{Hu2023LNI-ADFP,
title = {Learning Neural Implicit through Volume Rendering with Attentive Depth Fusion Priors},
author = {Hu, Pengchong and Han, Zhizhong},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2023}
}