Source: 3DCV
Add assistant: cv3d008, note: direction + school/company + nickname, to add you to the group. A 3D vision industry subgroup is attached at the end of the article.
0. Paper Information
Title: LGU-SLAM: Learnable Gaussian Uncertainty Matching with Deformable Correlation Sampling for Deep Visual SLAM
Authors: Yucheng Huang, Luping Ji, Hudong Liu, Mao Ye
Institution: University of Electronic Science and Technology of China
Original link: https://arxiv.org/abs/2410.23231
Code link: https://github.com/uestc-nnlab/lgu-slam
1. Abstract
Deep visual Simultaneous Localization and Mapping (SLAM) techniques, such as DROID, have made significant progress by leveraging dense optical flow fields in deep visual odometry. Generally, they rely heavily on global visual similarity matching. However, interference from ambiguous similarity in uncertain regions often introduces excessive noise into the correspondences, ultimately misleading SLAM in geometric modeling. To address this issue, we propose Learnable Gaussian Uncertainty (LGU) matching, which focuses on precise correspondence construction. In our scheme, a learnable 2D Gaussian uncertainty model is designed to associate matching frame pairs; it generates input-dependent Gaussian distributions for each correspondence map. Additionally, a multi-scale deformable correlation sampling strategy adaptively fine-tunes the sampling in each direction on top of the prior lookup ranges, achieving reliable correlation construction. Moreover, a KAN-bias GRU component improves temporal iterative enhancement, completing complex spatiotemporal modeling with limited parameters. Extensive experiments on real and synthetic datasets validate the effectiveness and superiority of the method.
2. Introduction
Visual SLAM, which is implemented through sensors like cameras, tracks the long-term pose of agents and builds a map of the surrounding scene in a structure-from-motion manner. As an important supporting technology, visual SLAM can greatly promote the development of applications in fields such as autonomous driving, autonomous navigation, and virtual reality.
Modern visual SLAM adopts optimization-based solutions, in which camera poses and the geometric map are jointly optimized through bundle adjustment. Depending on how correspondences are constructed, optimization-based SLAM can be roughly divided into three categories. The first is feature matching methods, such as the ORB-SLAM series. Traditional sparse feature matching builds correspondences with handcrafted descriptors (such as ORB); to enhance semantic representation, later work employs deep learning for feature extraction and matching. These indirect feature matching methods offer fast online tracking but may fail in texture-less areas, and because the matching is sparse, they cannot construct dense maps.
The second category is photometric direct methods. These schemes work well in smooth areas, but very high resolutions lead to a massive computational load. The third category is optical flow methods, which mine correspondences between two frames by predicting optical flow fields, thereby avoiding explicit feature matching. DeMoN and DeepTAM regress optical flow and pose changes directly, but the lack of prior constraints leads to unsatisfactory localization accuracy. As a milestone of optical flow methods, DROID-SLAM constructs dense optical flow fields using visual correlation volumes and employs deep dense bundle adjustment (DBA) to iteratively optimize depth and pose estimates. GO-SLAM likewise uses correlation volumes to compute dense optical flow fields that guide pose optimization in the DBA layer, and achieves high-quality dense mapping using NeRF.
In particular, typical optical flow methods (like DROID) are built upon the RAFT optical flow algorithm. They construct correlation volumes by computing global visual similarity between input frame pairs and sample the correlation volume at multiple scales over a predetermined range. The potential issues in this design are evident. First, in the global visual similarity calculation, each feature element in one frame attends equally to all elements in the other frame, which conflicts with the prior that the likelihood of an element reappearing at a different position in the next frame should follow a 2D Gaussian distribution. BRAFT exploits this prior by assuming that each correspondence map follows a fixed Gaussian distribution, which allows distant yet highly similar outliers to be filtered manually; however, this fixed assumption inevitably diverges from reality, and it cannot describe the uncertainty of ambiguous regions. Second, in these optical flow methods the sampling directions and steps at each scale are fixed; when the variance of the visual similarity distribution within the fixed sampling range is too large, the sampled results can be heavily contaminated by noise.
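To make the fixed-range lookup described above concrete, here is a minimal PyTorch-style sketch of a RAFT-like correlation pyramid and its fixed-window multi-scale lookup. Function names, the lookup radius, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a RAFT-style correlation volume and fixed-range multi-scale lookup.
import torch
import torch.nn.functional as F

def build_corr_pyramid(f1, f2, num_levels=4):
    # f1, f2: (C, H, W) feature maps of a matching frame pair
    C, H, W = f1.shape
    # Global visual similarity: every element of f1 attends to every element of f2
    corr = torch.einsum("chw,cuv->hwuv", f1, f2) / C**0.5   # (H, W, H, W)
    corr = corr.reshape(H * W, 1, H, W)
    # Pool the target dimensions to build a multi-scale correlation pyramid
    return [F.avg_pool2d(corr, 2**i) if i > 0 else corr for i in range(num_levels)]

def lookup(pyramid, coords, radius=3):
    # coords: (H*W, 2) predicted correspondence (x, y) for each source element
    out = []
    for lvl, corr in enumerate(pyramid):
        n, _, h, w = corr.shape
        # Fixed sampling window of (2r+1)^2 offsets around the scaled center
        d = torch.arange(-radius, radius + 1, dtype=torch.float32, device=corr.device)
        delta = torch.stack(torch.meshgrid(d, d, indexing="ij"), dim=-1)  # (dy, dx)
        center = coords.view(n, 1, 1, 2) / 2**lvl
        grid = center + delta.flip(-1)                       # offsets in (x, y) order
        # Normalize to [-1, 1] for bilinear grid_sample
        grid = 2 * grid / torch.tensor([w - 1, h - 1], device=corr.device) - 1
        out.append(F.grid_sample(corr, grid, align_corners=True).view(n, -1))
    return torch.cat(out, dim=-1)

pyr = build_corr_pyramid(torch.randn(64, 32, 32), torch.randn(64, 32, 32))
feat = lookup(pyr, torch.rand(32 * 32, 2) * 31)
print(feat.shape)   # torch.Size([1024, 196]) = 4 levels x 49 samples
```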
To address these issues, we propose a novel LGU-SLAM scheme with reliable correspondence construction capabilities. Inspired by 3DGS, we design a learnable Gaussian uncertainty method for robust visual similarity matching. Our design motivation is as follows. When computing the correlation volume, each element in the previous frame obtains a correspondence map through cross-attention with the next frame. We design a learnable 2D Gaussian uncertainty mask to weight each correspondence map, where the expectation and variance of each 2D Gaussian are input-dependent and predicted by a multi-layer perceptron (MLP). Guided by our uncertainty self-supervision loss, the centers of the 2D Gaussians are allowed to move towards the reliable visual similarity regions of the correspondence map. To mitigate the negative impact of ambiguous regions in the correlation volume, the Gaussian parameters are adjusted end-to-end, so that ambiguous regions receive larger variances and therefore greater uncertainty and lower attention weights. To realize this learnable mask, the entire computation chain is designed to be differentiable, enabling each 2D Gaussian to be adjusted in a posterior, data-driven manner.
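As an illustration of how such an input-dependent Gaussian mask could be realized, the following hedged sketch lets an MLP predict the mean and variance of a 2D Gaussian for each correspondence map and uses the resulting mask to down-weight uncertain regions. Module names, the sigmoid/exponential parameterization, and the self-supervision loss (omitted here) are our assumptions, not the paper's code.

```python
# Sketch of a learnable 2D Gaussian uncertainty mask over correspondence maps.
import torch
import torch.nn as nn

class GaussianUncertainty(nn.Module):
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        # MLP predicting (mu_x, mu_y, log_sigma_x, log_sigma_y) per source element
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 4),
        )

    def forward(self, src_feat, corr):
        # src_feat: (N, C)   features of the N source-frame elements
        # corr:     (N, H, W) correspondence map of each source element in the target frame
        N, H, W = corr.shape
        params = self.mlp(src_feat)                       # (N, 4), input-dependent
        mu = torch.sigmoid(params[:, :2])                 # centers normalized to [0, 1]
        sigma = torch.exp(params[:, 2:]).clamp(min=1e-3)  # positive standard deviations

        ys = torch.linspace(0, 1, H, device=corr.device)
        xs = torch.linspace(0, 1, W, device=corr.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")    # (H, W) pixel grid

        # Axis-aligned 2D Gaussian mask per source element; fully differentiable,
        # so means/variances are adjusted end-to-end by the data
        dx = (gx[None] - mu[:, 0, None, None]) / sigma[:, 0, None, None]
        dy = (gy[None] - mu[:, 1, None, None]) / sigma[:, 1, None, None]
        mask = torch.exp(-0.5 * (dx ** 2 + dy ** 2))      # (N, H, W)

        # Ambiguous regions get larger sigma -> flatter mask -> lower attention weight
        return corr * mask

layer = GaussianUncertainty(feat_dim=64)
weighted = layer(torch.randn(10, 64), torch.randn(10, 48, 64))
print(weighted.shape)  # torch.Size([10, 48, 64])
```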
Furthermore, inspired by deformable convolutions, we design a deformable multi-scale correlation sampling strategy to help the model accurately locate correlation volumes. In our strategy, an MLP-based offset decoder predicts the sampling offsets at the top scale by fusing the matched frame pair, and a residual decoder predicts offset residuals, which are added to the top-scale offsets and divided by the downsampling step to obtain the final sampling offsets at the current scale. This lets the prior information provided by the top-level offsets regularize the lower scales and prevents the model from falling into suboptimal solutions. At the same time, we use the uncertainty of the visual similarity distribution within the original sampling range to design a filtering mechanism for each predicted offset. Finally, we achieve more reliable temporal modeling under limited parameters by adding a KAN-based bias term to the GRU, the key component of DROID's temporal iterative model. Integrating these techniques into the complete LGU-SLAM enables robust real-time localization and mapping.
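The sketch below illustrates the deformable-lookup idea on a single pyramid level: an MLP-style offset decoder, fed the fused frame-pair features, predicts a per-direction offset for every point of the fixed sampling window, and a simple variance-based gate stands in for the paper's offset filtering mechanism. All names, the gating rule, and the threshold are assumptions for illustration only.

```python
# Sketch of deformable correlation sampling on one pyramid level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableLookup(nn.Module):
    def __init__(self, feat_dim, radius=3):
        super().__init__()
        self.radius = radius
        k = (2 * radius + 1) ** 2
        # Offset decoder: fused frame-pair features -> (dx, dy) per sampling point
        self.offset_decoder = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2 * k),
        )

    def forward(self, corr, coords, fi, fj):
        # corr: (N, 1, H, W) one pyramid level;  coords: (N, 2) lookup centers
        # fi, fj: (N, C) matched features of the frame pair
        n, _, h, w = corr.shape
        r = self.radius
        d = torch.arange(-r, r + 1, dtype=torch.float32, device=corr.device)
        dyx = torch.stack(torch.meshgrid(d, d, indexing="ij"), dim=-1)
        base = coords.view(n, 1, 1, 2) + dyx.flip(-1)        # fixed window, (x, y) order

        offsets = self.offset_decoder(torch.cat([fi, fj], dim=-1))
        offsets = offsets.view(n, 2 * r + 1, 2 * r + 1, 2)

        # Assumed filtering rule: apply offsets only where similarity within the
        # original (fixed) window has high variance, i.e. the prior range is unreliable
        norm = torch.tensor([w - 1, h - 1], device=corr.device)
        prior = F.grid_sample(corr, 2 * base / norm - 1, align_corners=True)
        gate = (prior.var(dim=(2, 3), keepdim=True) > 1.0).float().view(n, 1, 1, 1)

        grid = 2 * (base + gate * offsets) / norm - 1
        return F.grid_sample(corr, grid, align_corners=True).view(n, -1)

layer = DeformableLookup(feat_dim=64)
out = layer(torch.randn(100, 1, 32, 32), torch.rand(100, 2) * 31,
            torch.randn(100, 64), torch.randn(100, 64))
print(out.shape)  # torch.Size([100, 49])
```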
3. Performance Showcase
Through orthogonal gains achieved in correspondence construction and temporal iteration, our method reduces the average trajectory error by 39% (ATE: from 0.43 to 0.26) compared to the baseline. On the TartanAir test set, we also demonstrate stronger robustness in complex outdoor scenes compared to the baseline, especially in the “ME001” and “ME004” scenes, where LGU-SLAM exhibits more stable tracking performance and lower trajectory errors, as shown in Figure 7.

4. Main Contributions
The main contributions of our work can be summarized as follows:
(1) We propose a learnable Gaussian uncertainty scheme to adjust the expectation and variance of each 2D Gaussian. It suppresses interference from ambiguous regions through similarity uncertainty.
(2) We design a multi-scale deformable correlation sampling method. It uses MLP to predict the sampling offsets for each scale, enabling the model to obtain reliable context when sampling correlation volumes.
(3) We adopt a KAN-bias GRU to achieve more reliable temporal modeling under limited parameters.
(4) We conduct extensive experiments with our LGU-SLAM scheme on four benchmarks. These experiments demonstrate the effectiveness and superiority of our scheme.
5. Methodology
In this section, we detail the LGU-SLAM scheme as shown in Figure 1. The overall framework of the scheme is divided into three parts. The first part is deep feature extraction, responsible for reducing resolution while enhancing representation capabilities. The second part is correspondence construction, which utilizes learnable Gaussian uncertainty to achieve high-confidence element-wise matching between frame pairs. Additionally, it employs multi-scale deformable correlation sampling to adaptively adjust the context mining range. Finally, the temporal iterative module enhances complex temporal modeling under limited parameters using KAN-bias GRU.

Figure 1: Global overview of our proposed LGU-SLAM. (1) Deep feature extraction: video frames are used as network input for deep semantic abstraction. (2) Correspondence construction: the feature sequence f_l is first indexed with a bipartite graph to compute the correlation volume, and an MLP-based decoder outputs a 2D Gaussian for every correspondence map to generate the Gaussian uncertainty masks that suppress the visual similarity of outliers; the proposed multi-scale deformable correlation sampling then enhances context construction within an input-dependent sampling range. (3) Temporal iterative enhancement: the designed KAN-bias GRU performs temporal iterative enhancement and is combined with dense bundle adjustment (DBA) to optimize pose and depth.
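To give a feel for the KAN-bias GRU in part (3), here is a hedged sketch of a ConvGRU whose gates receive an extra bias produced by a simplified KAN-style layer, with the learnable per-channel univariate functions realized as a small radial-basis expansion. This is only an illustration of the idea under our own assumptions, not the authors' implementation.

```python
# Sketch of a ConvGRU with a simplified KAN-style bias added to each gate.
import torch
import torch.nn as nn

class KANBias(nn.Module):
    """Per-channel learnable univariate function b_c(x) = sum_k w_ck * phi_k(x)."""
    def __init__(self, channels, num_basis=8):
        super().__init__()
        self.register_buffer("centers", torch.linspace(-1, 1, num_basis))
        self.weights = nn.Parameter(torch.zeros(channels, num_basis))

    def forward(self, x):                                    # x: (B, C, H, W)
        # Gaussian radial basis over the tanh-squashed input; few parameters per channel
        z = torch.tanh(x).unsqueeze(-1)                      # (B, C, H, W, 1)
        phi = torch.exp(-((z - self.centers) ** 2) / 0.1)    # (B, C, H, W, K)
        return torch.einsum("bchwk,ck->bchw", phi, self.weights)

class KANBiasConvGRU(nn.Module):
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.bias_z = KANBias(hidden_dim)
        self.bias_r = KANBias(hidden_dim)
        self.bias_q = KANBias(hidden_dim)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx) + self.bias_z(h))   # update gate + KAN bias
        r = torch.sigmoid(self.convr(hx) + self.bias_r(h))   # reset gate + KAN bias
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)) + self.bias_q(h))
        return (1 - z) * h + z * q

gru = KANBiasConvGRU(hidden_dim=128, input_dim=64)
h = gru(torch.zeros(1, 128, 30, 40), torch.randn(1, 64, 30, 40))
print(h.shape)  # torch.Size([1, 128, 30, 40])
```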

To construct the multi-scale deformable sampling, temporal fusion is first achieved by channel-concatenating f^k_i and f^k_j, and a 2D convolution outputs the 2D sampling offsets. As shown in Figure 3, an offset decoder F_ofs is implemented for the top scale using convolution layers. The original multi-scale sampling starts from the prior: the sampling coordinates for each scale are obtained by dividing P^k_ij ∈ R^(H×W×2) by the downsampling step. Although bilinear interpolation can approximate the sampling results at floating-point coordinates, it is still difficult to align them well with the original scale. Therefore, when constructing the multi-scale deformable sampling, we adopt a learnable residual structure during downsampling for offset construction.
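A small sketch of this residual offset construction across scales, under the assumption that offsets are kept at source resolution: the top-scale offsets act as a prior, each lower scale predicts only a residual, and the sum is divided by that scale's downsampling step. Decoder names and architectures are illustrative assumptions.

```python
# Sketch of residual offset construction across pyramid scales.
import torch
import torch.nn as nn

class MultiScaleOffsets(nn.Module):
    def __init__(self, feat_dim, num_levels=4):
        super().__init__()
        # Offset decoder F_ofs for the top scale
        self.offset_decoder = nn.Conv2d(2 * feat_dim, 2, 3, padding=1)
        # One residual decoder per lower scale
        self.residual_decoders = nn.ModuleList(
            [nn.Conv2d(2 * feat_dim, 2, 3, padding=1) for _ in range(num_levels - 1)]
        )

    def forward(self, fi, fj):
        # fi, fj: (B, C, H, W) features of the matched frame pair
        fused = torch.cat([fi, fj], dim=1)            # temporal fusion by concatenation
        p_top = self.offset_decoder(fused)            # (B, 2, H, W) top-scale offsets
        offsets = [p_top]
        for lvl, dec in enumerate(self.residual_decoders, start=1):
            step = 2 ** lvl                           # downsampling step of this scale
            residual = dec(fused)                     # per-scale offset residual
            # Top-scale prior regularizes the lower scale; dividing by the step
            # expresses the offset in that scale's coordinates
            offsets.append((p_top + residual) / step)
        return offsets

dec = MultiScaleOffsets(feat_dim=64)
offs = dec(torch.randn(1, 64, 48, 64), torch.randn(1, 64, 48, 64))
print([o.shape for o in offs])  # one (1, 2, 48, 64) offset field per scale
```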

6. Experimental Results
TartanAir visual SLAM challenge results: The TartanAir visual SLAM challenge is a synthetic benchmark proposed for the ECCV 2020 SLAM competition; it contains a large number of complex outdoor and indoor scenes, making the SLAM task highly challenging. We validate LGU-SLAM on its official test splits and compare it with current state-of-the-art (SOTA) algorithms. As shown in Table I, our algorithm outperforms the baseline DROID-SLAM on the test benchmark, reducing the average trajectory error by 0.24. Thanks to its sparse patch-based correspondence construction for deep visual odometry, DPVO's average performance surpasses ours; however, it cannot output dense depth, which makes high-quality dense mapping difficult for SLAM tasks.
ETH3D-SLAM benchmark results: ETH3D-SLAM is a benchmark for RGB-D SLAM algorithms; testers upload results to the official site, which publishes a live online leaderboard. Consistent with the baseline methods, our LGU-SLAM achieves excellent performance without fine-tuning on the ETH3D-SLAM training set, demonstrating good generalization. Figure 5 shows the test results, where we achieve the second-highest score, behind only DVI-SLAM. Compared to GO-SLAM, our AUC improves by 7% under the "maximum error = 8 cm" criterion and by 4% under the "maximum error = 2 cm" criterion.

EuRoC benchmark results: The EuRoC dataset was collected with drones in real, large-scale indoor scenes. We test both monocular visual odometry and monocular visual SLAM on this dataset, with results shown in Table II. Our solution outperforms other state-of-the-art methods in monocular visual odometry: compared to DVI-VO, LGU's average ATE is 3% lower. In monocular visual SLAM, our average ATE is likewise lower than that of current top algorithms such as DPV-SLAM (by 22%).

TUM-RGBD benchmark results: TUM-RGBD is similar to EuRoC but targets small indoor scenes. Following previous work, we select the freiburg1 subset as the test set to validate the robustness of our LGU-SLAM. As shown in Table III, our trajectory prediction error is lower than that of current top methods such as GO-SLAM (by roughly 11%) and DPV-SLAM (by roughly 60%).

7. Conclusion & Future Work
We propose LGU-SLAM, primarily composed of learnable Gaussian uncertainty combined with multi-scale deformable correlation sampling to establish robust correspondences for SLAM. It can improve localization accuracy while ensuring dense mapping. Additionally, we employ dense optical flow fields to associate two frames, enabling SLAM to predict dense depth maps and providing feasible self-supervised upstream inputs for current reconstruction methods. Experiments show that our LGU-SLAM is effective and superior to comparative methods. Its main drawbacks are high memory consumption and low frame rates. In the future, research on lighter LGU-SLAM schemes combined with advanced reconstruction methods is expected to balance pose tracking and high-quality dense mapping.
Readers interested in more experimental results and article details can read the original paper~
This article is for academic sharing only. If there is any infringement, please contact us to delete the article.
A 3D vision exchange group has been established!
We have established multiple communities in the field of 3D vision, covering 2D computer vision, cutting-edge technologies, industrial 3D vision, SLAM, autonomous driving, 3D reconstruction, drones, and more. The subgroups include:
Industrial 3D Vision: camera calibration, stereo matching, 3D point clouds, structured light, robotic arm grasping, defect detection, 6D pose estimation, phase shifting, Halcon, photogrammetry, array cameras, photometric stereo vision, etc.
SLAM: visual SLAM, laser SLAM, semantic SLAM, filtering algorithms, multi-sensor fusion, multi-sensor calibration, dynamic SLAM, MOT SLAM, NeRF SLAM, robot navigation, etc.
Autonomous Driving: depth estimation, Transformer, millimeter-wave radar / LiDAR / camera sensors, multi-sensor calibration, multi-sensor fusion, general autonomous driving groups, 3D object detection, path planning, trajectory prediction, 3D point cloud segmentation, model deployment, lane detection, Occupancy, target tracking, etc.
3D Reconstruction: 3DGS, NeRF, multi-view geometry, OpenMVS, MVSNet, colmap, texture mapping, etc.
Drones: quadrotor modeling, drone flight control, etc.
2D Computer Vision: image classification/segmentation, object detection, medical imaging, GAN, OCR, 2D defect detection, remote sensing mapping, super-resolution, face detection, action recognition, model quantization pruning, transfer learning, human pose estimation, etc.
Cutting-edge: embodied intelligence, large models, Mamba, diffusion models, etc.
In addition to these, there are also job seeking, hardware selection, visual product implementation, products, industry news and other exchange groups.
Add assistant: dddvision, note: research direction + school/company + nickname (e.g. 3D point cloud + Tsinghua + little strawberry), to add you to the group.

「3D Vision from Beginner to Master」 Knowledge Community
「3D Vision from Beginner to Master」 Knowledge Community, has been established for 6 years, and the materials in the community include: nearly 20 exclusive video courses (including structured light 3D reconstruction, camera calibration, SLAM, depth estimation, 3D object detection, 3DGS top conference reading courses, 3D point clouds, etc.), project docking, summary of 3D vision learning routes, latest top conference papers & codes, latest modules in the 3D vision industry, high-quality 3D vision source code collection, book recommendations, programming basics & learning tools, practical projects & assignments, job seeking, interview experiences & questions, etc. Welcome to join the knowledge community of 3D Vision from Beginner to Master to learn and progress together.
Official website: www.3dcver.com
Embodied intelligence, 3DGS, NeRF, structured light, phase shifting, robotic arm grasping, point cloud practice, Open3D, defect detection, BEV perception, Occupancy, Transformer, model deployment, 3D object detection, depth estimation, multi-sensor calibration, planning and control, drone simulation, C++, Python for 3D vision, dToF, camera calibration, ROS2, robot control planning, LeGO-LOAM, multi-modal fusion SLAM, LOAM-SLAM, indoor and outdoor SLAM, VINS-Fusion, ORB-SLAM3, MVSNet 3D reconstruction, colmap, line and surface structured light, hardware structured light scanners, etc.

3D Vision Module Selection: www.3dcver.com