Abstract
In recent years, vision-based sensors have demonstrated significant improvements in performance, accuracy, and efficiency within SLAM systems. In this regard, visual SLAM (VSLAM) refers to SLAM techniques that use cameras for pose estimation and map generation. Numerous research works have shown that VSLAM can outperform traditional methods that rely solely on specific sensors such as LiDAR, even at lower cost. VSLAM utilizes different types of cameras (e.g., monocular, stereo, and RGB-D), tested across various datasets (e.g., KITTI, TUM RGB-D, and EuRoC) and environments (e.g., indoor and outdoor), employing various algorithms and methodologies to better interpret the environment. These developments have garnered extensive attention from researchers and produced numerous classic VSLAM algorithms. The primary aim of this paper is to present the latest advancements in VSLAM systems and discuss existing challenges and future trends. The paper conducts an in-depth survey of 45 influential papers published in the VSLAM domain, categorizing these methods based on different characteristics, including novelty domain, objectives, algorithms used, and semantic level. Finally, the paper discusses current trends and future directions to aid researchers in their studies.

Figure 1 illustrates the overall architecture of standard VSLAM methods. Rather than relying solely on visual data, the system's inputs can be fused with data from other sensors, such as Inertial Measurement Units (IMUs) and LiDAR, to provide additional information. Furthermore, depending on whether a direct or indirect method is used in the VSLAM paradigm, the functionality of the visual feature processing module may be modified or omitted; for instance, the "feature processing" stage is only utilized in indirect methods. Another factor is the use of specific modules, such as loop detection and bundle adjustment, to improve performance.
Development of Visual SLAM Algorithms
VSLAM systems have matured over the past few years, with several frameworks playing crucial roles in this development. Figure 2 showcases milestone algorithms in the evolution of visual SLAM.

The first real-time monocular VSLAM was proposed by Davison in 2007, named MonoSLAM. This indirect framework uses an Extended Kalman Filter (EKF) to estimate camera motion and 3D elements in the real world. Despite lacking global optimization and loop detection modules, MonoSLAM played a pioneering role in the VSLAM domain. However, the maps reconstructed with this method only included landmarks, lacking further details about the area. Klein et al. proposed Parallel Tracking and Mapping (PTAM) in the same year, dividing the VSLAM system into two main threads: tracking and mapping. PTAM laid the foundation for many subsequent works. The main idea of PTAM is to reduce computational costs and achieve real-time performance through parallel processing: while tracking estimates camera motion in real time, mapping predicts the 3D positions of feature points. PTAM was also the first to utilize bundle adjustment (BA) for the joint optimization of camera poses and the 3D map. It employed the Features from Accelerated Segment Test (FAST) corner detector for keypoint matching and tracking. Although this algorithm outperformed MonoSLAM, its design was complex and required user input in the first stage. Newcombe et al. introduced a direct method for estimating depth values and motion parameters to construct maps in 2011, known as Dense Tracking and Mapping (DTAM). DTAM is a real-time framework comprising dense mapping and dense tracking modules, determining camera poses by aligning entire frames with given depth maps; to construct environmental maps, these two stages estimate scene depth and motion parameters separately. While DTAM can provide detailed maps, its real-time execution requires high computational costs. As another indirect method in the field of 3D mapping and pixel-based optimization, Endres et al. proposed a method applicable to RGB-D cameras in 2013. Their method is real-time and targets low-cost embedded systems and small robots, but fails to produce accurate results in featureless or otherwise challenging scenes. The same year, Salas-Moreno et al. proposed SLAM++, a pioneering work that utilizes semantic information in a real-time SLAM framework. SLAM++ adopts RGB-D sensor outputs and performs 3D camera pose estimation and tracking to form a pose graph, then optimizes the predicted poses by merging the relative 3D poses obtained from semantic objects in the scene.
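For context, the bundle adjustment step that PTAM introduced to VSLAM jointly refines camera poses and 3D map points by minimizing reprojection error over all keyframe observations; a standard formulation (notation ours, not drawn from the original papers) is

$$\{T_i^*, X_j^*\} = \arg\min_{\{T_i\},\{X_j\}} \sum_{i,j} \rho\!\left( \left\| \mathbf{u}_{ij} - \pi\!\left(T_i X_j\right) \right\|^2 \right),$$

where $T_i \in \mathrm{SE}(3)$ is the pose of keyframe $i$, $X_j$ a 3D map point, $\mathbf{u}_{ij}$ the observed 2D feature location of point $j$ in keyframe $i$, $\pi(\cdot)$ the camera projection function, and $\rho$ a robust cost (e.g., Huber) that downweights outlier matches.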
As VSLAM baselines matured, researchers focused on improving the performance and accuracy of these systems. Forster et al. proposed a hybrid VO method in 2014 called Semi-direct Visual Odometry (SVO). SVO combines feature-based and direct methods for sensor motion estimation and mapping. SVO can work with both monocular and stereo cameras and is equipped with a pose refinement module to minimize reprojection errors. However, its main drawback is the reliance on short-term data association and the inability to perform loop detection and global optimization. LSD-SLAM is another influential VSLAM method proposed by Engel et al. in 2014, incorporating tracking, depth estimation, and map optimization. This method can reconstruct large-scale maps using its pose graph estimation module and features global optimization and loop detection capabilities. LSD-SLAM's weakness lies in its initialization phase, which requires all points in a plane, making it a computationally intensive method. Mur-Artal et al. introduced two precise indirect VSLAM methods that have received significant attention: ORB-SLAM and ORB-SLAM 2.0. These methods can complete localization and mapping in well-textured sequences and achieve high-performance place recognition using Oriented FAST and Rotated BRIEF (ORB) features. The first version of ORB-SLAM computed camera positions and environmental structure using keyframes collected along the camera trajectory. The second version extends ORB-SLAM with three parallel threads: tracking for finding feature correspondences, local mapping for map management operations, and loop closing for detecting new loops and correcting drift errors. Although ORB-SLAM 2.0 can work with monocular and stereo cameras, in its monocular configuration it reconstructs maps with unknown scale, which limits its use for autonomous navigation. Another drawback of this method is its inability to work in textureless areas or environments with repetitive patterns. The latest version of this framework, ORB-SLAM 3.0, was proposed in 2021. It is compatible with various camera types, such as monocular, RGB-D, and stereo vision, and provides improved pose estimation outputs.
In recent years, with the rapid development of deep learning, CNN-based methods have addressed many issues by providing higher recognition and matching rates. Likewise, replacing hand-crafted features with learned features has been proposed as a solution in many recent deep learning-based methods. In this regard, Tateno et al. proposed a CNN-based method, named CNN-SLAM, that processes input frames for camera pose estimation and uses keyframes for depth prediction. One of the core ideas of CNN-SLAM for achieving parallel processing and real-time performance is to segment camera frames into smaller parts for better environmental understanding. Engel et al. also introduced Direct Sparse Odometry (DSO), which combines a direct method with sparse reconstruction by selecting points with high intensity gradients within image blocks.
In summary, the milestones in the evolution of VSLAM systems indicate that recent methods focus on the parallel execution of multiple dedicated modules. These modules form a general framework compatible with a wide range of sensors and environments, enabling real-time operation and greater flexibility for performance improvements.
Related Reviews
There have been numerous reviews in the VSLAM field that comprehensively survey existing methods, each summarizing the main advantages and disadvantages of VSLAM approaches. Macario Barros et al. categorized methods into three categories: purely visual (monocular), visual-inertial (stereo), and RGB-D. They also proposed various criteria for simplifying the analysis of VSLAM algorithms; however, they did not include other visual sensors, such as event-based sensors. Chen et al. surveyed a wide range of traditional and semantic VSLAM methods. They divided the SLAM development era into classical, algorithm-analysis, and robust-perception stages, summarizing classical frameworks that adopt direct/indirect methods and studying the impact of deep learning algorithms on semantic segmentation. Although their work provides a comprehensive study of advanced solutions in the field, the classification of methods is limited to the types of features used in feature-based VSLAM. Jia et al. surveyed a large volume of literature and briefly compared graph optimization-based methods with those equipped with deep learning. In another work, Abaspur Kazerouni et al. covered various VSLAM methods in terms of sensory devices, datasets, and modules, and simulated several indirect methods for comparison and analysis, but they only analyzed feature-based algorithms, such as HOG, Scale-Invariant Feature Transform (SIFT), Speeded-Up Robust Features (SURF), and deep learning-based solutions. Bavle et al. analyzed the situational awareness aspects of various SLAM and VSLAM applications and discussed their missing points. There are also other reviews, such as [15], [32], [36], [37], which are not elaborated here.
Unlike the aforementioned reviews, this paper comprehensively surveys VSLAM systems across different scenarios, with the main contributions as follows:
- Classification of various recent VSLAM methods, involving the main contributions, criteria, and objectives of researchers in proposing new solutions;
- Analysis of current trends in VSLAM systems through an in-depth exploration of different aspects of various methods;
- Introduction of the potential contributions of VSLAM to researchers.
VSLAM Setup Criteria
Considering the various VSLAM methods, the paper groups the available setups and configurations into the following categories: sensors and data acquisition, target environment, visual feature processing, system evaluation, and semantic level, introduced one by one below.
Sensors and Data Acquisition
The early VSLAM algorithm introduced by Davison used a monocular camera for trajectory recovery. Monocular cameras are the most common visual sensors for various tasks, such as object detection and tracking. Stereo cameras, on the other hand, contain two or more image sensors, enabling them to perceive depth and thus achieve more accurate performance in VSLAM applications. Such camera setups are cost-effective and suit applications with higher accuracy requirements. RGB-D cameras are also visual sensors used in VSLAM, providing both depth and color information of the scene. The aforementioned visual sensors can provide rich environmental information under suitable conditions, such as appropriate lighting and moderate motion speed, but they often struggle in low-illumination or high dynamic range scenarios.
In recent years, event cameras have also been used in various VSLAM applications. These low-latency biomimetic visual sensors report pixel-level brightness changes as they occur, rather than outputting standard intensity frames, allowing high dynamic range output without motion blur. Compared to standard cameras, event sensors can provide reliable visual information during high-speed motion and in large dynamic range scenes, but they may not provide sufficient information when motion is slow. Moreover, event cameras output asynchronous information about the environment, which makes it difficult for traditional visual algorithms to process their output directly. Additionally, using spatiotemporal windows of events together with data from other sensors can provide rich information for pose estimation and tracking.
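To make the asynchronous output concrete, a common workaround is to accumulate the events inside a short spatiotemporal window into a frame-like image that conventional feature pipelines can consume. The following is a minimal sketch; the array layout and window length are our own assumptions and are not tied to any specific camera driver:

```python
import numpy as np

def events_to_frame(events, height, width, t_start, window=0.01):
    """Accumulate polarity-signed events within a time window into a 2D image.

    events: (N, 4) array of [x, y, timestamp, polarity], polarity in {-1, +1}.
    Returns an image where each pixel sums the polarities of its events.
    """
    frame = np.zeros((height, width), dtype=np.float32)
    t_end = t_start + window
    mask = (events[:, 2] >= t_start) & (events[:, 2] < t_end)
    for x, y, _, p in events[mask]:
        frame[int(y), int(x)] += p
    return frame

# Toy usage: three synthetic events, two of which fall inside the window
ev = np.array([[10, 20, 0.001, +1], [10, 20, 0.002, +1], [5, 5, 0.5, -1]])
img = events_to_frame(ev, height=64, width=64, t_start=0.0)
print(img[20, 10])  # 2.0 -> two positive events accumulated at (x=10, y=20)
```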
Moreover, some methods utilize multi-camera setups to address common issues encountered in real environments and to improve localization accuracy. Using multiple cameras helps tackle complex problems such as occlusion, camouflage, sensor failure, or sparse trackable texture by providing overlapping views across the cameras. Although multi-camera setups can resolve some data acquisition issues, purely visual VSLAM still faces various challenges, such as motion blur with fast-moving targets, feature mismatches under low or high illumination, and ignoring dynamic targets in rapidly changing scenes. Therefore, some VSLAM applications equip multiple sensors alongside the camera. Integrating events and standard frames, or incorporating other sensors such as LiDAR and IMUs into VSLAM, are some existing solutions.
Target Environment
As a strong assumption in many traditional VSLAM practices, robots operate in static worlds without sudden or unexpected changes. Therefore, although many systems can be successfully applied in specific environments, some unexpected changes in the environment (e.g., the presence of moving targets) can complicate the system and significantly degrade state estimation quality. Systems working in dynamic environments often employ algorithms such as optical flow or Random Sample Consensus (RANSAC) to detect movements in the scene, classifying moving targets as outliers and skipping them during map reconstruction. Such systems leverage geometric/semantic information or attempt to improve localization schemes by combining both results.
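As an illustration of that idea, the sketch below (file names and thresholds are placeholders) tracks sparse corners between two frames with Lucas-Kanade optical flow and uses a RANSAC-estimated fundamental matrix to flag correspondences that violate the dominant epipolar geometry as likely dynamic points:

```python
import cv2
import numpy as np

prev_img = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)  # placeholder paths
curr_img = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Detect corners in the previous frame and track them into the current frame (KLT)
p0 = cv2.goodFeaturesToTrack(prev_img, maxCorners=500, qualityLevel=0.01, minDistance=7)
p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_img, curr_img, p0, None)

good0 = p0[status.ravel() == 1].reshape(-1, 2)
good1 = p1[status.ravel() == 1].reshape(-1, 2)

# RANSAC fundamental-matrix fit: inliers follow the dominant (static) motion,
# outliers are treated as candidate dynamic points and skipped during mapping.
F, inlier_mask = cv2.findFundamentalMat(good0, good1, cv2.FM_RANSAC, 1.0, 0.99)
static_pts = good1[inlier_mask.ravel() == 1]
dynamic_pts = good1[inlier_mask.ravel() == 0]
print(f"{len(static_pts)} static candidates, {len(dynamic_pts)} dynamic candidates")
```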
Additionally, as a general classification, the paper divides environments into indoor and outdoor categories. Outdoor environments can be urban areas with structured landmarks and large-scale motion changes (e.g., building and road textures) or off-road areas with weak or unreliable landmarks (e.g., moving clouds and vegetation, sandy textures, etc.), which increase the risk of localization and loop detection failures. Indoor environments, on the other hand, contain scenes with entirely different global spatial properties, such as corridors, walls, and rooms. The paper posits that while VSLAM systems may perform well in one of these areas, they may not exhibit the same performance in the other.
Visual Feature Processing
As previously mentioned, detecting visual features and utilizing feature descriptor information for pose estimation is a necessary stage in indirect VSLAM methods. These methods use various feature extraction algorithms to better understand the environment and track feature points across consecutive frames. There are many feature extraction algorithms, including SIFT, SURF, FAST, BRIEF, ORB, etc. Among them, ORB features have the advantage of fast extraction and matching without significantly sacrificing accuracy compared to SIFT and SURF.
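As a concrete reference for the indirect pipeline, the snippet below (image paths are placeholders) extracts ORB keypoints in two consecutive frames with OpenCV and matches their binary descriptors using Hamming distance; this is the kind of correspondence later fed to pose estimation:

```python
import cv2

img1 = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)    # placeholder paths
img2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=1000)            # FAST keypoints + rotated BRIEF descriptors
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance (appropriate for binary descriptors)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
print(f"{len(matches)} matches, best distance = {matches[0].distance}")
```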
Some of the methods mentioned above struggle to adapt effectively to complex and unforeseen situations. Consequently, many researchers utilize CNNs to extract image features for tasks including VO, pose estimation, and loop detection. Depending on how they are trained and used, these techniques can form supervised or unsupervised frameworks.
System Evaluation
While some VSLAM methods, particularly those designed to function in dynamic and challenging environments, are tested in real-world scenarios, many research works have relied on public datasets to demonstrate their applicability. In this regard, the RAWSEEDS dataset by Bonarini et al. is a well-known multi-sensor benchmark containing indoor, outdoor, and mixed robotic trajectories with ground-truth data; it is one of the oldest public benchmarks for robotics and SLAM purposes. The SceneNet RGB-D dataset by McCormac et al. is another popular dataset for scene understanding problems, such as semantic segmentation and object detection, containing 5 million large-scale rendered RGB-D images. Many recent works in the VSLAM and VO fields have tested their methods on the TUM RGB-D dataset. Additionally, the NTU VIRAL dataset by Nguyen et al. contains data collected by drones equipped with 3D LiDAR, cameras, IMUs, and multiple ultra-wideband (UWB) sensors; it includes indoor and outdoor instances and aims to evaluate autonomous aerial operation. Other datasets, such as EuRoC MAV, OpenLORIS-Scene, KITTI, TartanAir, ICL-NUIM, and event camera-based datasets, can be found in the related papers.
Depending on the sensor setup, application, and target environment, the aforementioned datasets are used by various VSLAM methods. These datasets mainly provide the intrinsic and extrinsic parameters of the cameras along with ground truth (GT). Table I and Figure 3 summarize the characteristics of the datasets and present some instances of each.
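Most of these benchmarks are evaluated by comparing estimated trajectories against GT, typically via the absolute trajectory error (ATE). The sketch below, assuming the estimated and ground-truth positions are already time-associated, aligns the two trajectories with a rigid transform (Horn's method) and reports the RMSE:

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute trajectory error after rigid (rotation + translation) alignment.

    est, gt: (N, 3) arrays of time-associated camera positions.
    """
    est_c = est - est.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    # Horn's method: optimal rotation from the SVD of the correlation matrix
    U, _, Vt = np.linalg.svd(est_c.T @ gt_c)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = gt.mean(axis=0) - R @ est.mean(axis=0)
    aligned = (R @ est.T).T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

# Toy check: a rotated and translated copy of the same path should give ~0 error
gt = np.random.rand(100, 3)
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0], [np.sin(theta), np.cos(theta), 0], [0, 0, 1]])
est = (Rz @ gt.T).T + np.array([1.0, -2.0, 0.5])
print(ate_rmse(est, gt))  # approximately 0
```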


Semantic Level
Robots require semantic information to understand the surrounding scene and make better decisions. In many recent VSLAM works, adding semantic-level information to geometry-based data has outperformed purely geometric methods, enabling them to provide conceptual knowledge of the environment. In this regard, pre-trained object recognition modules can incorporate semantic information into VSLAM models. One of the latest methods is the use of CNNs in VSLAM applications. Generally, semantic VSLAM methods consist of the following four main components:
- Tracking Module: uses 2D feature points extracted from consecutive video frames to estimate camera poses and build 3D map points. The computation of camera poses and the construction of 3D map points form the baseline of the localization and mapping processes;
- Local Mapping Module: creates new 3D map points by processing two consecutive video frames, which are used together with the BA module to refine camera poses;
- Loop Closure Module: adjusts camera poses and optimizes the constructed map by comparing keyframes through their extracted visual features and assessing their similarity;
- Non-Rigid Context Culling (NRCC): the main goal of NRCC is to filter out temporally unstable targets from video frames to reduce their adverse impact on localization and mapping. It mainly involves a segmentation step that masks out unstable instances in the frames, such as people. Since NRCC reduces the number of feature points to be processed, it simplifies computation and yields more robust performance.
Therefore, leveraging semantic information in VSLAM methods can reduce the uncertainty in pose estimation and map reconstruction. However, the current challenge is how to correctly utilize the extracted semantic information without incurring prohibitive computational costs.
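To illustrate the NRCC idea at the code level, the sketch below removes ORB keypoints that fall inside a segmentation mask of potentially dynamic classes such as people, before the remaining points are passed to tracking. The mask format and class id are our own assumptions and are not taken from any specific framework:

```python
import cv2
import numpy as np

def filter_dynamic_keypoints(keypoints, descriptors, seg_mask, dynamic_ids=(15,)):
    """Drop keypoints lying on pixels labeled with a dynamic class.

    seg_mask: (H, W) integer label image from any semantic segmentation network.
    dynamic_ids: label values considered non-rigid (15 = 'person' in some label maps;
                 purely an assumption here).
    """
    keep_kp, keep_desc = [], []
    for kp, desc in zip(keypoints, descriptors):
        x = min(int(round(kp.pt[0])), seg_mask.shape[1] - 1)
        y = min(int(round(kp.pt[1])), seg_mask.shape[0] - 1)
        if seg_mask[y, x] not in dynamic_ids:
            keep_kp.append(kp)
            keep_desc.append(desc)
    return keep_kp, np.array(keep_desc)

# Usage with a placeholder image and an all-static (zero) mask
img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)          # placeholder path
orb = cv2.ORB_create()
kps, descs = orb.detectAndCompute(img, None)
mask = np.zeros(img.shape, dtype=np.int32)                    # stand-in segmentation output
static_kps, static_descs = filter_dynamic_keypoints(kps, descs, mask)
```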
VSLAM Methods Based on Main Objectives
Objective 1: Multi-Sensor Processing
This category encompasses VSLAM methods that utilize multiple sensors to better understand the environment. While some techniques rely solely on cameras, others combine various sensors to enhance accuracy.
1) Using Multi-Cameras
Reconstructing the 3D trajectory of moving objects with a single camera can be challenging, prompting some researchers to suggest using multiple cameras. For example, CoSLAM is a VSLAM system introduced by Zou and Tan that employs several independently moving cameras deployed on different platforms to robustly reconstruct maps. CoSLAM combines cameras that move independently in dynamic environments and reconstructs maps based on their overlapping fields of view. The process simplifies the reconstruction of dynamic points in 3D by combining intra-camera and inter-camera pose estimation and mapping. CoSLAM employs the Kanade-Lucas-Tomasi (KLT) algorithm to track visual features and operates in both static and dynamic environments, including indoor and outdoor settings where relative positions and orientations may change over time. The main drawback of this method is the requirement for complex hardware to process the large volume of camera outputs, with computational costs increasing as more cameras are added.
For challenging outdoor scenarios, Yang et al. developed a collaborative panoramic visual VSLAM method using multiple cameras. This method endows each camera with independence to enhance the VSLAM system’s performance in challenging scenes, such as occlusion and sparse textures. To determine matching ranges, they extract ORB features from the overlapping fields of view of the cameras. Additionally, they employed CNN-based deep learning techniques to identify similar features for loop detection. In their experiments, the authors utilized datasets generated by panoramic cameras and integrated navigation systems. Other related works include MultiCol-SLAM.
2) Using Multi-Sensors
Other methods suggest fusing multiple sensors, using outputs from vision and inertial sensors, for better performance. In this regard, Zhu et al. proposed a low-cost, LiDAR-assisted indirect VSLAM called CamVox, demonstrating reliable performance and accuracy. Their method combines Livox LiDAR, used as an advanced depth sensor, with camera outputs as RGB-D-style input to ORB-SLAM 2.0. The authors use an IMU to synchronize and correct the non-repetitive scanning positions. The main contribution of CamVox is an autonomous LiDAR-camera calibration method for operation in uncontrolled environments. Empirical tests on robotic platforms showed that CamVox can run in real-time.
VIRAL (Visual-Inertial-Range-LiDAR) SLAM, proposed in [67], is a multimodal system that tightly couples cameras, LiDAR, IMUs, and UWB sensors. It introduces a scheme in which visual features are matched against local maps constructed from LiDAR point clouds, together with marginalization of older states. The BRIEF algorithm is used to extract and track visual features. The framework also includes synchronization and triggering schemes for the sensors used. The authors tested their method on the NTU VIRAL dataset, which contains data captured by cameras, LiDAR, IMUs, and UWB sensors. However, due to the synchronization, multithreading, and sensor conflict resolution involved, the method has high computational demands. Other related algorithms include Ultimate SLAM.
Objective 2: Pose Estimation
This class of methods focuses on how to improve pose estimation in VSLAM methods using various algorithms.
1) Using Line/Point Data
In this regard, Zhou et al. suggested using structural line segments as useful features to determine camera poses. Structural lines are associated with dominant directions and encode global directional information, thus improving predicted trajectories. The method is named StructSLAM, a 6-DoF VSLAM technique capable of operating under low-feature and featureless conditions.
Point and Line SLAM (PL-SLAM) is a VSLAM system based on ORB-SLAM, optimized for non-dynamic, low-texture scenes, proposed by Pumarola et al. This system fuses line and point features simultaneously to improve pose estimation and to assist operation in scenes with few feature points. The authors tested PL-SLAM on self-generated datasets and on TUM RGB-D. The drawbacks of their method include its computational cost and the need to utilize other geometric primitives (e.g., planes) to achieve more robust accuracy.
Gomez-Ojeda et al. introduced PL-SLAM (distinct from the similarly named framework by Pumarola et al.), an indirect VSLAM technique that uses points and lines from stereo cameras to reconstruct maps of unseen environments. In their method, segments obtained from points and lines are merged across all VSLAM modules with the visual information acquired from consecutive frames. Using ORB and the Line Segment Detector (LSD) algorithm, PL-SLAM retrieves and tracks points and line segments in subsequent stereo frames. The authors tested PL-SLAM on the EuRoC and KITTI datasets, where it can outperform the stereo version of ORB-SLAM 2.0. The main drawbacks of PL-SLAM are the computational time required by the feature tracking module and the need to consider all structural lines to extract information about the environment. Other related algorithms can be found in the papers.
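For reference, point-and-line pipelines like these typically combine an ORB detector with a line segment detector. A minimal OpenCV sketch is shown below; note that cv2.createLineSegmentDetector was disabled in some OpenCV releases for licensing reasons, so its availability depends on the installed version, and the image path is a placeholder:

```python
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

# Point features (ORB) plus line segments (LSD), as used by point/line VSLAM systems
orb = cv2.ORB_create()
keypoints, descriptors = orb.detectAndCompute(img, None)

lsd = cv2.createLineSegmentDetector()                  # may be unavailable in some builds
lines, widths, precisions, nfa = lsd.detect(img)
print(f"{len(keypoints)} ORB keypoints, {0 if lines is None else len(lines)} line segments")
```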
2) Using Additional Features
[74] proposed Dual Quaternion Visual SLAM (DQV-SLAM), a framework for stereo cameras that applies a Bayesian framework to 6-DoF pose estimation. To avoid linearizing the nonlinear group of spatial transformations, their method uses progressive Bayesian updates. DQV-SLAM relies on ORB features, together with optical flow over the map point cloud, to achieve reliable data association in dynamic environments. The method yields reliable predictions on the KITTI and EuRoC datasets. However, it lacks a probabilistic interpretation of its stochastic pose modeling and has high computational demands for sampling-based filtering. Other related algorithms can be found in the papers.
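For readers unfamiliar with the representation, a unit dual quaternion encodes a full 6-DoF pose in a single algebraic object. Using standard notation (ours, not taken from [74]), a rotation quaternion $q_r$ and a translation vector $\mathbf{t}$ combine as

$$\hat{q} = q_r + \frac{\epsilon}{2}\, t \otimes q_r, \qquad \epsilon^2 = 0,$$

where $t = (0, \mathbf{t})$ is the translation written as a pure quaternion and $\otimes$ denotes quaternion multiplication; filtering directly on this manifold avoids linearizing the nonlinear transformation group, which is the motivation cited above.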
3) Deep Learning
Bruno and Colombini proposed LIFT-SLAM, which combines deep learning-based feature descriptors with traditional geometry-based systems. It extends the pipeline of the ORB-SLAM system, using CNNs to extract features from images and providing denser and more accurate matches based on learned features. For detection, description, and orientation estimation, LIFT-SLAM fine-tunes the Learning Invariant Feature Transform (LIFT) deep neural network. Research conducted using indoor and outdoor instances from the KITTI and EuRoC MAV datasets indicates that LIFT-SLAM outperforms traditional feature-based and deep learning-based VSLAM systems in accuracy. However, the method’s drawbacks include its computationally intensive pipeline and unoptimized CNN design.
Naveed et al. proposed a deep learning-based VSLAM solution with reliable and consistent modules, even on extremely curved routes. Their method outperformed several VSLAM systems by utilizing a deep reinforcement learning network trained in a realistic simulator. It also provides a baseline for active VSLAM evaluation and generalizes well to real indoor and outdoor environments. The network's path planner generates ideal path data, which is passed to the underlying ORB-SLAM system. For evaluation, the authors created a dataset containing actual navigation episodes in challenging and textureless environments. Other methods can be found in the papers.
Objective 3: Real-World Feasibility
This class of methods aims to operate across various environments and scenarios. The paper notes that the works cited in this section rely heavily on semantic information extracted from the environment and represent end-to-end VSLAM applications.
1) Dynamic Environments
In this regard, Yu et al. introduced a VSLAM system called DS-SLAM, which can be used in dynamic contexts and provides semantic-level information for map construction. This system is based on ORB-SLAM 2.0 and includes five threads: tracking, semantic segmentation, local mapping, loop closure, and dense semantic map construction. To exclude dynamic targets and enhance localization accuracy before pose estimation, DS-SLAM combines an optical flow algorithm with the real-time semantic segmentation network SegNet. DS-SLAM has been tested in real-world environments with RGB-D cameras and on the TUM RGB-D dataset. However, despite its high localization accuracy, it still faces limitations from its semantic segmentation and computationally intensive features.
Semantic Optical Flow SLAM (SOF-SLAM) is an indirect VSLAM system built on the RGB-D mode of ORB-SLAM 2.0. The method utilizes a semantic optical flow dynamic feature detection module, which extracts and discards dynamic features hidden in the semantic and geometric information provided by ORB feature extraction. To provide accurate camera poses and environmental information, SOF-SLAM employs a pixel-level semantic segmentation module based on SegNet. Experimental results under highly dynamic conditions on the TUM RGB-D dataset and in real environments indicate that SOF-SLAM outperforms ORB-SLAM 2.0. However, SOF-SLAM's drawbacks include the ineffective identification of non-static features and its reliance on only two consecutive frames. Other related algorithms can be found in the papers.
2) Deep Learning-Based Solutions
In another work, Li et al. developed a deep learning approach named DXSLAM to detect keypoints similar to SuperPoint and to generate general-purpose keypoint and image descriptors. They used an advanced CNN, HF-Net, to generate frame-level and keypoint descriptors by extracting local and global information from each frame. They also trained a visual vocabulary of local features using an offline Bag of Words (BoW) method to achieve precise loop recognition. DXSLAM runs in real-time without a GPU and is compatible with contemporary CPUs. Although not specifically designed for this, it exhibits strong robustness to changes in dynamic environments. DXSLAM has been tested on the TUM RGB-D and OpenLORIS-Scene datasets as well as indoor and outdoor images, achieving more accurate results than ORB-SLAM 2.0 and DS-SLAM. However, the main drawbacks of this method are its complex feature extraction architecture and the integration of deep features into an older SLAM framework.
In another approach, Li et al. developed a real-time VSLAM technique for extracting feature points with deep learning in complex situations. This method runs on GPUs and supports the creation of 3D dense maps, using a multi-task, self-supervised CNN for feature extraction. The CNN outputs fixed-length 256-bit binary descriptors, allowing it to replace more traditional feature point detectors such as ORB. The system comprises three threads for reliable and timely performance in dynamic scenes: tracking, local mapping, and loop closure, and it supports both monocular and RGB-D cameras with ORB-SLAM 2.0 as the baseline. Other related algorithms can be found in the papers.
3) Using Artificial Landmarks
A technique proposed by Medina-Carnicer et al., called UcoSLAM, combines natural and artificial landmarks to automatically compute the scale of the surrounding environment, outperforming traditional VSLAM systems. UcoSLAM's main motivation is to counteract the instability, repetitiveness, and poor tracking quality of natural landmarks. It can operate in environments without markers or without features, as it can function in keypoint-only, landmark-only, or mixed modes. UcoSLAM has a tracking module to locate map correspondences, optimize reprojection errors, and re-localize after tracking failures. Additionally, it features a landmark-based loop detection system and can use any descriptor, including ORB and FAST, to describe features. Despite its many advantages, UcoSLAM runs in multiple threads, which makes it a time-consuming method.
4) Broad Settings
Another VSLAM strategy for dynamic indoor and outdoor environments is DMS-SLAM, which supports monocular, stereo, and RGB-D visual sensors. This system employs sliding window and Grid-based Motion Statistics (GMS) feature matching methods to find static feature locations. DMS-SLAM is based on the ORB-SLAM 2.0 system, tracking static features identified by the ORB algorithm. The authors tested their proposed method on the TUM RGB-D and KITTI datasets, outperforming advanced VSLAM algorithms. Additionally, by removing feature points on dynamic targets during the tracking step, DMS-SLAM performs faster than the original ORB-SLAM 2.0. However, DMS-SLAM encounters difficulties in textureless, fast motion, and highly dynamic environments.
Objective 4: Resource Constraints
Another class of VSLAM methods is designed for devices with limited computational resources compared to standard platforms. For instance, VSLAM systems designed for mobile devices and robots with embedded systems fall into this category.
1) Devices with Limited Processing Power
In this regard, edgeSLAM is a real-time, edge-assisted semantic VSLAM system proposed by Xu et al. for mobile and resource-constrained devices. It employs a series of fine-grained modules shared between edge servers and the associated mobile devices, without requiring multithreading. edgeSLAM also includes a semantic segmentation module based on Mask R-CNN to improve segmentation and object tracking. The authors connected several commercial mobile devices, such as phones and development boards, to an edge server. By reusing object segmentation results, they adapt the system parameters to different network bandwidth and latency conditions to avoid redundant processing. edgeSLAM has been evaluated on monocular instances from the TUM RGB-D and KITTI datasets as well as datasets created for the experimental setup.
For stereo camera setups, Grisetti et al. proposed a lightweight feature-based VSLAM framework named ProSLAM, whose results are comparable to advanced techniques. Their method consists of four modules: a triangulation module that creates 3D points and the associated feature descriptors; an incremental motion estimation module that processes two frames to determine the current position; a map management module that creates local maps; and a re-localization module that updates the world map based on local maps. ProSLAM retrieves the 3D positions of points in a single thread and relies on a small set of well-known libraries to keep the system simple. Experiments on the KITTI and EuRoC datasets indicate that the method achieves robust results; however, it exhibits deficiencies in rotation estimation and lacks a bundle adjustment module. Other related algorithms can be found in the papers.
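As a sketch of the triangulation module described above (the projection matrices and pixel coordinates here are illustrative placeholders), OpenCV can recover 3D points from matched pixel coordinates in a rectified stereo pair:

```python
import cv2
import numpy as np

# Projection matrices of a rectified stereo pair: P = K [R | t]
# (placeholder intrinsics; 0.5 m baseline along x for the right camera)
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
P_left = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_right = K @ np.hstack([np.eye(3), np.array([[-0.5], [0.0], [0.0]])])

# Matched pixel coordinates in the left/right images (2 x N arrays), e.g., from ORB matching
pts_left = np.array([[300.0, 350.0], [200.0, 260.0]])
pts_right = np.array([[290.0, 342.0], [200.0, 260.0]])

points_4d = cv2.triangulatePoints(P_left, P_right, pts_left, pts_right)
points_3d = (points_4d[:3] / points_4d[3]).T   # convert from homogeneous coordinates
print(points_3d)   # one 3D point per matched pair, in the left camera frame
```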
2) Computational Migration
Ben Ali et al. suggest using edge computing to offload resource-intensive operations to the edge, reducing the computational burden on robots. In their indirect framework, Edge-SLAM, they modified the architecture of ORB-SLAM 2.0, keeping the tracking module on the robot while delegating the remaining parts to the edge. By splitting the VSLAM pipeline between the robot and edge devices, the system maintains both local and global maps and can still operate correctly under limited resources without sacrificing accuracy. However, one drawback of this method is the architectural complexity introduced by decoupling the various SLAM modules. Another issue is that the system has only been demonstrated in short-term setups, and using Edge-SLAM in long-term scenarios (e.g., over several days) may lead to performance degradation.
Objective 5: Versatility
Work in this category focuses on the ease of development, use, adaptation, and extension of VSLAM.
In this regard, Sumikura et al. introduced OpenVSLAM, a highly adaptable open-source VSLAM framework designed for rapid development and for use by third-party programs. Their feature-based method is compatible with various camera types, including monocular, stereo, and RGB-D, and can store and reuse reconstructed maps. Thanks to its robust ORB feature extraction module, OpenVSLAM outperforms ORB-SLAM and ORB-SLAM 2.0 in tracking accuracy and efficiency. However, due to concerns that code similarity might infringe on ORB-SLAM 2.0's rights, public distribution of the system's source code has been halted.
To bridge the gap between real-time capability, accuracy, and versatility, Ferrera et al. developed OV2SLAM, which is suitable for both monocular and stereo cameras. By limiting feature extraction to keyframes and tracking those features in subsequent frames while minimizing photometric error, their method reduces computational load. In this sense, OV2SLAM is a hybrid strategy that combines the advantages of direct and indirect VSLAM methods. In indoor and outdoor experiments on well-known benchmark datasets, including EuRoC, KITTI, and TartanAir, OV2SLAM has been shown to outperform several popular techniques. Other related algorithms can be found in the papers.
Objective 6: Visual Odometry
This class of methods aims to determine the robot’s position and orientation with the highest possible accuracy.
1) Deep Neural Networks
In this regard, the Dynamic-SLAM framework proposed in [100] utilizes deep learning for accurate pose prediction and appropriate environmental understanding. As part of a semantic-level module to optimize VO, the authors use CNNs to identify moving targets in the environment, helping to reduce the pose estimation errors caused by incorrect feature matching. Additionally, Dynamic-SLAM employs a selective tracking module to ignore dynamic regions in the scene and a missed-detection compensation algorithm based on speed invariance across adjacent frames. Although the results are promising, the system incurs significant computational costs, and the limited number of defined semantic classes creates a risk of misclassifying dynamic and static targets.
Bloesch et al. proposed CodeSLAM, which provides a condensed and dense representation of scene geometry. Their VSLAM system builds on PTAM and works solely with monocular cameras. It encodes intensity images into convolutional features and feeds them into a deep autoencoder trained on intensity images from the SceneNet RGB-D dataset. Experimental results on the EuRoC dataset show promising accuracy and performance. Other related algorithms can be found in the papers.
2) Processing of Deep Adjacent Frames
In another work, the authors developed a real-time dense SLAM method for RGB-D cameras that improves on their previous method by minimizing both photometric and geometric error between two frames for camera motion estimation. They extended the keyframe-based PoseSLAM, which retains only non-redundant poses to generate compact maps, adding dense visual odometry features and efficiently exploiting the information in camera frames for reliable camera motion estimation. The authors also employed an entropy-based technique to measure keyframe similarity for loop detection and drift avoidance. However, the quality of loop detection and keyframe selection in their method still needs improvement.
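The photometric term minimized in such dense RGB-D odometry can be written (notation ours) as

$$E(\xi) = \sum_{\mathbf{x}} \Big( I_2\!\big(\pi\!\left(T(\xi)\, \pi^{-1}(\mathbf{x}, D_1(\mathbf{x}))\right)\big) - I_1(\mathbf{x}) \Big)^2,$$

where $I_1, I_2$ are consecutive intensity images, $D_1$ the depth map of the first frame, $\pi$ and $\pi^{-1}$ the projection and back-projection functions, and $T(\xi)$ the rigid motion parameterized by $\xi \in \mathfrak{se}(3)$; the geometric term compares back-projected depths in the same way, and both are minimized jointly over $\xi$.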
In another work, introduced by Li et al., a feature-based VSLAM method called DP-SLAM achieves real-time removal of dynamic targets. The method employs a Bayesian probability propagation model based on the likelihood that keypoints originate from moving targets. Using a moving-probability propagation algorithm with iterative probability updates, DP-SLAM combines geometric constraints with semantic information to cope with dynamic scenes. It integrates with ORB-SLAM 2.0 and has been tested on the TUM RGB-D dataset. Despite accurate results, the system is limited to sparse VSLAM because of the iterative probability update module and faces high computational costs. Other related algorithms can be found in the papers.
3) Various Feature Processing
Another method in this category is TextSLAM, a text-based VSLAM system proposed by Li et al. It integrates text objects retrieved from scenes using FAST corner detection into the SLAM pipeline. Text carries rich textures, patterns, and semantics, which makes it effective for creating high-quality 3D text maps. TextSLAM uses text as reliable visual fiducial markers, parameterizing text after it is first detected and then projecting the 3D text objects onto the target image for re-localization. The authors also propose a three-parameter representation for instantly initializing text features. Experiments conducted in indoor and outdoor environments using a monocular camera and a dataset created by the authors yielded highly accurate results. The three main challenges of TextSLAM are operating in text-free environments, interpreting short text strings, and the need to store a large text dictionary. Other related algorithms can be found in the papers.
Identifying Current Trends
Statistics

Based on the categorization of the surveyed papers, Figure 4 visualizes the processed data to identify current trends in VSLAM. Subplot "a" shows that most proposed VSLAM systems are standalone applications that perform the entire localization and mapping process from scratch using visual sensors. While ORB-SLAM 2.0 and ORB-SLAM serve as foundational platforms for constructing new frameworks, very few methods are built on other VSLAM systems such as PTAM and PoseSLAM. Regarding the objectives of VSLAM applications, subplot "b" indicates that the most important goal is to improve the visual odometry module; thus, most recent VSLAM efforts aim to address current algorithms' shortcomings in determining the robot's position and orientation. Pose estimation and real-world feasibility are further fundamental goals of new VSLAM papers. Regarding the datasets used for evaluation in the surveyed papers, subplot "c" illustrates that most works were tested on the TUM RGB-D dataset. Additionally, many researchers tend to experiment on self-generated datasets; we can assume that the primary motivation is to showcase how VSLAM methods work in real scenarios and whether they can be used as end-to-end applications. EuRoC MAV and KITTI are the next most popular evaluation datasets in VSLAM work. Another interesting observation, extracted from subplot "d", concerns the impact of using semantic data in VSLAM systems: most papers do not include semantic data when handling environments. The paper hypothesizes that the reasons for not using semantic data are:
- In many cases, the computational cost of training models to recognize targets and use them for semantic segmentation is quite high, which may increase processing time;
- Most geometry-based VSLAM algorithms are designed to work on plug-and-play devices, allowing them to localize and map using camera data with minimal effort;
- Incorrect information extracted from the scene can also introduce more noise into the process.
When considering the environment, subplot "e" shows that over half of the methods can also function in challenging dynamic environments, while the remaining systems focus solely on environments without dynamic changes. Furthermore, in subplot "f", most methods are applicable in indoor environments or in both indoor and outdoor environments, while the remaining papers were tested only under outdoor conditions. It should be noted that methods designed to work only under specific restrictive assumptions may not achieve the same accuracy in other settings; this is one of the main reasons why some methods focus solely on specific situations.
Analyzing Current Trends
This paper reviews state-of-the-art visual SLAM methods that have attracted significant attention and showcases their major contributions in the field. Despite extensive reliable solutions and improvements across various modules of VSLAM systems in recent years, there remain many high-potential areas and unresolved issues that require research, leading to the adoption of more robust methods in the future development of SLAM. Given the broad nature of visual SLAM methods, the paper presents the following open research directions:
Deep Learning: Deep neural networks have shown promising results in various applications, including VSLAM, making them an important trend across multiple research areas. Due to their learning capabilities, these architectures have demonstrated considerable potential as reliable feature extractors for addressing different issues in VO and loop detection. CNNs can assist VSLAM in precise target detection and semantic segmentation, and they may outperform traditional algorithms that extract and match manually designed features. It must be noted that, since deep learning-based methods are trained on datasets with a large variety of data but a limited set of target classes, there is always a risk of misclassifying dynamic points, leading to incorrect segmentation; this may result in lower segmentation accuracy and pose estimation errors.
Information Retrieval and Computational Cost Trade-offs: Generally, processing cost and the amount of information captured from the scene must be balanced. From this perspective, dense maps allow VSLAM applications to record high-dimensional, complete scene information, but doing so in real time imposes a heavy computational load. On the other hand, sparse representations have lower computational costs but may not capture all the necessary information. It should also be noted that real-time performance is directly related to the camera's frame rate: frames dropped during processing peaks negatively impact VSLAM system performance regardless of the algorithm's accuracy. Moreover, VSLAM typically uses tightly coupled modules, and modifying one module may adversely affect the others, making the balancing task even more challenging.
Semantic Segmentation: Providing semantic information while creating environmental maps can offer very useful information to robots. Identifying targets in the camera's field of view (e.g., doors, windows, people, etc.) is a hot topic in current and future VSLAM work, as semantic information can be utilized by pose estimation, trajectory planning, and loop detection modules. With the widespread use of target detection and tracking algorithms, semantic VSLAM will undoubtedly become one of the future solutions in this field.
Loop Detection Algorithms: One of the key issues in any SLAM system is drift, as well as the loss of feature tracks due to accumulated localization errors. In VSLAM systems, detecting drift and loops to identify previously visited locations incurs computational delays and high costs, mainly because the complexity of loop detection grows with the size of the reconstructed map. Moreover, combining map data collected from different locations and refining pose estimates is a very complex task. Therefore, optimizing and balancing the loop detection module leaves significant room for improvement. One common approach to loop detection is to improve image retrieval by training a visual vocabulary of local features and then aggregating them, as sketched below.
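As a toy illustration of that vocabulary-based retrieval (a minimal sketch using k-means clustering rather than the hierarchical vocabularies of real systems such as DBoW2), local descriptors are quantized into visual words and each keyframe is reduced to a normalized word histogram whose similarity can be compared cheaply:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, num_words=64):
    """Cluster local descriptors (e.g., ORB descriptors cast to float) into visual words."""
    return KMeans(n_clusters=num_words, n_init=10, random_state=0).fit(all_descriptors)

def bow_histogram(descriptors, vocab):
    """Quantize a keyframe's descriptors and return an L2-normalized word histogram."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
    return hist / (np.linalg.norm(hist) + 1e-12)

# Toy usage with random descriptors standing in for ORB features of three keyframes
rng = np.random.default_rng(0)
frames = [rng.random((500, 32)) for _ in range(3)]
vocab = build_vocabulary(np.vstack(frames))
h0, h1 = bow_histogram(frames[0], vocab), bow_histogram(frames[1], vocab)
print("similarity:", float(h0 @ h1))   # cosine similarity; a high score flags a loop candidate
```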
Working in Challenging Scenarios: Operating in textureless environments with few significant feature points often leads to drift errors in the robot’s position and orientation. As one of the main challenges of VSLAM, this error can result in system failures. Thus, considering complementary scene understanding methods, such as target detection or line features, in feature-based approaches will be a popular topic.
Conclusion
This paper presents a range of SLAM algorithms in which visual data collected from cameras plays a crucial role. The paper categorizes recent works based on various characteristics of VSLAM methods, such as experimental environments, novelty domains, objectives, detection and tracking algorithms, semantic level, performance, etc. The paper also reviews the key contributions of the related algorithms, the existing deficiencies and challenges based on the authors' claims, improvements planned for future versions, and issues addressed by other related methods. Another contribution of this paper is the discussion of current trends in VSLAM systems and the open questions that remain for researchers to pursue in further studies.
References
[1] Visual SLAM: What are the Current Trends and What to Expect?