This article is reproduced from: 3D Vision Workshop
Most existing visual SLAM methods rely heavily on the static world assumption, which can easily fail in dynamic environments. This paper proposes a real-time semantic RGB-D SLAM system in dynamic environments that can detect known and unknown moving objects. To reduce computational costs, it performs semantic segmentation only on keyframes to remove known dynamic objects while maintaining static mapping for robust camera tracking. In addition, the paper introduces an effective geometric module that clusters depth images into several regions and identifies dynamic areas based on their reprojection errors, thereby detecting unknown moving objects.
1 Introduction
Although many existing vSLAM systems perform well, most of these methods heavily rely on the static world assumption, which greatly limits their deployment in real-world scenarios.
Dynamic objects such as moving people, animals, and vehicles negatively impact pose estimation and map reconstruction. Although robust estimation techniques such as RANSAC can filter out some outliers, the improvement is limited: they can only handle slightly dynamic scenes and may still fail when moving objects cover a large part of the camera view.
With the latest advances in computer vision and deep learning, semantic information from the environment has been integrated into SLAM systems, such as extracting semantic information through semantic segmentation, predicting labels of detected objects, and generating masks. By identifying and removing potential dynamic targets, the performance of vSLAM in dynamic scenes can be significantly improved.
However, these methods still have two main issues:
- Powerful semantic segmentation networks are computationally expensive and unsuitable for real-time, small-scale robotic applications, while lightweight networks sacrifice segmentation accuracy, which in turn degrades tracking accuracy.
- They can only handle known objects that are labeled in the network's training set, and they may still fail when facing unknown moving objects.
To identify dynamic objects using semantic cues, most existing methods perform semantic segmentation on every new frame. This significantly slows camera tracking, since tracking must wait until segmentation completes.

The main contributions of this paper are as follows:
- A keyframe-based semantic RGB-D SLAM system that reduces the impact of moving objects in dynamic environments.
- An effective and efficient geometric module that handles unknown moving objects in conjunction with the semantic SLAM framework.
- Comparative experiments with state-of-the-art dynamic SLAM methods demonstrating the accuracy of the proposed method, which still runs in real time on embedded systems.
2 Algorithm Framework
The overall framework of the algorithm is shown in the figure below:

2.1 Semantic Module
Semantic segmentation predicts per-pixel labels and generates masks for detected objects in the input RGB image using deep learning. The semantic module adopts SegNet, a lightweight semantic segmentation network.
The segmentation network is pre-trained on the PASCAL VOC dataset, which contains 20 object categories. Among these, only highly mobile or potentially dynamic objects, such as people, cars, and bicycles, are processed: they are masked out of the segmented image, and the associated feature points are excluded from camera tracking and map construction.
Unlike most existing learning-based dynamic SLAM methods, this model performs semantic segmentation only when creating new keyframes, rather than performing semantic segmentation on each new frame. This significantly reduces the computational cost of the semantic module, helping to achieve real-time tracking of semantic information. Furthermore, this process is executed in a separate thread, so it does not significantly impact the overall tracking time.
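To make this concrete, the following is a minimal Python/NumPy sketch of the keyframe-only masking step. The PASCAL VOC class IDs and the `static_feature_mask` helper are illustrative assumptions, not the paper's actual code; in the real system this step runs in the separate segmentation thread whenever a new keyframe is created.

```python
import numpy as np

# Standard PASCAL VOC class IDs assumed here (e.g. bicycle=2, bus=6, car=7,
# dog=12, motorbike=14, person=15); only such "potentially dynamic" classes
# are masked out, all other labels are kept.
DYNAMIC_IDS = np.array([2, 6, 7, 12, 14, 15])

def static_feature_mask(label_map: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Given the HxW semantic label map of a keyframe and an (N, 2) array of
    (u, v) pixel coordinates, return True for features that do NOT fall on a
    potentially dynamic object and are therefore kept for tracking/mapping."""
    u = keypoints[:, 0].astype(int)
    v = keypoints[:, 1].astype(int)
    return ~np.isin(label_map[v, u], DYNAMIC_IDS)

# Toy usage: pretend SegNet segmented a person near the image centre.
label_map = np.zeros((480, 640), dtype=np.int32)
label_map[100:300, 200:400] = 15                    # person region
kps = np.array([[250, 150], [50, 50]])              # (u, v) per feature
print(static_feature_mask(label_map, kps))          # -> [False  True]
```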
2.2 Geometric Module
Since the separate semantic information can only detect a fixed number of object classes labeled in the training set, tracking and mapping will still be affected in the presence of unknown moving objects. Therefore, a geometric module that does not require prior information is needed.
The K-Means algorithm is first used to segment each new depth image into N clusters, grouping points that are close to each other in 3D space. It is assumed that each cluster is the surface of an object, and the points in the cluster share the same motion constraints. Since a single object can be segmented into several clusters, the objects do not need to be rigid, while most semantic SLAM methods have this rigidity assumption.
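A minimal sketch of this clustering step, assuming a pinhole camera model and scikit-learn's `KMeans`; the cluster count and sampling stride are placeholder values, not taken from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_depth_image(depth, fx, fy, cx, cy, n_clusters=12, stride=4):
    """Back-project a depth image (metres, HxW) into camera-frame 3D points
    and group nearby points with K-Means. Each cluster is then treated as an
    object surface whose points share the same motion constraint."""
    v, u = np.mgrid[0:depth.shape[0]:stride, 0:depth.shape[1]:stride]
    z = depth[v, u]
    valid = z > 0                                   # skip missing depth
    u, v, z = u[valid], v[valid], z[valid]
    # Pinhole back-projection: (u, v, z) -> (X, Y, Z) in the camera frame
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    labels = KMeans(n_clusters=n_clusters, n_init=4).fit_predict(pts)
    return np.stack([u, v], axis=1), pts, labels    # pixels, 3D points, cluster ids
```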
For each cluster, the average reprojection error of all feature points in the cluster relative to their matched 3D map points $P_i$ is calculated, as defined in (1):

$$e = \frac{1}{m}\sum_{i=1}^{m}\rho\left(\left\| p_i - \pi\!\left(T_{cw}\,P_i\right)\right\|^2\right) \tag{1}$$

where $m$ is the number of matched features, $p_i$ is the observed image feature, $T_{cw}$ is the camera pose, $\pi$ represents the camera projection model, and $\rho$ is the penalty function.
When the error of a cluster is relatively larger than that of other clusters, it is marked as a dynamic cluster. All feature points in the dynamic cluster will be removed and will no longer participate in camera pose estimation. This clustering method is more effective and efficient compared to identifying the dynamic state of individual feature points. Additionally, it prevents false positives caused by single-point measurement noise. It also allows us to approximate the general shape of moving objects through geometric clustering. Some results of this method can be seen in the third row of the figure below, where dynamic clusters are highlighted in red. The module can work independently and does not require semantic information, thus enabling the detection of unknown moving objects.

The first row shows the dynamic features detected by the proposed semantic module (blue rectangular points) and the geometric module (red points). The second row shows the corresponding semantic segmentation results. The third row displays the geometric clustering results of the depth image, with dynamic clusters highlighted in red. (a) and (b) show that both modules detect dynamic targets. (c)-(h) indicate semantic segmentation failures, while the geometric module succeeds in segmentation (the geometric module can continue to operate in the event of semantic module failure).
During experiments, the authors observed an interesting phenomenon where some semi-dynamic objects could also be identified. As shown in the above figure (h), the left chair was identified as dynamic. The reason is that the chair is currently static, but its position changes when revisited. This is helpful for constructing long-term consistent maps.
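To illustrate the per-cluster check from Eq. (1), here is a hedged Python/NumPy sketch. The Huber penalty and the median-based relative threshold are assumptions for illustration, not the paper's exact choices:

```python
import numpy as np

def dynamic_clusters(P_w, p_obs, labels, T_cw, K, delta2=5.99, ratio=2.0):
    """Average robust reprojection error per cluster, following Eq. (1).
    P_w: (N,3) matched map points P_i; p_obs: (N,2) observed pixels p_i;
    labels: (N,) cluster id of each feature; T_cw: 4x4 world-to-camera pose;
    K: 3x3 intrinsics. Returns the ids of clusters flagged as dynamic."""
    P_c = (T_cw[:3, :3] @ P_w.T).T + T_cw[:3, 3]           # world -> camera
    uvw = (K @ P_c.T).T
    proj = uvw[:, :2] / uvw[:, 2:3]                         # pi(T_cw * P_i)
    r2 = np.sum((p_obs - proj) ** 2, axis=1)                # squared residuals
    # Huber-style penalty rho (assumed): quadratic below delta2, linear above
    rho = np.where(r2 <= delta2, r2, 2.0 * np.sqrt(delta2 * r2) - delta2)
    ids = np.unique(labels)
    errs = np.array([rho[labels == c].mean() for c in ids])
    # A cluster is dynamic if its error is clearly larger than the others'.
    return ids[errs > ratio * np.median(errs)]
```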
2.3 Keyframe and Local Map Update
Semantic information is extracted only from keyframes. Since new frames are tracked using keyframes and local maps, we only need to ensure that the segmented keyframes and local maps contain only the static parts of the scene. The keyframe selection strategy inherits from the original ORB-SLAM2 system. When a new keyframe is selected during tracking, semantic segmentation is performed in a separate thread, and dynamic feature points are removed. The local map is also updated by removing the corresponding dynamic map points.
In this way, a keyframe database and a map containing only static features and map points are maintained.
2.4 Tracking
Inherited from ORB-SLAM2, two-stage tracking is performed for each new frame. First, initial tracking against the most-overlapping recent keyframe yields an initial pose estimate. Since the keyframe has already had potential dynamic objects removed, this initial estimate is more reliable.
Then, the initial pose estimate is passed to the geometric module for dynamic object detection. After dynamic points are removed from the current frame, tracking proceeds against all local map points observed in the current frame, and a more accurate pose estimate is obtained through local bundle adjustment. Since the semantic module has also removed potential dynamic map points from the local map, the impact of dynamic objects is reduced further, making pose estimation more robust and accurate.
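As a rough illustration, the two-stage flow can be summarized as a skeleton like the following, where every callable is a hypothetical stand-in for the corresponding ORB-SLAM2-style step rather than the system's real interface:

```python
def track_frame(frame, ref_keyframe, local_map,
                estimate_pose, remove_dynamic, local_ba):
    """Two-stage tracking skeleton following the pipeline described above."""
    # Stage 1: initial pose against the most-overlapping recent keyframe,
    # whose dynamic features were already removed by the semantic module.
    T_init = estimate_pose(frame, ref_keyframe)
    # Stage 2: use T_init to let the geometric module drop dynamic clusters,
    # then refine against the static local map with local bundle adjustment.
    static_frame = remove_dynamic(frame, T_init)
    return local_ba(static_frame, local_map, T_init)
```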
3 Experiments and Results
The method in this paper has been tested on the TUM RGB-D dataset, which is widely used for RGB-D SLAM evaluation.
Evaluation Metrics: The error metrics used are the commonly used root mean square error (RMSE) of the absolute trajectory error (ATE), and the RMSE of the relative pose error (RPE), which includes translational and rotational drift. ATE measures the global consistency of the trajectory, while RPE measures the local drift (odometry error) per second.
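For reference, a minimal sketch of these metrics on time-associated, already-aligned (N, 3) position arrays; the RPE version below is the simplified translational form on positions, while the full metric operates on relative SE(3) poses:

```python
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """RMSE of the absolute trajectory error between time-associated and
    already-aligned estimated/ground-truth positions, both (N, 3)."""
    return float(np.sqrt(np.mean(np.sum((est_xyz - gt_xyz) ** 2, axis=1))))

def rpe_trans_rmse(est_xyz, gt_xyz, step=1):
    """RMSE of translational relative pose error over a fixed frame step
    (per-second drift when step matches the frame rate)."""
    d_est = est_xyz[step:] - est_xyz[:-step]
    d_gt = gt_xyz[step:] - gt_xyz[:-step]
    return float(np.sqrt(np.mean(np.sum((d_est - d_gt) ** 2, axis=1))))
```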
3.1 Role of Different Modules
The RMSE of ATE compared to the baseline ORB-SLAM2 is shown in the table below.

Experimental results:
- For slightly dynamic sequences, the results of the proposed method are similar to those of ORB-SLAM2, since ORB-SLAM2 already handles these cases successfully through RANSAC, so the improvement is limited.
- For highly dynamic sequences, both the semantic module and the geometric module achieve significant accuracy improvements, and the combined system achieves the best results.
The figure below compares the trajectories estimated by ORB-SLAM2 and the proposed method against the ground truth.
3.2 Comparison with State-of-the-Art Methods
The authors compared the proposed method with state-of-the-art geometry-based dynamic SLAM methods MR-DVO, SPW, StaticFusion, DSLAM, as well as learning-based methods MID-Fusion, EM-Fusion, DS-SLAM, and DynaSLAM.
The comparisons of ATE and RPE are summarized in Tables 2 and 3, respectively.

It can be seen that the method in this paper provides strong results across all dynamic sequences and outperforms all other dynamic SLAM methods except DynaSLAM, which combines multi-view geometry with a semantic framework. However, DynaSLAM builds its static map offline and cannot run in real time due to its time-consuming Mask R-CNN network and region-growing algorithm. The method in this paper provides results very close to it while operating in real time.
3.3 Robustness Test in Real Environments
In real experiments, a person holding a book walks in front of the camera while the camera remains almost stationary. The figure below shows several screenshots of dynamic point detection results during real-time testing, where the second and third rows are segmentation results from the semantic module and the proposed geometric module, respectively.

The book is not a labeled object in the network model, so the semantic module either fails to recognize it or occasionally misclassifies it, as shown in the second row. As compensation, the geometric module correctly extracts the book as a moving object throughout the test, as shown in the third row. This indicates that both the semantic module and the geometric module are essential for a robust semantic RGB-D SLAM system in dynamic environments. The average trajectory estimation error of the method is approximately 0.012 m, while ORB-SLAM2's error is approximately 0.147 m due to larger fluctuations caused by the moving object.
4 Conclusion
This paper proposes a real-time semantic RGB-D SLAM framework that can handle known and unknown moving objects.
To reduce computational load, a keyframe-based semantic module is proposed, along with an effective geometric module based on geometric clustering to handle unknown moving targets. Extensive evaluations show that the system in this paper provides state-of-the-art positioning accuracy while still being able to run in real-time on embedded platforms.
Future improvements: A long-term semantic map containing only static parts of the environment can be constructed, which would be useful for advanced robotic tasks.
