Comprehensive Guide to vSLAM Development: From Technical Frameworks to Hardware Selection


Produced by | Smart Things Open Class · Instructor | Yang Ruihuan, CTO of Xiaomi Smart · Editor | Wang Xin

Today, I want to share with you the topic “How to Use vSLAM to Help Robots Achieve Accurate Navigation and Obstacle Avoidance in Different Scenarios.” We will cover the following four aspects:

1. The Development History of vSLAM

2. Technical Principles of vSLAM and Different Algorithm Implementations

3. Application Challenges and Solutions of vSLAM in Different Scenarios

4. Practical Applications of vSLAM in Real-Time Navigation and Obstacle Avoidance for Robots

The Development History of vSLAM

The development history of vSLAM can also be seen as the evolution of visual sensors, since the two are interdependent. Monocular cameras came first, giving rise to monocular vSLAM algorithms. The Xbox Kinect then turned structured light and ToF into mass-produced consumer electronics. Combining a monocular camera with a depth camera led to RGB-D SLAM, which also converted depth maps into laser scans for mapping.


Binocular (stereo) cameras have high computational requirements, but thanks to the growing computing power of CPUs and GPUs in recent years, they have gradually become a mainstream sensor. Visual-inertial odometry demands high synchronization accuracy between the camera and the IMU, so we also developed dedicated binocular inertial cameras. Multi-camera systems can be understood simply as multiple binocular pairs that extend the field of view in different directions.


Now, let me briefly introduce the characteristics of each camera:

-Monocular Camera

The biggest problem with monocular cameras is the lack of scale information: a single image cannot tell you the absolute size of, or distance to, objects. Humans can judge these because of extensive prior training, much like deep learning, but machine vision cannot recover spatial relationships from a single image. The camera must move so that images from different poses can be related to each other and scale recovered, effectively functioning like a binocular camera with a variable baseline.

-Structured Light Camera

Structured-light cameras project an infrared pattern and use an IR camera to receive the deformed pattern, from which depth is calculated. The distance between the IR emitter and receiver plays the role of the baseline in a binocular camera, so the principle is akin to triangulation. Various patterns are used, such as stripes; Kinect uses random speckles, with three sizes of speckle at three distances to cover different measurement ranges.

-ToF Camera

ToF cameras emit light pulses and use a sensor to receive the reflected light, calculating distance from the time of flight. Current ToF cameras generally have low resolution, typically QVGA (Quarter VGA, 320×240); VGA resolution is now available, but higher resolution increases cost.
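To make the time-of-flight arithmetic concrete, here is a tiny worked example in Python; the 20 ns round-trip time is purely illustrative and not taken from any specific camera:

```python
# Time-of-flight ranging: emitted light travels to the target and back,
# so distance = (speed of light * round-trip time) / 2.
C = 299_792_458.0        # speed of light, m/s
t_round_trip = 20e-9     # 20 ns round trip (illustrative value)
distance = C * t_round_trip / 2.0
print(f"{distance:.2f} m")  # ~3.00 m
```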

-Binocular Camera

Binocular cameras work much like human eyes: depth is derived from the disparity between what the left and right eyes see, following the same principle.

[Figure: simplified model of binocular triangulation]

This diagram shows a simplified model of binocular triangulation, where P is the measured object, z the distance from the object to the camera, b the baseline, f the focal length, and d the disparity. By similar triangles, z/f = b/d, i.e., z = f·b/d. In practice, computing binocular disparity is more complex; OpenCV provides algorithms such as BM and SGBM.
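As a minimal sketch of how this is done in practice, the snippet below computes a disparity map with OpenCV’s SGBM on an already rectified stereo pair and converts it to depth via z = f·b/d; the file names, focal length, and baseline are placeholder values, not numbers from this talk:

```python
import cv2
import numpy as np

# Placeholder rectified grayscale stereo pair.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-Global Block Matching; numDisparities must be a multiple of 16.
sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # SGBM output is scaled by 16

f = 700.0  # focal length in pixels (assumed calibration value)
b = 0.12   # baseline in meters (assumed)

# Depth from similar triangles: z = f * b / d, valid where disparity > 0.
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = f * b / disparity[valid]
```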

Because the disparity computation does not rely on other sensors, binocular cameras can be used effectively both indoors and outdoors. The image above shows results from an indoor application. For low-texture surfaces, such as white walls or textureless tables, we can project an IR structured-light pattern onto the surface to add texture, which makes the disparity computation easier.

[Figure: binocular camera performance outdoors]

The above shows the performance of binocular cameras outdoors. Structured-light cameras emit and receive IR light, and sunlight contains a great deal of IR that interferes with them, which is why most structured-light and ToF cameras perform poorly outdoors. Binocular cameras do not have this problem, which is why they are widely used in automotive ADAS systems.

We integrate binocular vision, structured light, and an IMU into one camera, a binocular structured-light inertial camera, so that multi-sensor fusion provides complementary performance across different textures, indoors and outdoors, even under rotation.

Technical Principles of vSLAM and Different Algorithm Implementations

Next, I will introduce the technical principles of vSLAM and some open-source vSLAM algorithm implementations. The technical framework of vSLAM mainly includes sensor preprocessing, the frontend, the backend, loop closure detection, and mapping.

[Figure: vSLAM technical framework]

-Sensors

Commonly used sensors include monocular, binocular, and RGB-D cameras. Because of scale uncertainty, monocular cameras generally take longer to initialize, and the estimated pose accumulates error. Binocular cameras have scale information and initialize quickly, but they require calibration. RGB-D cameras are strongly affected by outdoor sunlight and by reflective surfaces. Adding an IMU requires attention to synchronization between the IMU and the images.

-Frontend

The frontend, also known as visual odometry, estimates the camera motion between adjacent frames, addressing the localization problem; it involves feature extraction and pose estimation algorithms.

Because visual odometry only estimates motion between adjacent frames, errors accumulate over time, which is why backend optimization and loop closure detection are needed to close the loop.
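As a minimal sketch of one such frontend step (illustrative only, not the implementation of any system discussed here), the snippet below matches ORB features between two frames and recovers the relative camera motion; the camera matrix K is an assumed calibration input, and for a monocular camera the returned translation is known only up to scale, which is exactly the scale ambiguity described earlier:

```python
import cv2
import numpy as np

def estimate_relative_pose(img1, img2, K):
    """Estimate camera rotation R and translation t between two frames."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    # Brute-force Hamming matching suits binary ORB descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Essential matrix with RANSAC to reject outliers, then decompose.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t is up to scale for a monocular camera
```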

-Backend

Backend optimization mainly includes two families of algorithms: optimization based on filtering theory, and nonlinear optimization. In recent years, nonlinear-optimization-based vSLAM has become increasingly prevalent and is gradually the mainstream, although filtering theory has a longer history and excellent application cases, making it a robust method as well.
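To give a flavor of the nonlinear-optimization approach, the toy sketch below refines a single 6-DoF camera pose by minimizing reprojection error with Levenberg-Marquardt; real backends jointly optimize many poses and landmarks (bundle adjustment), and everything here, including the camera matrix and the synthetic data, is assumed for illustration:

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1]])    # assumed intrinsics
points3d = np.random.uniform([-1, -1, 4], [1, 1, 8], (50, 3))  # synthetic landmarks

def project(pose, pts):
    # pose = [rotation vector (3), translation (3)]
    cam = Rotation.from_rotvec(pose[:3]).apply(pts) + pose[3:]
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]

true_pose = np.array([0.05, -0.02, 0.01, 0.1, -0.1, 0.2])
observations = project(true_pose, points3d) + np.random.normal(0, 0.5, (50, 2))

def residuals(pose):
    # Reprojection error: predicted pixel minus observed pixel.
    return (project(pose, points3d) - observations).ravel()

# Levenberg-Marquardt refinement from an all-zero initial guess.
result = least_squares(residuals, np.zeros(6), method="lm")
print(result.x)  # should land close to true_pose
```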

-Loop Closure Detection

Loop closure detection is analogous to how humans recognize by sight that two places are the same. When the same place is detected, a loop correction is performed. A potential issue is that highly similar scenes can trigger false closures, for example walls with identical repeating textures; in such cases, additional constraints are needed to avoid erroneous closures.
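As a naive illustration of appearance-based loop detection, the sketch below counts ORB matches that survive Lowe’s ratio test between two frames; production systems instead use bag-of-words retrieval (e.g., DBoW2) plus geometric verification, precisely to reject the repeated-texture false positives described above. The match threshold is an arbitrary assumption:

```python
import cv2

def is_loop_candidate(img_a, img_b, min_matches=50):
    """Naive appearance similarity check between two frames."""
    orb = cv2.ORB_create(1000)
    _, des_a = orb.detectAndCompute(img_a, None)
    _, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return False

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    matches = matcher.knnMatch(des_a, des_b, k=2)

    # Lowe's ratio test filters ambiguous matches, which repetitive
    # textures (e.g., identical wall patterns) produce in abundance.
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    return len(good) >= min_matches
```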


-Mapping

vSLAM establishes several types of maps:

(1) Feature maps store geometric feature points and keyframes for localization and navigation; they can also be saved and reloaded as maps during later navigation.

(2) Topological maps maintain the relative relationships (connectivity) of points and lines, but are not concerned with exact distances or directions.

(3) Grid maps are two-dimensional maps that carry metric information: each cell represents a fixed physical size, so the entire map is described in real-world units (see the sketch after this list). Two-dimensional maps generally suit robots moving on the ground without significant height changes.

(4) Three-dimensional maps, such as OctoMap or point-cloud maps, can reconstruct three-dimensional environments and scenes. With these maps, we can perform spatial 3D reconstruction or path planning, for example for UAV path planning and navigation.
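As a minimal sketch of the grid-map idea mentioned in (3), the snippet below allocates a small occupancy grid and marks one observed obstacle; the resolution, map size, and origin are all assumed values:

```python
import numpy as np

resolution = 0.05   # each cell covers 5 cm (assumed)
size = 200          # 200 x 200 cells = a 10 m x 10 m map
grid = np.full((size, size), -1, dtype=np.int8)  # -1 unknown, 0 free, 1 occupied

def world_to_cell(x, y, origin=(-5.0, -5.0)):
    """Convert world coordinates in meters to grid indices."""
    col = int((x - origin[0]) / resolution)
    row = int((y - origin[1]) / resolution)
    return row, col

# Mark an obstacle observed at (1.2 m, -0.4 m) in the world frame.
r, c = world_to_cell(1.2, -0.4)
grid[r, c] = 1
```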


Next, I will quickly introduce several open-source SLAM systems:

-ORB_SLAM

ORB_SLAM is built around the ORB feature. It is a complete SLAM system that includes visual odometry, tracking, and loop closure detection, based entirely on sparse feature points; originally a monocular system, it also provides interfaces for monocular, binocular, and RGB-D cameras.

The ORB feature combines the FAST detector with the BRIEF descriptor, improving and optimizing both on their original foundations, with significant gains in both effectiveness and speed. The link above shows our binocular products running ORB_SLAM, which does not incorporate IMU measurements.

The author of ORB_SLAM also published a paper on monocular visual-inertial SLAM that addresses pure-rotation issues. Wang Jing, an expert in China, produced a simple implementation of it; it remains primarily a validation project with some known bugs, and the code address is provided below for your reference.


-VINS

VINS-Mono and VINS-Mobile are open-source monocular visual-inertial SLAM solutions developed by Professor Shen Shaojie’s team at the Hong Kong University of Science and Technology. They are classic, highly recommended monocular VIO projects: their overall computational requirements are low and their accuracy is decent, and VINS-Mobile can even run on iOS.

VINS-Fusion extends VINS-Mono to support multiple sensor combinations, including monocular + IMU, binocular + IMU, and even pure binocular setups; versions with GPS fusion are also offered. We were fortunate to collaborate with Qin Tong on sensor support and optimization for VINS, contributing optimization solutions and support around its early release.

-OKVIS

OKVIS is a binocular VIO project released by ETH Zurich. It is a classic binocular VIO, but it only outputs a six-degree-of-freedom pose, with no loop closure detection or mapping, so it is not a complete SLAM system. Its accuracy is good, but the pose can drift when the device stays stationary for long periods, a known issue in the project.

The code structure of OKVIS is very clear. Although it is limited to localization, it implements tight coupling and multi-sensor fusion, and the code is highly recommended for study.


-maplab

maplab is another vSLAM framework from ETH Zurich, released after OKVIS. It consists of two parts: an online VIO frontend based on ROVIO, and an offline SLAM processing console.

-MSCKF

MSCKF (Multi-State Constraint Kalman Filter) is a classic vSLAM method based on filtering theory. Its advantages are low computational cost and decent accuracy.

Application Challenges and Solutions of vSLAM in Different Scenarios


This section primarily compares the performance of several common VIO systems and offers some considerations for sensor selection.

The test comparison used platforms including the Intel NUC, Odroid XU4, and UP Board; the NUC and UP Board are x86 machines, while the XU4 has an ARM motherboard.

[Figure: VIO benchmark of pose error vs. CPU usage, memory usage, and processing time]

In this graph, the x-axis is pose error (lower is better), while the y-axes show CPU usage, memory usage, and processing time. From the results, VINS, OKVIS, and ROVIO all perform well. As new algorithms are released, new evaluations will follow, so stay tuned.

-Image Sensors

For vSLAM, global shutter sensors are preferable to rolling shutter sensors: a global shutter exposes the entire image at once, while a rolling shutter exposes line by line. A common misconception is that a global shutter prevents motion blur; blur actually depends on exposure time, so in dim environments where auto-exposure lengthens the exposure, motion can still blur the image. The sensor’s photosensitive area and the lens aperture also play a role. What a global shutter primarily eliminates is the rolling-shutter (“jelly”) effect.

-Lenses

For lenses, a larger field of view is generally better for SLAM, since a wider view captures more feature points. However, when selecting a wide-angle lens, the image distortion should not be too extreme, and the distortion model used for calibration and correction must match the lens for it to be usable.
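A minimal sketch of this correction step with OpenCV’s pinhole radial-tangential model is shown below; the intrinsics, distortion coefficients, and file name are placeholders, and a true fisheye lens would need the separate cv2.fisheye model instead, which is exactly the model-matching issue noted above:

```python
import cv2
import numpy as np

# Assumed calibration: camera matrix and k1, k2, p1, p2, k3 coefficients.
K = np.array([[460.0, 0.0, 320.0],
              [0.0, 460.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.28, 0.07, 0.0, 0.0, 0.0])

img = cv2.imread("frame.png")        # placeholder input image
undistorted = cv2.undistort(img, K, dist)
```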

-Binocular Sensors

Additionally, it is crucial to ensure synchronization between the two binocular sensors: the left and right imagers must expose at the same moment, with identical AE (automatic exposure) and AWB (automatic white balance) settings.

-Image + IMU

Strict synchronization between images and the IMU is also required. The best case is complete hardware alignment between IMU and image timestamps, which is challenging but achievable with our sensors and provides a significant advantage over competitors. The second case is a shared clock between the image sensor and the IMU, so that a fixed offset can be estimated during initialization and compensated. The third case, no synchronization at all, is catastrophic for vSLAM and renders it unusable.
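The sketch below shows one reason this synchronization matters: a VIO pipeline must gather exactly the IMU samples that fall between two image timestamps (for pre-integration), interpolating at the window edges. It assumes a shared clock, and all names and rates are illustrative:

```python
import numpy as np

def imu_between_frames(imu_t, imu_data, t0, t1):
    """Collect IMU samples in (t0, t1), interpolating the boundary values
    so the integration window matches the image timestamps exactly."""
    mask = (imu_t > t0) & (imu_t < t1)
    cols = range(imu_data.shape[1])
    t = np.concatenate(([t0], imu_t[mask], [t1]))
    d = np.vstack([
        [np.interp(t0, imu_t, imu_data[:, i]) for i in cols],
        imu_data[mask],
        [np.interp(t1, imu_t, imu_data[:, i]) for i in cols],
    ])
    return t, d

imu_t = np.arange(0.0, 1.0, 0.005)           # 200 Hz IMU stream (assumed)
imu_data = np.random.randn(imu_t.size, 6)    # [gyro xyz, accel xyz] samples
t, d = imu_between_frames(imu_t, imu_data, 0.100, 0.133)  # ~30 Hz frame pair
```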


During productization, we must also accelerate the algorithms: CPU instruction-set acceleration (e.g., NEON or SSE), GPU acceleration via CUDA or OpenCL, and acceleration on FPGA, DSP, and ASIC chips. There is a misconception that companies can reach a product with minor modifications to open-source projects, but this is very difficult. Algorithms need to be optimized for specific scenes, fused with different sensors, and accelerated for specific platforms to reduce host requirements and cost; only then does successful product development become possible.

We have also run VINS-related algorithms on the Jetson TX2. It is worth noting that there is a dedicated GPU-optimized variant of VINS-Fusion, which improves performance on NVIDIA GPU products.

Applications of vSLAM in Different Scenarios

As vSLAM has developed, it has found mature applications across many fields. Below, we will explore the products in which vSLAM has been applied.


For current positioning, navigation, and obstacle avoidance solutions on the market, binocular + IMU products are gradually becoming the mainstream choice thanks to their accuracy and cost-effectiveness, and the visual sensor can additionally be used for recognition tasks.

Previously, AR and VR relied mainly on external devices for positioning, or required deploying QR codes in the environment, both of which impose strict requirements on the environment and the deployment.

Now, some new VR and AR positioning products use VIO technology, enabling headsets to localize autonomously. These products demand high image and positioning frame rates, low sensor weight, and low power consumption. The positioning frame rate must keep pace with the headset’s display refresh rate so that the response is fast enough to prevent dizziness.

-Drone Obstacle Avoidance

For drone obstacle avoidance, an increasing number of solutions now utilize binocular cameras, as they perform well for outdoor navigation and obstacle avoidance, with higher resolutions leading to better object detection.

-Autonomous Driving

In autonomous driving applications, most solutions rely on multi-sensor fusion, including high-precision GPS, millimeter-wave radar, cameras, lidar, and inertial sensors. For the visual part, we focus on dynamic range, so that cameras perform well in low-light or backlit environments such as tunnels. We also prefer automotive-grade sensors, since the temperature inside a vehicle can become very high under direct sunlight.

The image below shows positioning and navigation on a dataset with GPS fusion using VINS, demonstrating that binocular + GPS can achieve excellent results.

[Figure: VINS positioning and navigation with GPS fusion]

Practical Applications of vSLAM in Real-Time Navigation and Obstacle Avoidance for Robots

In our security robot, we used a vacuum-cleaner chassis and our binocular camera for localization, navigation, and obstacle avoidance, with a 2D lidar for mapping and a panoramic camera for capturing panoramic images. Below, we discuss the roles of the different sensors in this system.

[Figure: security robot sensor setup]

-Ultrasonic Sensors

Ultrasonic sensors are not very precise and are easily affected by the external environment. Their primary role here is to detect transparent objects such as glass, which visual and laser sensors cannot reliably recognize but ultrasound handles well.

-Binocular Sensors

Binocular sensors provide pose information as well as point clouds and depth for obstacle avoidance, and can also perform loop closure detection; they primarily serve localization, navigation, and obstacle avoidance.

-IMU Sensors

In vSLAM, we fuse the IMU with the vision algorithms as VIO; the consumer-grade IMUs we use are not highly precise, yet the fused results are excellent.


-Lidar

Most products currently use single-line (2D) lidars, since multi-line lidars remain expensive. Lidar performs well in mapping accuracy and in special scenarios. How much sensor redundancy this system needs, and how the sensors coordinate, must be decided from the application scenario and product positioning; some scenarios achieve excellent results without lidar at all.

-Chassis Odometry

It is also important to consider the accuracy of the chassis odometry: wheel odometry is generally accurate unless the wheels slip or the ground is uneven, which is why a multi-sensor fusion solution is needed.
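A minimal sketch of wheel odometry for a differential-drive chassis appears below; it is pure dead reckoning, so the slip and uneven-ground errors just described accumulate unchecked, which is the motivation for fusing it with vSLAM and the IMU. The wheel-base value would come from the chassis specification:

```python
import math

def diff_drive_odometry(x, y, theta, d_left, d_right, wheel_base):
    """One odometry update from left/right wheel travel (meters)."""
    d_center = (d_left + d_right) / 2.0          # forward motion
    d_theta = (d_right - d_left) / wheel_base    # heading change
    # Integrate along the average heading over the step.
    x += d_center * math.cos(theta + d_theta / 2.0)
    y += d_center * math.sin(theta + d_theta / 2.0)
    theta += d_theta
    return x, y, theta

# Example: both wheels advance 0.10 m on a chassis with a 0.3 m wheel base.
pose = diff_drive_odometry(0.0, 0.0, 0.0, 0.10, 0.10, 0.3)
```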

-Host

Currently, common hosts are x86- or GPU-based, typically NVIDIA embedded platforms, with algorithms optimized for the ARM + GPU configuration. To meet low-cost requirements, we also port and optimize on other platforms.
