Produced by | Smart Things Open Class
Instructor | Yang Ruihuan, CTO of Xiaomi Intelligent
Editor | Wang Xin
Reminder | Reply with the keyword CV06 to obtain the courseware.
Introduction:
On October 17, the “Smart Things Open Class” presented the third lecture of the second season of its computer vision application series, themed “How to Use vSLAM to Help Robots Achieve Accurate Navigation and Obstacle Avoidance in Different Scenarios,” delivered live by Yang Ruihuan, CTO of Xiaomi Intelligent.
In this lecture, Yang Ruihuan first reviewed the development history of vSLAM, introduced the different types of cameras, analyzed the technical principles and various algorithm implementations, proposed solutions for different scenarios, and presented application cases of vSLAM in robotics.
This article is a transcript of this open class.
Hello everyone, I am Yang Ruihuan, CTO of Xiaomi Intelligent. I am very glad to have the opportunity to communicate and learn together! First, let me introduce our company. Xiaomi Intelligent is a company focused on binocular vision sensors, vSLAM algorithms, and solutions, and we have launched several cameras suitable for vSLAM.
Today, I will share with you the theme “How to Use vSLAM to Help Robots Achieve Accurate Navigation and Obstacle Avoidance in Different Scenarios.” I will share from the following four aspects:
1. Development history of vSLAM
2. Technical principles and different algorithm implementations of vSLAM
3. Application challenges and solutions of vSLAM in different scenarios
4. Practical applications of vSLAM in real-time navigation and obstacle avoidance in robots
Development History of vSLAM
The development history of vSLAM can also be seen as the evolution of visual sensors, since the two are interdependent. The earliest was the monocular camera, which gave rise to monocular vSLAM algorithms. The Xbox Kinect turned structured light and ToF cameras into consumer electronics and achieved large-scale production. Combining a monocular camera with a depth camera produced RGB-D SLAM, in which depth maps can also be converted to laser scans (Laser-Scan) for use with 2D laser-style mapping and navigation.
The binocular camera has higher computational requirements, but thanks to the increasing power of CPUs and GPUs in recent years, it has gradually become a mainstream sensor. Visual-inertial odometry requires high synchronization accuracy between vision and the IMU, so we have also developed dedicated binocular inertial navigation cameras. A multi-camera system can be understood simply as multiple binocular pairs, expanding the field of view and covering different directions.
Next, I will briefly introduce the characteristics of each camera:
– Monocular Camera
The biggest problem with monocular cameras is the lack of scale information: from a single image you cannot recover the absolute proportions of the scene. Humans can judge scale thanks to extensive prior experience, much like deep learning. A machine, however, cannot derive spatial relationships from a single image; the camera must move so that image correspondences across different poses can be used to estimate scale, which effectively turns it into a variable-baseline binocular camera.
– Structured Light Camera
Structured light cameras project infrared patterns and use IR cameras to receive image deformations to calculate depth. The distance between the IR emitter and receiver is similar to the baseline of binoculars, and the distance measurement principle is also akin to triangulation. Various patterns can be used, such as striped patterns. Kinect uses random speckle patterns, providing three sizes of speckles at three different distances for measuring different ranges.
– ToF Camera
ToF cameras emit successive light pulses and use a sensor to receive the reflected light, calculating distance from the light’s time of flight. Current ToF cameras generally have low resolution, typically QVGA (Quarter VGA); VGA resolutions are also available, but cost increases with resolution.
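As a quick illustration of the principle (a minimal sketch, not any vendor’s API), pulsed time-of-flight ranging simply halves the round-trip travel time of light:

    # Illustrative pulsed time-of-flight calculation (not any vendor's API).
    C = 299_792_458.0  # speed of light, m/s

    def tof_distance(round_trip_time_s):
        """Distance to the target given the measured round-trip time of a pulse."""
        # Light travels out and back, so halve the total path length.
        return C * round_trip_time_s / 2.0

    print(tof_distance(10e-9))  # a 10 ns round trip is about 1.5 m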
– Binocular Camera
Binocular cameras operate similarly to human eyes, obtaining depth information through the disparity between the left and right eyes, following the same principle.
This image illustrates a simplified model of binocular triangulation, where P is the measured object, z is the distance from the object to the camera, b is the baseline, f is the focal length, and d is the disparity between the corresponding left and right image points. By the principle of similar triangles, z/f = b/d, i.e., z = f·b/d. In practice, computing binocular disparity is more complex; OpenCV, for example, provides the BM and SGBM algorithms.
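As a hedged illustration, here is a minimal Python sketch that computes a disparity map with OpenCV’s SGBM and converts it to depth via z = f·b/d; the file names, focal length, and baseline below are placeholder assumptions, not values from any specific camera:

    import cv2
    import numpy as np

    # Hypothetical rectified stereo pair (placeholder file names).
    left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
    right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

    # Semi-Global Block Matching; numDisparities must be a multiple of 16.
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=9)
    disparity = sgbm.compute(left, right).astype(np.float32) / 16.0  # fixed-point -> pixels

    # Depth from similar triangles: z = f * b / d.
    f = 700.0  # focal length in pixels (assumed calibration value)
    b = 0.12   # baseline in meters (assumed)
    valid = disparity > 0
    depth = np.zeros_like(disparity)
    depth[valid] = f * b / disparity[valid]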
Since the calculation of binocular disparity does not rely on other sensors, binocular cameras can be effectively applied both indoors and outdoors. The above image shows the results observed in indoor applications. For low-texture surfaces, such as white walls or featureless tables, we can overlay IR structured light patterns to enhance surface texture information, facilitating the calculation of binocular disparity.
The above shows the performance of binocular cameras outdoors. Structured light cameras primarily emit and receive IR light; however, the abundant IR light in sunlight can interfere, so most structured light and ToF cameras do not perform well outdoors. In contrast, binocular cameras do not face this issue, which is why many automotive ADAS systems use binocular cameras.
We combine binocular structured light with IMU into a single camera, referred to as a binocular structured light inertial navigation camera, achieving excellent application performance through multi-sensor fusion and complementarity, even under rotating conditions.
Technical Principles of vSLAM and Different Algorithm Implementations
Next, I will introduce the technical principles of vSLAM and some open-source vSLAM algorithm implementations. The technical framework of vSLAM mainly includes sensor data preprocessing, the front-end, the back-end, loop closure detection, and mapping.
– Sensors
Commonly used sensors include monocular, binocular, and RGB-D cameras. Because of scale uncertainty, monocular cameras generally take longer to initialize, and the estimated poses accumulate error. Binocular cameras have scale information and initialize quickly, but they require calibration. RGB-D cameras are strongly affected by sunlight outdoors, and reflective surfaces also interfere with them. When an IMU is integrated, synchronization between the IMU and images must also be considered.
– Front-end
The front-end, also known as visual odometry, estimates the camera’s motion between adjacent frames, addressing the localization problem. This involves feature extraction and pose estimation algorithms.
The visual odometry described here only estimates motion between adjacent frames; these local estimates accumulate error over time, which is why back-end optimization and loop closure detection are needed to close the loop.
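To make the front-end concrete, below is a minimal two-frame sketch using OpenCV: ORB features are matched between adjacent frames, and the relative pose is recovered from the essential matrix. For a monocular setup the translation comes back only up to scale, echoing the scale ambiguity discussed earlier. This is an illustrative sketch, not the pipeline of any particular SLAM system:

    import cv2
    import numpy as np

    def estimate_relative_pose(img1, img2, K):
        """Estimate the rotation R and unit-scale translation t between two frames."""
        orb = cv2.ORB_create(2000)
        kp1, des1 = orb.detectAndCompute(img1, None)
        kp2, des2 = orb.detectAndCompute(img2, None)

        # Brute-force Hamming matching suits binary ORB descriptors.
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = matcher.match(des1, des2)
        pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
        pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

        # Essential matrix with RANSAC rejects outlier matches; K is the
        # 3x3 camera intrinsic matrix from calibration.
        E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
        # Monocular: t is recovered only up to an unknown scale factor.
        _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
        return R, t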
– Back-end
Back-end optimization primarily involves two types of algorithms: filter-based optimization and nonlinear optimization. In recent years, nonlinear optimization-based vSLAM has become increasingly prevalent, gradually becoming the mainstream method. However, filter-based theory has been developed for a longer time and has had successful application cases, making it highly viable.
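To show what nonlinear back-end optimization does, here is a deliberately tiny 1D pose-graph toy (an illustrative sketch, not a real SLAM back-end): noisy odometry constraints plus one loop-closure constraint are minimized jointly with SciPy’s least_squares, pulling the dead-reckoned trajectory into consistency:

    import numpy as np
    from scipy.optimize import least_squares

    # Toy 1D pose graph: five poses, out ~2 m and back. A loop closure
    # says pose 4 coincides with pose 0 again.
    odometry = [1.02, 0.98, -1.05, -0.99]  # noisy relative steps (m)
    loop = (0, 4, 0.0)                     # (i, j, measured x_j - x_i)

    def residuals(x):
        res = [x[0]]                                   # anchor the first pose at 0
        res += [(x[i + 1] - x[i]) - z for i, z in enumerate(odometry)]
        i, j, z = loop
        res.append(10.0 * ((x[j] - x[i]) - z))         # weighted loop-closure term
        return np.array(res)

    x0 = np.cumsum([0.0] + odometry)  # initialize by dead reckoning
    sol = least_squares(residuals, x0)
    print(sol.x)  # poses are pulled toward a loop-consistent solution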
– Loop Closure Detection
Loop closure detection is similar to how humans use their eyes to determine if two places are the same. When identical points are detected, a loop closure correction is performed. However, a potential issue arises when highly similar scenes lead to erroneous closures, such as walls with identical repetitive textures. In such cases, additional constraints must be added to avoid erroneous closures.
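A crude appearance-similarity check can be sketched with ORB descriptor matching, as below; real systems typically use bag-of-words models (e.g., DBoW2) plus geometric verification, precisely because raw appearance scores are fooled by repetitive textures:

    import cv2

    def appearance_similarity(img_a, img_b, ratio=0.75):
        """Rough place-similarity score: fraction of ORB matches passing a ratio test."""
        orb = cv2.ORB_create(1000)
        _, des_a = orb.detectAndCompute(img_a, None)
        _, des_b = orb.detectAndCompute(img_b, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
        pairs = matcher.knnMatch(des_a, des_b, k=2)
        good = [p for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        return len(good) / max(len(des_a), 1)

A high score alone is not a safe trigger: walls with identical repetitive textures can score high, so a closure should only be accepted after additional geometric or temporal consistency checks.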
– Mapping
vSLAM creates several types of maps:
(1) Feature maps store geometric feature points and keyframes, and can be saved as map data for localization and navigation.
(2) Topological maps, also known as statistical maps, preserve the relative positions of points and lines but do not concern themselves with area, distance, or direction accuracy.
(3) Grid maps are two-dimensional maps that incorporate size information. For instance, the small squares in the background of the above image each represent a fixed unit of size, so the entire map is described in metric terms. Two-dimensional maps are generally suited to ground movement without significant height variation (a minimal occupancy-grid sketch follows this list).
(4) Three-dimensional maps can restore the three-dimensional environment and scenes, such as Octomap or point cloud maps. With these maps, we can perform spatial three-dimensional reconstruction or path planning, such as for drone path planning and navigation.
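As promised under item (3), here is a minimal occupancy-grid sketch, assuming obstacle points have already been projected into a 2D world frame; the resolution and map size are arbitrary illustrative choices:

    import numpy as np

    # Minimal occupancy grid: a 10 m x 10 m map at 0.05 m per cell.
    RESOLUTION = 0.05
    grid = np.zeros((200, 200), dtype=np.int8)  # 0 = free/unknown, 1 = occupied

    def mark_occupied(points_xy):
        """Mark world-frame obstacle points (meters) as occupied cells."""
        for x, y in points_xy:
            i, j = int(y / RESOLUTION), int(x / RESOLUTION)
            if 0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]:
                grid[i, j] = 1

    # e.g. obstacle points projected down from a depth image or laser scan
    mark_occupied([(1.00, 2.00), (1.05, 2.00), (1.10, 2.00)])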
Next, I will quickly introduce several open-source SLAM options:
– ORB_SLAM
ORB_SLAM uses ORB features at its core. It is a complete SLAM system, including visual odometry, tracking, and loop closure detection. It began as a monocular SLAM system based entirely on sparse feature points, and it also provides interfaces for monocular, binocular, and RGB-D cameras.
ORB features combine the FAST feature detector with the BRIEF feature descriptor, improving and optimizing both and significantly enhancing effect and speed. The link above details the integration of our binocular products with ORB_SLAM, which does not include IMU calculations.
The author of ORB_SLAM later published a paper on monocular visual-inertial SLAM, which addresses pure-rotation issues. A well-known expert in China, Wang Jing, implemented a simplified version, but it is mainly for verification purposes and contains some bugs. The code link is provided below for your study.
– VINS
VINS-Mono and VINS-Mobile are open-source monocular visual inertial SLAM solutions developed by Professor Shen Shaojie’s team at the Hong Kong University of Science and Technology. They are classic and highly recommended monocular VIO projects. They have low overall computational requirements and decent accuracy, especially VINS-Mobile, which can run on iOS systems.
VINS-Fusion is an extension of VINS-Mono that supports multi-sensor integration, including monocular + IMU, binocular + IMU, and even pure binocular configurations. We have also collaborated with expert Qin Tong to optimize sensor integration in VINS, providing optimization solutions and support, which greatly assisted VINS during its initial release.
– OKVIS
OKVIS is a binocular VIO project released by ETH Zurich. It is also a classic binocular VIO, but it only outputs six degrees of freedom poses without loop closure detection and mapping, so it is not a complete SLAM system. Although it has good accuracy, it can experience pose drift if stationary for extended periods, which is a known issue within the project.
The code structure of OKVIS is very clear, but it is limited to localization functionality. However, it includes tight coupling and multi-sensor fusion, and the code is highly recommended for learning.
– maplab
maplab is another vSLAM framework from ETH Zurich, released after OKVIS. It consists of two parts: an online visual-inertial front end built on ROVIO, and an offline SLAM processing console.
– MSCKF
MSCKF (Multi-State Constraint Kalman Filter) is a classic vSLAM approach based on filter optimization. Its advantages are low computational requirements and decent accuracy.
Application Challenges and Solutions of vSLAM in Different Scenarios
Next, I will introduce some comparisons of the performance of several common VIOs and considerations for sensor selection.
The test platforms are the Intel NUC, Odroid XU4, and UP Board. The NUC and UP Board are both X86 platforms, while the XU4 uses an ARM motherboard.
In this chart, the x-axis represents pose error, with smaller errors being better, while the y-axes of the sub-charts show CPU usage, memory usage, and processing time. From these results, VINS, OKVIS, and ROVIO perform well. As new algorithms are introduced, new evaluations will follow, so stay tuned.
– Image Sensors
For vSLAM, a global shutter is preferred over a rolling shutter: a global shutter exposes the entire image at once, while a rolling shutter exposes it line by line. A common misconception is that global-shutter images never blur; in fact, blur is related to exposure time. In dim environments, if auto exposure lengthens the exposure time, motion blur can still occur, and this also depends on the sensor’s light-sensitive area and the lens aperture. What a global shutter primarily eliminates is the rolling-shutter (jelly) effect.
– Lenses
For lenses, a larger field of view is better for SLAM, since more information in view means more feature points can be captured. However, when choosing a wide-angle lens, the image distortion must not be too severe, and a matching calibration and distortion-correction model is required.
– Binocular Sensors
Additionally, it is crucial that the binocular sensors are synchronized: the left and right cameras must expose simultaneously, with the same AE (auto exposure) and AWB (auto white balance) settings.
– Image + IMU
Strict synchronization is also required between images and the IMU. The ideal case is for IMU samples and images to be precisely aligned; this is challenging, but our sensors have largely achieved it, a significant advantage over competitors. The second case is synchronized clocks between the image sensor and IMU, where a fixed offset can be estimated during initialization. The third case, with no synchronization at all, is fatal for vSLAM and cannot be used.
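For the second case (synchronized clocks with a fixed offset), a minimal sketch of resampling IMU data onto image timestamps might look like the following; the function and its parameters are hypothetical, purely for illustration:

    import numpy as np

    def imu_at_image_times(imu_t, imu_gyro, image_t, offset=0.0):
        """Resample gyro readings at image timestamps given a fixed clock offset.

        imu_t: (N,) IMU timestamps; imu_gyro: (N, 3) angular rates;
        image_t: (M,) image timestamps; offset = image_clock - imu_clock,
        estimated once at initialization."""
        t = np.asarray(image_t) - offset  # map image times onto the IMU clock
        imu_gyro = np.asarray(imu_gyro)
        return np.stack(
            [np.interp(t, imu_t, imu_gyro[:, k]) for k in range(3)], axis=1)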
In productization, we also need to accelerate the algorithms, including CPU instruction-set acceleration, GPU acceleration via CUDA or OpenCL, and acceleration with FPGA, DSP, and ASIC chips. There is a misconception that open-source projects can easily be adapted into products, but this is quite difficult: algorithms must be optimized for specific scenes, fused with different sensors, and accelerated and optimized for specific platforms to reduce host requirements and cost, which is essential for successful product development.
We have also run VINS-related algorithms on the Jetson TX2, and it is worth mentioning that there is a dedicated GPU-optimized version of VINS-Fusion, which significantly improves performance on NVIDIA GPU products.
Applications of vSLAM in Different Scenarios
As vSLAM has developed, it has found mature applications in many fields. Let’s look at which products vSLAM has been applied to.
For different positioning, navigation, and obstacle avoidance solutions on the market, binocular + IMU products are gradually becoming the mainstream choice due to their accuracy and cost-effectiveness, while also providing recognition capabilities within visual sensors.
In the past, AR and VR primarily relied on external devices for positioning or required the deployment of QR codes in the environment, which imposed high requirements on the environment and deployment.
Now, some new VR and AR positioning products use VIO technology, enabling headsets to localize autonomously. These products demand high image frame rates and high positioning frame rates, and the sensors must not be too heavy or consume too much power. The high positioning frame rate is tied to the headset’s display refresh rate: the system must respond quickly to avoid causing dizziness.
– Drone Obstacle Avoidance
For drone obstacle avoidance, an increasing number of solutions are utilizing binocular systems, as binoculars can effectively serve as outdoor obstacle avoidance and navigation solutions, with higher resolutions yielding better object detection results.
– Autonomous Driving
In autonomous driving applications, most solutions rely on multi-sensor fusion, including high-precision GPS, millimeter-wave radar, cameras, LIDAR, and inertial navigation sensors. In the visual component, we focus on the dynamic range of images, ensuring cameras perform well in low-light or backlit environments (e.g., tunnels). We also prefer vehicle-grade sensors, as the internal temperature of vehicles can be very high under direct sunlight in summer.
The following image shows positioning and navigation results on a dataset where VINS is fused with GPS, demonstrating that binocular + GPS can achieve excellent results.
Practical Applications of vSLAM in Real-Time Navigation and Obstacle Avoidance in Robots
We utilized a robotic vacuum cleaner chassis for our security robot, employing our binocular camera for positioning, navigation, and obstacle avoidance. The system also incorporates a 2D LIDAR for mapping and a panoramic camera for capturing panoramic images. Let’s discuss the roles of different sensors in this system.
– Ultrasonic Sensors
Ultrasonic sensors are relatively low-precision and susceptible to external environmental interference. Their primary function here is to solve the issue of visual and LIDAR sensors being ineffective at detecting transparent objects like glass; ultrasonic sensors can effectively detect glass.
– Binocular Sensors
Binocular sensors mainly provide pose information as well as obstacle point clouds and depth, and they also support loop closure detection. They play a crucial role in positioning, navigation, and obstacle avoidance.
– IMU Sensors
We have integrated IMU sensors with vision in our vSLAM VIO algorithms. The IMU we use is a consumer-grade model with relatively low precision but has achieved excellent results.
– LIDAR
Most products currently use single-line LIDAR due to the high cost of multi-line LIDAR. LIDAR performs well in mapping accuracy and in certain special scenarios. The number of redundant sensors and their coordination in this system should be determined based on the application context and product positioning; in some scenarios, excellent results can be achieved without LIDAR.
– Chassis Odometry
The odometry of the chassis must also be considered: wheel odometry is generally accurate, except when the wheels slip or the ground is uneven, which is why a multi-sensor fusion solution is needed.
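For reference, the standard dead-reckoning update for a differential-drive chassis looks like the sketch below (a textbook model, not our product’s code); slipping or uneven ground breaks its assumptions, which is exactly why fusion with vision and the IMU is needed:

    import math

    def diff_drive_update(x, y, theta, d_left, d_right, wheel_base):
        """Dead-reckoning pose update for a differential-drive chassis.

        d_left / d_right: wheel travel since the last update (encoder ticks
        times distance per tick); wheel_base: distance between the wheels."""
        d = (d_left + d_right) / 2.0               # distance moved by the center
        d_theta = (d_right - d_left) / wheel_base  # change in heading
        # Integrate using the midpoint heading for a slightly better estimate.
        x += d * math.cos(theta + d_theta / 2.0)
        y += d * math.sin(theta + d_theta / 2.0)
        return x, y, theta + d_theta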
– Host
Currently, common hosts include GPU or X86 platforms. We primarily use NVIDIA’s embedded platforms, optimizing our algorithms through ARM + GPU integration. For cost reduction, we also port and optimize our algorithms for other platforms.
END