vSLAM Development Guide: From Technical Framework, Open Source Algorithms to Hardware Selection!


Introduction:

On October 17, Zhizhongxi Open Class organized a lecture titled “How to Use vSLAM to Help Robots Achieve Accurate Navigation and Obstacle Avoidance in Different Scenarios”, presented live by Yang Ruihuan, CTO of Xiaomi Intelligent.

In this lecture, Yang Ruihuan started from the development history of vSLAM, introduced different cameras, analyzed the technical principles and different algorithm implementations, proposed solutions for various scenarios, and provided application cases of vSLAM in robotics.

This article is a transcript of this open class.

Hello everyone, I am Yang Ruihuan, CTO of Xiaomi Intelligent. I am very glad to have the opportunity to communicate and learn with you! First, let me introduce our company. Xiaomi Intelligent is a company focused on binocular vision sensors, vSLAM algorithms, and solutions, and has launched multiple cameras suitable for vSLAM.

Today, I will share with you the theme of “How to Use vSLAM to Help Robots Achieve Accurate Navigation and Obstacle Avoidance in Different Scenarios”. I will share from the following four aspects:

1. Development history of vSLAM

2. Technical principles and different algorithm implementations of vSLAM

3. Application challenges and solutions of vSLAM in different scenarios

4. Practical application of vSLAM in real-time navigation and obstacle avoidance for robots

Development History of vSLAM

The development history of vSLAM can also be seen as the evolution of visual sensors; the two have advanced hand in hand. The earliest sensor was the monocular camera, which gave rise to monocular vSLAM algorithms. The Xbox Kinect then turned structured light and ToF into mass-produced consumer electronics. Combining a monocular camera with a depth camera produced RGB-D SLAM, in which depth maps can be converted into laser scans for mapping.


Binocular cameras are computationally demanding, but thanks to the increasing power of CPUs and GPUs in recent years they have gradually become mainstream sensors. Visual-inertial odometry places strict synchronization requirements on the camera and the IMU, so we also developed dedicated binocular inertial navigation cameras. Multi-camera systems can be understood simply as multiple binocular pairs, expanding the field of view and the directions covered.


Next, I will briefly introduce the characteristics of each camera:

– Monocular Camera

The biggest problem with monocular cameras is the lack of scale information: a single image does not reveal the absolute size of, or distance to, what it shows. Humans can judge these because we have undergone extensive training, much like deep learning, but machine vision cannot recover spatial relationships from a single image. The camera must move so that images taken at different poses can be related to each other, recovering structure; in effect this turns the setup into a variable-baseline binocular camera.
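To make this concrete, here is a minimal sketch (not from the lecture) of two-view relative pose estimation with OpenCV; the matched point arrays pts1/pts2 and the intrinsic matrix K are assumed to come from elsewhere. Note that the recovered translation is only a unit direction: the metric scale of the motion remains unobservable from the images alone, which is exactly the monocular limitation described above.

```python
# Minimal sketch: two-view relative pose with OpenCV.
# Assumes pts1, pts2 (Nx2 float arrays of matched pixel coordinates)
# and the 3x3 intrinsic matrix K are already available.
import numpy as np
import cv2

def relative_pose(pts1, pts2, K):
    # Essential matrix from matched points; RANSAC rejects outliers.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose into rotation R and translation t via cheirality check.
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    # t has unit norm: the metric scale of the baseline is unknown.
    return R, t
```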

– Structured Light Camera

Structured-light cameras project infrared patterns and compute depth from the deformation of the pattern as seen by the IR camera. The distance between the IR emitter and the receiver plays the role of the baseline in a binocular camera, so the principle is similar to triangulation. Various patterns are used, such as stripes; the Kinect uses random speckles, provided at three different sizes for three different distance ranges.

– ToF Camera

ToF cameras emit continuous light pulses and use a sensor to receive the reflections, estimating distance from the time of flight of the light (roughly z = c·Δt/2 for a pulsed design). The resolution of current ToF cameras is generally low, typically QVGA (quarter VGA), although VGA parts are now available; as resolution increases, however, so does cost.

– Binocular Camera

Binocular cameras work much like human eyes. Our eyes perceive depth from the disparity between what the left and right eyes see, and binocular cameras exploit the same principle.

[Figure: simplified model of binocular triangulation]

This figure shows a simplified model of binocular triangulation: P is the measured point, z is its distance from the camera, b the baseline, f the focal length, and d the disparity. By similar triangles, z/f = b/d, so z = f·b/d. In practice, computing binocular disparity is more involved; OpenCV, for example, provides the BM and SGBM algorithms.
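As a hedged illustration of the z = f·b/d relation, the sketch below computes a dense depth map with OpenCV's SGBM matcher. The rectified image pair and the calibration values fx (focal length in pixels) and baseline (meters) are assumed inputs.

```python
# Minimal sketch: disparity via OpenCV SGBM, then depth as z = f*b/d.
# Assumes rectified grayscale images `left`/`right` plus fx and baseline
# from stereo calibration.
import numpy as np
import cv2

def stereo_depth(left, right, fx, baseline):
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128,
                                 blockSize=5)
    # SGBM returns fixed-point disparities scaled by 16.
    disp = sgbm.compute(left, right).astype(np.float32) / 16.0
    disp[disp <= 0] = np.nan           # unmatched / invalid pixels
    return fx * baseline / disp        # depth in meters per pixel
```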

Since computing binocular disparity does not rely on any other sensor, binocular cameras work well both indoors and outdoors. The image above shows indoor results. For low-texture surfaces, such as white walls or plain tabletops, we can project an IR structured-light texture onto the surface to enrich it, making the disparity easier to compute.

[Figure: binocular camera depth results outdoors]

The above shows the performance of binocular cameras outdoors. Structured-light cameras mainly emit and receive IR light, and sunlight contains a great deal of IR that interferes with them, so most structured-light and ToF cameras do not work well outdoors; binocular cameras do not have this problem. This is why many automotive ADAS systems now use binocular cameras.

We integrate binocular structured light and an IMU into a single device, a binocular structured-light inertial navigation camera, which performs well across different textures, indoors and outdoors, and even under rotation.

Technical Principles and Different Algorithm Implementations of vSLAM

Next, I will introduce the technical principles of vSLAM and some open-source vSLAM implementations. The technical framework of vSLAM mainly comprises sensor preprocessing, the frontend, the backend, loop detection, and mapping.

[Figure: vSLAM technical framework]

– Sensors

The sensors we commonly use are monocular, binocular, and RGB-D. Because of the scale ambiguity of monocular cameras, initialization generally takes longer, and the estimated poses accumulate error. Binocular cameras carry scale information and initialize quickly, but they require careful calibration. RGB-D cameras are strongly affected by sunlight outdoors, and reflective surfaces also interfere with them. If an IMU is added, synchronization between the IMU and the images must be considered as well.

– Frontend

The frontend, also known as visual odometry, estimates the camera motion between adjacent frames, addressing the localization problem; it involves feature extraction and pose estimation algorithms.

The visual odometry described here only estimates motion between adjacent frames, a purely local estimate, so error accumulates over time; this is why backend optimization and loop detection are needed to close the loop.
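As a minimal frontend building block, for illustration only, the sketch below tracks features from frame to frame with pyramidal Lucas-Kanade optical flow in OpenCV; the surviving correspondences would then feed a pose estimator such as the essential-matrix step shown earlier.

```python
# Minimal sketch of one frontend step: feature tracking between frames.
# Assumes grayscale frames prev_img/cur_img and an Nx1x2 float32 array
# prev_pts, e.g. from cv2.goodFeaturesToTrack.
import cv2

def track_features(prev_img, cur_img, prev_pts):
    cur_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1       # keep only successfully tracked points
    return prev_pts[ok], cur_pts[ok]
```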

– Backend

Backend optimization mainly comes in two flavors: methods based on filtering theory (such as the extended Kalman filter) and methods based on nonlinear optimization. In recent years, nonlinear-optimization-based vSLAM has become increasingly popular and gradually mainstream. Filtering, however, has a longer history and an excellent application track record, and remains very viable.
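To show the nonlinear-optimization idea in miniature (a toy example, not any particular backend), the sketch below optimizes a 1D pose graph with scipy.optimize.least_squares: drifting odometry constraints are balanced against a single loop-closure constraint.

```python
# Toy 1D pose graph: 5 poses, each odometry step measured as 1.1 m
# (systematic drift), plus a loop closure saying pose 4 lies exactly
# 4.0 m from pose 0. Illustrative assumptions throughout.
import numpy as np
from scipy.optimize import least_squares

odom = np.full(4, 1.1)   # drifting relative-motion measurements
loop = 4.0               # loop-closure measurement between pose 0 and 4

def residuals(x):
    r_prior = [x[0]]                         # anchor the first pose at 0
    r_odom = (x[1:] - x[:-1]) - odom         # odometry constraints
    r_loop = [5.0 * ((x[4] - x[0]) - loop)]  # loop closure, weighted higher
    return np.concatenate([r_prior, r_odom, r_loop])

x0 = np.concatenate([[0.0], np.cumsum(odom)])  # dead-reckoned initial guess
sol = least_squares(residuals, x0)
print(sol.x)   # poses pulled back toward [0, 1, 2, 3, 4]
```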

– Loop Detection

Loop detection works much as humans use their eyes to decide whether two places are the same; when the same place is recognized, a loop correction is applied. One caveat: highly similar scenes can produce false loops, for example walls with identical repetitive textures. In such cases, additional constraints must be added to avoid erroneous closures.
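As an illustrative stand-in for a full bag-of-words loop detector (real systems typically use DBoW-style vocabularies), the sketch below scores keyframe similarity by matching ORB descriptors with a ratio test. The repetitive-texture failure mode mentioned above would appear here as a high score between distinct places, which is why geometric verification is added on top.

```python
# Hypothetical helper: crude keyframe similarity from ORB matches.
import cv2

def frame_similarity(img1, img2, max_features=500):
    orb = cv2.ORB_create(max_features)
    _, des1 = orb.detectAndCompute(img1, None)
    _, des2 = orb.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = matcher.knnMatch(des1, des2, k=2)
    # Lowe's ratio test discards ambiguous matches; repetitive textures
    # can still fool the score, hence extra geometric checks in practice.
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(des1), 1)
```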


– Mapping

Maps established by vSLAM can be of several types:

(1) Feature maps store geometric feature points and keyframes; they can be used for localization directly or saved as maps for later navigation.

(2) Topological maps preserve the relative connectivity of points and lines but do not care about metric distance or direction; they are abstract maps.

(3) Grid maps are two-dimensional maps that carry metric size: each grid cell represents a fixed physical unit, so the map as a whole describes dimensions (see the minimal grid sketch after this list). Two-dimensional maps mainly suit robots that move on the ground without much height variation.

(4) Three-dimensional maps, such as OctoMap or point-cloud maps, can reconstruct the three-dimensional environment and scene. With such a map we can perform spatial 3D reconstruction or path planning, for example path planning and navigation for drones.
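The grid sketch referenced above: a minimal illustration of how a 2D occupancy grid encodes metric size. The cell resolution, map extent, and log-odds increment are assumed values, not anything specific from the lecture.

```python
# Minimal occupancy-grid sketch: every cell covers a fixed physical
# size (5 cm here); beams from a range sensor raise the occupancy
# log-odds of the cells they hit.
import numpy as np

RES = 0.05                       # meters per cell
grid = np.zeros((400, 400))      # log-odds grid covering 20 m x 20 m

def mark_hits(x, y, theta, ranges, bearings):
    # World-frame endpoint of each beam (ranges/bearings as np arrays).
    ex = x + ranges * np.cos(theta + bearings)
    ey = y + ranges * np.sin(theta + bearings)
    i = (ex / RES).astype(int) + grid.shape[0] // 2
    j = (ey / RES).astype(int) + grid.shape[1] // 2
    ok = (i >= 0) & (i < grid.shape[0]) & (j >= 0) & (j < grid.shape[1])
    grid[i[ok], j[ok]] += 0.9    # raise log-odds at observed obstacles
```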


Next, I will briefly introduce several open-source SLAM systems:

– ORB_SLAM

ORB_SLAM is built around the ORB feature as its core. It is a complete SLAM system built entirely on sparse feature points, covering visual odometry, tracking, and loop detection, and it provides interfaces for monocular, binocular, and RGB-D cameras.

ORB features combine the FAST detector with the BRIEF descriptor, improving and optimizing both (adding orientation to FAST and rotation awareness to BRIEF), which significantly enhances performance and speed. The link above shows our binocular product running ORB_SLAM, without IMU fusion.
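For a quick feel of ORB itself, a short sketch assuming a grayscale image img has been loaded elsewhere: oriented FAST supplies the keypoints, and rotation-aware BRIEF supplies compact 32-byte binary descriptors.

```python
# ORB = oriented FAST keypoints + rotated BRIEF descriptors (OpenCV).
import cv2

orb = cv2.ORB_create(nfeatures=1000)
kp = orb.detect(img, None)       # FAST corners, each with an orientation
kp, des = orb.compute(img, kp)   # binary descriptors, 32 bytes per point
print(len(kp), des.shape)        # e.g. 1000 keypoints, shape (1000, 32)
```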

The author of ORB_SLAM also published a paper on monocular visual-inertial SLAM, which can handle the pure-rotation problem. Wang Jing, a well-known domestic developer, wrote a simple implementation, but it is only for verification and still has some bugs. The image below contains the code address for anyone who wants to study it.

[Figure: code repository address]

– VINS

VINS-Mono and VINS-Mobile are monocular visual-inertial SLAM solutions open-sourced by Professor Shen Shaojie's team at the Hong Kong University of Science and Technology. They are classics, and personally I highly recommend this monocular VIO project: its overall computational cost is low, its accuracy is decent, and VINS-Mobile even runs on iOS.

VINS-Fusion is an extension of VINS-Mono that supports multi-sensor combinations, including monocular + IMU, binocular + IMU, and even pure binocular; a version with GPS fusion is also provided. We were fortunate to discuss sensor optimization for VINS with Qin Tong, providing optimization schemes and support, which gave VINS a good sensor at the time of its early release.

– OKVIS

OKVIS is a binocular VIO project released by ETH Zurich and is likewise a classic. However, it only outputs six-degree-of-freedom poses, with no loop detection or mapping, so strictly speaking it is not a complete SLAM system. Its accuracy is good, but if the device stays stationary for a long time the pose can drift, a known issue of the project.

OKVIS's code structure is very clear, though it is limited to localization. It does include tight coupling and multi-sensor fusion, and the code is highly recommended reading.


– maplab

maplab is another vSLAM framework from ETH Zurich, released after OKVIS; it consists of two parts: ROVIO and an offline SLAM processing console.

– MSCKF

MSCKF is a classic filtering-based vSLAM; its advantages are low computational cost and decent accuracy.

Application Challenges and Solutions of vSLAM in Different Scenarios

Next, I will compare the performance of several common VIO systems and discuss some considerations for selecting sensors.

The platforms used for the comparison are the Intel NUC, Odroid XU4, and UP Board; the NUC and UP Board are x86, while the XU4 is an ARM board.

[Figure: VIO benchmark comparison]

In this figure, the horizontal axis is pose error (smaller is better), and the vertical axes are CPU usage, memory usage, and processing time. By these results, VINS, OKVIS, and ROVIO all perform quite well. As new algorithms emerge, new evaluations will follow, so keep an eye out.

– Image Sensors

For vSLAM, a global shutter is better than a rolling shutter: a global shutter exposes the entire image at once, while a rolling shutter exposes it line by line. A common misconception is that a global shutter prevents motion blur; it does not. Whether blur occurs depends on the exposure time: in dark environments, if auto-exposure lengthens the exposure, motion can still blur the image, which also depends on the sensor's photosensitive area, lens aperture, and other factors. What a global shutter actually eliminates is the rolling-shutter (jelly) effect.
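A back-of-envelope sketch of the exposure-time point: image-plane blur scales roughly with focal length times angular rate times exposure time, regardless of shutter type. The numbers below are illustrative assumptions only.

```python
# Rough motion-blur estimate: pixel smear ~ f_px * omega * t_exp
# for small rotations. All values are assumed for illustration.
focal_px = 700.0    # focal length in pixels
omega = 1.0         # camera angular rate, rad/s (a brisk rotation)
for t_exp in (0.001, 0.010, 0.033):        # 1 ms, 10 ms, 33 ms exposures
    blur = focal_px * omega * t_exp        # approx. image-plane smear
    print(f"{t_exp * 1000:.0f} ms exposure -> ~{blur:.1f} px blur")
```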

– Lenses

For lenses, a larger field of view is generally better for SLAM, because seeing more of the scene means more feature points can be captured. When choosing a wide-angle lens, however, the distortion must not be too severe, and the lens must match the calibration and rectification model in use.

– Binocular Sensors

Additionally, it is crucial that the two imagers of a binocular camera are synchronized: the left and right sensors must expose simultaneously, with the same AE (auto exposure) and AWB (auto white balance) settings.

– Image + IMU

Synchronization requirements between images and the IMU are strict. The best case is that IMU samples and image timestamps align perfectly, as in the first case in the figure below. This is hard to achieve, but our sensors essentially do, which is a key reason we outperform some competitors on the market. The second case is that the sensor and IMU clocks are synchronized, so their offset is fixed; this is also usable, because the offset can be estimated and corrected during initialization. The third case, unsynchronized clocks, is fatal for vSLAM and cannot be used.
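For the second case, here is a minimal sketch of aligning IMU samples to an image timestamp by linear interpolation once the (assumed fixed) clock offset has been estimated; the function and variable names are hypothetical.

```python
# Sketch: sample the gyro at an image timestamp by interpolation.
# Assumes sorted IMU timestamps t_imu (N,) and gyro readings (N, 3).
import numpy as np

def gyro_at(t_img, t_imu, gyro, t_offset=0.0):
    t = t_img + t_offset           # fixed offset estimated at startup
    out = np.empty(3)
    for axis in range(3):          # linear interpolation per axis
        out[axis] = np.interp(t, t_imu, gyro[:, axis])
    return out
```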

[Figure: three image–IMU synchronization cases]

During productization we also need to accelerate the algorithms, including CPU instruction-set acceleration, GPU acceleration via CUDA or OpenCL, and FPGA, DSP, or ASIC acceleration. A common misconception is that lightly modifying an open-source project is enough to ship a product; in reality this is very difficult. Algorithms need to be tuned for the scene, fused with different sensors, and accelerated and optimized for the target platform, so that host requirements and cost come down far enough for the product to succeed.

We have also run VINS-based algorithms on the Jetson TX2, and it is worth mentioning that there is a dedicated GPU-optimized version of VINS-Fusion, which improves its performance on NVIDIA GPU products.

Applications of vSLAM in Various Scenarios

As vSLAM has developed, it has found mature applications in many fields. Let’s take a look at which products have applied vSLAM.

[Figure: products applying vSLAM]

Among the positioning, navigation, and obstacle-avoidance solutions currently on the market, binocular + IMU products have gradually become the mainstream direction thanks to their accuracy and deployment cost, while the visual sensor also provides recognition capability.

Previously, AR and VR relied mainly on external devices for positioning, or on QR codes deployed in the environment, which placed high demands on the environment and on deployment.

Now, new VR and AR positioning products use VIO technology, allowing headsets to localize autonomously. These demand high image frame rates, and the positioning frame rate must be high as well, while the sensors must not be too heavy and power consumption must stay low. The high positioning rate is tied to the headset's display refresh rate: the system must respond quickly, or users become dizzy.

– Drone Obstacle Avoidance

For drone obstacle avoidance, more solutions now use binocular systems, because binocular cameras serve outdoor obstacle avoidance and navigation well, and higher resolution improves object detection.

– Autonomous Driving

Autonomous driving applications mostly rely on multi-sensor fusion, combining high-precision GPS, millimeter-wave radar, cameras, lidar, inertial navigation, and other sensors. On the vision side, we care especially about the dynamic range of the images, so that cameras perform well in low light or backlight (e.g., tunnels). We also want automotive-grade sensors, since cabin temperatures can become very high under direct summer sunlight.

The following image shows positioning and navigation on a dataset with GPS fused via VINS, demonstrating that binocular + GPS can also achieve good results.

[Figure: VINS trajectory with GPS fusion]

Application of vSLAM in Real-time Navigation and Obstacle Avoidance for Robots

For a security robot, we used the chassis of a sweeping robot, with our binocular camera for positioning, navigation, and obstacle avoidance. On top there is a 2D lidar for mapping, plus a panoramic rig of three cameras for capturing panoramic images. Next, let's discuss the role of each sensor in this system.

[Figure: security robot sensor configuration]

– Ultrasonic Sensors

Ultrasonic sensors are not very accurate and are easily affected by the environment. Here they mainly solve a problem that vision and lidar cannot: recognizing transparent objects such as glass, which ultrasound detects reliably.

– Binocular Sensors

The binocular camera mainly provides pose information, along with point clouds and depth for obstacle avoidance, and also performs loop detection; its main functions are positioning, navigation, and obstacle avoidance.

– IMU Sensors

We fuse the IMU with vision in the vSLAM/VIO algorithms. The IMU we use is consumer grade and not very precise, yet the results have been excellent.


– Lidar

Most current products use single-line (2D) lidars, because multi-line lidars are too expensive. Lidars perform well in mapping accuracy and in certain scenarios. How much sensor redundancy and coordination the system needs depends on the application scenario and product positioning; in some scenarios, good results can be achieved without lidar at all.

– Chassis Odometry

Also note that chassis (wheel) odometry is relatively precise, but wheel slip and uneven ground can corrupt it, which calls for a multi-sensor fusion solution; a minimal sketch follows.
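As one illustration of such fusion, the sketch below blends wheel odometry with gyro heading in a complementary fashion so that wheel slip corrupts the yaw estimate less; the blend weight alpha is an assumed value, not something from the lecture.

```python
# Complementary fusion of wheel odometry and gyro heading (sketch).
import math

def fuse_step(x, y, yaw, d_wheel, d_yaw_wheel, d_yaw_gyro, alpha=0.98):
    # Gyro dominates short-term heading changes; wheel-derived heading
    # contributes a small share, since wheels slip on smooth floors.
    yaw += alpha * d_yaw_gyro + (1.0 - alpha) * d_yaw_wheel
    x += d_wheel * math.cos(yaw)     # advance along the fused heading
    y += d_wheel * math.sin(yaw)
    return x, y, yaw
```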

– Main Unit

Common main units today are GPU or x86 platforms. We typically use NVIDIA's embedded platforms and optimize our algorithms for ARM + GPU. To reduce cost, we also port and optimize on other platforms.


