Produced by | Zhixiang Open Class
Instructor | Yang Ruihuan, CTO of Xiaomi Intelligent
Editor | Wang Xin
Introduction:
On October 17, the “Zhixiang Open Class” launched the second season of the computer vision application series, the third lecture themed “How to use vSLAM to help robots achieve precise navigation and obstacle avoidance in different scenarios,” presented live by Yang Ruihuan, CTO of Xiaomi Intelligent.
In this lecture, Yang Ruihuan began with the development history of vSLAM, introduced the different types of cameras, analyzed the technical principles and implementations of different algorithms, proposed solutions for various scenarios, and presented application cases of vSLAM in robotics.
This article is a transcript of the open class.
Hello everyone, I am Yang Ruihuan, CTO of Xiaomi Intelligent. I am very glad to have the opportunity to communicate and learn with you! Let me introduce our company. Xiaomi Intelligent focuses on binocular vision sensors, vSLAM algorithms, and solutions. We have now launched several cameras suitable for vSLAM.
Today, I will share with you the theme of “How to use vSLAM to help robots achieve precise navigation and obstacle avoidance in different scenarios.” I will share from the following four aspects:
1. Development history of vSLAM
2. Technical principles of vSLAM and implementations of different algorithms
3. Application challenges and solutions of vSLAM in different scenarios
4. Practical applications of vSLAM in real-time navigation and obstacle avoidance in robots
Development History of vSLAM
The development of vSLAM has gone hand in hand with the evolution of visual sensors; the two are interdependent. The earliest sensor was the monocular camera, from which monocular vSLAM algorithms were born. The Xbox Kinect turned structured light and ToF into consumer electronics and achieved large-scale production. Combining a monocular camera with a depth camera led to RGB-D SLAM, and there are also approaches that convert depth maps into laser scans for mapping.
Binocular cameras have become mainstream sensors thanks to the increasing computing power of CPUs and GPUs in recent years. Visual-inertial odometry places high demands on the synchronization accuracy between images and the IMU, so we also developed dedicated binocular inertial cameras. A multi-camera rig can be understood simply as several binocular pairs combined, extending the field of view in multiple directions.
Now, let me briefly introduce the characteristics of each camera:
-Monocular Camera
The biggest problem with monocular cameras is the lack of scale information: a single image cannot reveal the true size or distance of what it shows. Humans can judge this because we have undergone extensive training, much like deep learning. Machine vision, however, cannot recover spatial relationships from a single image; the camera must move so that the relationships between images taken at different poses can be estimated, yielding scale information. In effect, this turns the monocular camera into a variable-baseline binocular camera.
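To make the scale ambiguity concrete, here is a minimal NumPy sketch (values are invented for illustration): scaling the whole scene and the camera translation by the same factor produces pixel-identical images, so no single-camera sequence can reveal the true scale without extra information.

```python
import numpy as np

# Toy pinhole camera: project 3D points before and after scaling the
# whole scene (points AND camera translation) by an arbitrary factor s.
f = 500.0                        # focal length in pixels (arbitrary)
K = np.array([[f, 0, 320], [0, f, 240], [0, 0, 1]])

pts = np.array([[0.5, 0.2, 4.0], [-0.3, 0.1, 5.0], [0.1, -0.4, 6.0]])
t = np.array([0.2, 0.0, 0.0])    # camera translation between two frames

def project(P, t):
    p = (K @ (P - t).T).T
    return p[:, :2] / p[:, 2:3]

s = 3.7                          # any scale factor
a = project(pts, t)              # original scene
b = project(s * pts, s * t)      # scene and baseline scaled together
print(np.allclose(a, b))         # True: the projected images are identical
```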
-Structured Light Camera
Structured light cameras project infrared patterns onto the scene, and an IR camera observes how the patterns deform on surfaces to compute depth. The distance between the IR emitter and the receiver acts like a binocular baseline, and the principle is similar to triangulation. Various patterns are used, such as stripes; Kinect uses random speckle patterns, with three different speckle sizes at three different distances to cover different measurement ranges.
-ToF Camera
A ToF camera emits modulated light pulses and uses a sensor to receive the reflected light, computing distance from the time of flight. The resolution of ToF cameras is generally low, typically QVGA (Quarter VGA), although VGA is also available; as resolution increases, so does cost.
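As a quick worked example of the time-of-flight principle (illustrative numbers, not from any particular camera): distance is half the round-trip time multiplied by the speed of light.

```python
# Pulse-based ToF: distance is half the round-trip time times the speed
# of light. A 10 ns round trip corresponds to about 1.5 m.
c = 299_792_458.0           # speed of light, m/s
t_round_trip = 10e-9        # measured round-trip time, seconds
distance = c * t_round_trip / 2
print(f"{distance:.3f} m")  # ~1.499 m
```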
-Binocular Camera
The principle of binocular cameras is similar to that of human eyes. Our eyes perceive depth through the disparity between what the left and right eyes see, and binocular cameras work on the same principle.
This image shows a simplified model of binocular triangulation, where P is the object being measured, z is the distance from the object to the camera, b is the baseline, f is the focal length, and d is the disparity. By similar triangles: z/f = b/d, i.e. z = f·b/d. In practice, computing binocular disparity is more involved; OpenCV provides the BM and SGBM algorithms.
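Here is a hedged sketch of how this is typically done with OpenCV's SGBM (the file names, focal length, and baseline below are placeholders; real values come from your rectified calibration):

```python
import cv2
import numpy as np

# Left/right images must already be rectified (epipolar lines horizontal).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,        # must be divisible by 16
    blockSize=9,
    P1=8 * 9 * 9,             # smoothness penalties (OpenCV convention)
    P2=32 * 9 * 9,
)
# SGBM returns fixed-point disparity scaled by 16.
disp = sgbm.compute(left, right).astype(np.float32) / 16.0

# Depth from disparity: z = f * b / d (f in pixels, b in meters).
f, b = 700.0, 0.12            # example intrinsics; use your calibration
valid = disp > 0
depth = np.zeros_like(disp)
depth[valid] = f * b / disp[valid]
```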
Since disparity computation does not rely on other sensors, binocular cameras can be used effectively both indoors and outdoors. The image above shows the effect in indoor applications. For weakly textured surfaces, such as white walls or textureless tables, we can overlay an IR structured-light texture to enrich the surface texture, making the disparity easier to compute.
The above shows the effect of binocular cameras outdoors. Structured light cameras primarily emit and receive IR light, and sunlight contains a lot of IR light that can interfere, so most structured light cameras and ToF cameras do not work well outdoors. However, binocular cameras do not have this problem, so they are widely used in automotive ADAS.
We combine binocular structured light with an IMU into what we call a binocular structured-light inertial camera; through multi-sensor fusion and complementarity, it performs well even under pure rotation.
Technical Principles of vSLAM and Implementations of Different Algorithms
Next, I will introduce the technical principles of vSLAM and some open-source vSLAM algorithm implementations. The technical framework of vSLAM mainly comprises sensor preprocessing, the front-end, the back-end, loop closure detection, and mapping.
-Sensor
Common sensors include monocular, binocular, and RGB-D cameras. Because of the scale uncertainty of monocular cameras, initialization generally takes longer and the estimated pose accumulates error. Binocular cameras carry scale information and can initialize quickly, but they require careful calibration. RGB-D cameras are strongly affected by sunlight outdoors, and reflective surfaces also interfere with them. If an IMU is added, synchronization between the IMU and the images must also be considered.
-Front-end
The front-end, also known as visual odometry, estimates the motion of the camera between adjacent frames, solving the localization problem. It involves feature extraction and pose estimation algorithms.
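A minimal sketch of one common front-end recipe using OpenCV (ORB matching plus essential-matrix pose recovery; this is an illustration of the general technique, not the exact pipeline of any system discussed here):

```python
import cv2
import numpy as np

def estimate_motion(img1, img2, K):
    """Frame-to-frame pose up to scale: ORB features -> matches -> E -> R, t."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)

    p1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    p2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC rejects outlier matches while fitting the essential matrix.
    E, mask = cv2.findEssentialMat(p1, p2, K, cv2.RANSAC, 0.999, 1.0)
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=mask)
    return R, t   # with a monocular camera, |t| is only known up to scale
```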
Because visual odometry only estimates the motion between adjacent frames, errors accumulate over time, which is why back-end optimization and loop closure detection are needed to close the loop.
-Back-end
Back-end optimization primarily involves two types of algorithms: optimization based on filtering theory and non-linear optimization. In recent years, non-linear optimization-based vSLAM has become increasingly popular and has gradually become mainstream, but filtering theory has a longer development history and has had good application cases, so it remains very viable.
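To illustrate the nonlinear-optimization idea, here is a toy one-dimensional pose graph solved with SciPy's least_squares (all values invented for illustration): odometry constraints plus one loop-closure constraint are written as residuals, and the solver distributes the accumulated error across the poses.

```python
import numpy as np
from scipy.optimize import least_squares

# 1D pose graph: 4 poses linked by odometry, plus one loop-closure
# constraint saying pose 3 should sit 3.05 units from pose 0.
odometry = [(0, 1, 1.0), (1, 2, 1.1), (2, 3, 0.9)]   # (i, j, measured j - i)
loop = [(3, 0, -3.05)]                               # slightly inconsistent

def residuals(x):
    r = [x[0]]  # anchor the first pose at 0 (removes gauge freedom)
    for i, j, z in odometry + loop:
        r.append((x[j] - x[i]) - z)
    return np.array(r)

x0 = np.array([0.0, 1.0, 2.1, 3.0])   # initial guess from raw odometry
sol = least_squares(residuals, x0)    # nonlinear least-squares solve
print(sol.x)  # poses adjusted so odometry and loop closure agree
```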
-Loop Closure Detection
Loop closure detection is similar to how humans use their eyes to determine if two places are the same. When identical points are detected, a loop closure correction is performed. However, a notable issue is that highly similar scenes may lead to false loop closures, such as walls with identical repeating textures. In such cases, additional constraints must be introduced to prevent erroneous loop closures.
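As a toy illustration of place recognition (real systems use a bag-of-words vocabulary such as DBoW2 for speed and robustness, plus geometric and temporal consistency checks to avoid exactly the false loop closures described above):

```python
import cv2

def similarity(img_a, img_b, ratio=0.75):
    """Toy place-recognition score: the fraction of ORB features in img_a
    that find a confident (ratio-test) match in img_b."""
    orb = cv2.ORB_create(1000)
    _, da = orb.detectAndCompute(img_a, None)
    _, db = orb.detectAndCompute(img_b, None)
    if da is None or db is None:
        return 0.0
    knn = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(da, db, k=2)
    good = [p[0] for p in knn
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good) / max(len(da), 1)
```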
-Mapping
vSLAM creates several types of maps:
(1) Feature maps store geometric feature points and keyframes; they can be used for localization directly or saved as maps for later navigation.
(2) Topological maps maintain the relative relationships of points and lines but do not concern themselves with accurate area, distance, or direction.
(3) Grid maps add metric scale to a two-dimensional map. For example, the small squares in the background of the above image each represent a fixed unit of size, and the whole map is described in those units. Two-dimensional maps are generally suitable for ground robots that do not move much in height (a minimal grid-map sketch follows this list).
(4) Three-dimensional maps reconstruct the three-dimensional environment and scene, for example OctoMap or point-cloud maps. With these maps we can perform three-dimensional reconstruction or path planning, such as drone path planning and navigation.
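To make the grid-map idea in (3) concrete, here is a minimal occupancy-grid sketch in log-odds form (the parameters are invented tuning values; a real mapper would also ray-trace the free cells along each measurement beam):

```python
import numpy as np

# Minimal 2D occupancy grid in log-odds form: each cell accumulates
# evidence of being free or occupied as measurements arrive.
grid = np.zeros((200, 200))      # 200 x 200 cells
res = 0.05                       # 5 cm per cell -> a 10 m x 10 m map
L_OCC, L_FREE = 0.9, -0.4        # log-odds increments (tuning values)

def update_cell(x, y, occupied):
    """x, y in meters relative to the map origin at the grid center."""
    i = int(y / res) + grid.shape[0] // 2
    j = int(x / res) + grid.shape[1] // 2
    if 0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]:
        grid[i, j] += L_OCC if occupied else L_FREE

# Probability of occupancy recovered from log-odds when needed:
# p = 1 - 1 / (1 + np.exp(grid))
```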
Next, I will quickly introduce several open-source SLAM systems:
-ORB_SLAM
ORB_SLAM is built around ORB features. It is a complete SLAM system that includes visual odometry, tracking, and loop closure detection, a monocular SLAM system based entirely on sparse feature points, and it also provides interfaces for monocular, binocular, and RGB-D cameras.
ORB features combine the FAST detector with the BRIEF descriptor, improving and optimizing both, which brings significant gains in performance and speed. The link above details how our binocular product integrates ORB_SLAM; that integration does not include IMU computation.
The author of ORB_SLAM also published a paper on monocular visual-inertial vSLAM that can handle pure rotation. Wang Jing, a well-known expert in China, wrote a preliminary implementation of it; it is for verification only and still has some bugs, but the code linked below is worth studying.
-VINS
VINS-Mono and VINS-Mobile are open-source monocular visual-inertial SLAM solutions from Professor Shen Shaojie's team at the Hong Kong University of Science and Technology. They are classic, highly recommended monocular VIO projects: their computational requirements are modest and their accuracy is good, especially VINS-Mobile, which can run on iOS.
VINS-Fusion is an extension of VINS-Mono that supports multi-sensor combinations, including monocular + IMU, binocular + IMU, and even pure binocular, and it also offers a version with GPS fusion. We had the honor of discussing sensor optimization for VINS with its author, Qin Tong, providing optimization plans and support, which helped VINS perform well on our sensors from its initial release.
-OKVIS
OKVIS is a binocular VIO project released by ETH Zurich. It is a classic binocular VIO, but it only outputs a six-degrees-of-freedom pose, without loop closure detection or mapping, so strictly speaking it is not a complete SLAM system. Although its accuracy is good, its pose may drift when the device is stationary for extended periods, which is a known issue.
The code structure of OKVIS is very clear, although its functionality is limited to localization. It implements tightly coupled multi-sensor fusion, and the code is highly recommended for study.
-maplab
maplab is another vSLAM framework from ETH Zurich, following OKVIS. It consists of two parts: an online VIO front end based on ROVIO and an offline SLAM processing console.
-MSCKF
MSCKF (Multi-State Constraint Kalman Filter) is a classic filter-based vSLAM, known for its low computational requirements and decent accuracy.
Application Challenges and Solutions of vSLAM in Different Scenarios
Next, I will introduce some comparisons of the performance of several common VIOs and some considerations when selecting sensors.
The test machines are an Intel NUC, an ODROID-XU4, and an UP Board. The NUC and UP Board are both x86 platforms, while the XU4 is an ARM board.
In this graph, the horizontal axis represents pose error, with smaller errors being better. The vertical axes represent CPU usage, memory usage, and processing time. From these results, VINS, OKVIS, and ROVIO perform well, and with new algorithms emerging, new evaluations will also follow, so stay tuned.
-Image Sensors
For vSLAM, global shutters are better than rolling shutters, because a global shutter exposes the whole image at once while a rolling shutter exposes it line by line. A common misconception is that global shutters do not blur; in fact, blur is determined by exposure time. In dark environments, if auto-exposure lengthens the exposure time, even modest motion causes blur, which is also influenced by the sensor's light-sensitive area and the lens aperture. What a global shutter primarily eliminates is rolling-shutter distortion (the "jello" effect).
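A rough back-of-the-envelope check (illustrative numbers): for a rotating camera, blur in pixels is approximately angular velocity times exposure time times focal length in pixels, regardless of shutter type.

```python
import numpy as np

# Rough motion-blur estimate for a rotating camera.
omega = np.deg2rad(30)      # 30 deg/s rotation rate
t_exp = 0.010               # 10 ms exposure in a dim room
f_px = 500.0                # focal length in pixels
blur_px = omega * t_exp * f_px
print(f"{blur_px:.1f} px")  # ~2.6 px of smear, global shutter or not
```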
-Lenses
A larger field of view is better for SLAM systems, as more information can be captured, allowing for more feature points. However, when choosing wide-angle lenses, the distortion of the image must not be too extreme, requiring a matching calibration and correction model.
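For wide-angle or fisheye lenses, OpenCV's fisheye (equidistant) model is one matching choice; here is a hedged sketch (K and D below are placeholders, obtained in practice from cv2.fisheye.calibrate):

```python
import cv2
import numpy as np

# Placeholder intrinsics; real values come from fisheye calibration.
# Using the plain pinhole distortion model on a very wide lens would
# leave large residual distortion at the image edges.
K = np.array([[285.0, 0, 320.0], [0, 285.0, 240.0], [0, 0, 1.0]])
D = np.array([0.01, -0.02, 0.003, -0.001])   # equidistant-model coeffs

map1, map2 = cv2.fisheye.initUndistortRectifyMap(
    K, D, np.eye(3), K, (640, 480), cv2.CV_16SC2)
img = cv2.imread("frame.png")
undistorted = cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)
```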
-Binocular Sensors
Additionally, it is crucial to ensure synchronization between the two sensors of a binocular pair: the left and right cameras must expose at the same instant, with the same AE (automatic exposure) and AWB (automatic white balance) settings.
-Image + IMU
Synchronization between images and the IMU is also strictly required. The best case is when the IMU samples and image timestamps are precisely aligned; this is hard to achieve, but our sensors generally do, which is a significant advantage over competitors. The second case is when the sensor and IMU clocks are synchronized but offset; a fixed offset can then be estimated during initialization and remains stable. The third case, unsynchronized clocks, is fatal for vSLAM and cannot be used.
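For the second case, one way to estimate a fixed offset is to cross-correlate the IMU angular rate with the rotation rate estimated from the images; a minimal sketch follows (in practice, calibration tools such as Kalibr estimate this offset together with the extrinsics far more carefully):

```python
import numpy as np

def estimate_time_offset(gyro_rate, cam_rate, dt):
    """Estimate a fixed camera-IMU time offset by cross-correlating the
    angular-rate magnitude from the IMU with the rotation rate estimated
    from images, both resampled to the same period dt (1D arrays of
    equal length)."""
    a = gyro_rate - gyro_rate.mean()
    b = cam_rate - cam_rate.mean()
    corr = np.correlate(a, b, mode="full")
    lag = np.argmax(corr) - (len(b) - 1)
    return lag * dt   # offset in seconds; sign tells which stream leads
```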
During productization, we also need to accelerate the algorithms, including CPU instruction-set acceleration, GPU acceleration with CUDA or OpenCL, and acceleration with FPGAs, DSPs, ASICs, and so on. A common misconception is that productization can be achieved by lightly modifying open-source projects; this is very difficult. Algorithms must be optimized for the target scenes, fused with different sensors, and accelerated for specific platforms in order to lower host requirements and costs and become successful products.
We have also run VINS-related algorithms on the Jetson TX2, and it is worth mentioning that there is a dedicated GPU-optimized version of VINS-Fusion that improves its performance on NVIDIA GPU products.
Applications of vSLAM in Different Scenarios
vSLAM has matured significantly and is now applied in many fields. Let’s take a look at the products where vSLAM has been applied.
Among the various positioning and obstacle-avoidance solutions on the market, products combining binocular cameras + IMU have gradually become the mainstream choice thanks to their accuracy and deployment cost, while the visual sensor also provides recognition capability.
In the past, AR and VR relied mainly on external devices for positioning, or on QR codes deployed in the environment, which placed high demands on the environment and the deployment itself.
Now, some new VR and AR positioning products use VIO technology, allowing headsets to position themselves autonomously. The focus is on high frame rates for both images and positioning, with sensors that are lightweight and energy-efficient. The high positioning frame rate is tied to the headset's display refresh rate: the system must respond quickly, or users get dizzy.
-Drone Obstacle Avoidance
For drone obstacle avoidance, more solutions now use binocular systems, since they work well for outdoor obstacle avoidance and navigation, and higher resolutions improve object detection.
-Autonomous Driving
In autonomous driving, most solutions rely on multi-sensor fusion, combining high-precision GPS, millimeter-wave radar, cameras, lidar, and inertial sensors. On the vision side, we focus on the dynamic range of the images, so that cameras perform well in low-light or backlit environments (such as tunnels). We also want automotive-grade sensors, because the temperature inside a car under direct summer sunlight can become very high.
The following image shows VINS running with GPS fusion on a dataset for positioning and navigation, illustrating that binocular + GPS can achieve excellent results.
Practical Applications of vSLAM in Real-Time Navigation and Obstacle Avoidance in Robots
For a security robot, we used a cleaning-robot chassis together with our binocular camera for positioning, navigation, and obstacle avoidance. On top, a 2D lidar is used for mapping, and a three-camera panoramic rig captures panoramic images. Now, let's discuss the roles of the different sensors in this system.
-Ultrasonic Sensor
Ultrasonic sensors have relatively low precision and are easily affected by the environment. Their main role is to cover the blind spot that visual and laser sensors have with transparent objects such as glass, which ultrasonic sensors can detect reliably.
-Binocular Sensor
Binocular sensors primarily provide pose information, along with point-cloud and depth data for obstacle avoidance, and they also support loop closure detection, playing a crucial role in positioning, navigation, and obstacle avoidance.
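A minimal sketch of the back-projection that turns a depth image into an obstacle point cloud (standard pinhole equations; the intrinsics come from calibration):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into a 3D point cloud in the
    camera frame: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]    # drop pixels with no valid depth
```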
-IMU Sensor
In our vSLAM VIO algorithms we fuse the IMU with vision. The IMU we use is a consumer-grade model with modest precision, yet the fused system achieves excellent results.
-Lidar
Most current products use single-line lidar, because multi-line lidar is expensive. Lidar performs well in mapping accuracy and in special scenarios. How much sensor redundancy and cooperation this system needs depends on the application scenario and product positioning; in some cases lidar can be omitted while still achieving excellent results.
-Chassis Odometry
It is also important to account for the chassis odometry: wheel rotation is generally accurate, except when the wheels slip or the ground is uneven, which is why a multi-sensor fusion approach is needed.
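For reference, here is a minimal dead-reckoning update for a differential-drive chassis from wheel-encoder increments (a sketch under the no-slip assumption, which is exactly what breaks on slippery or uneven ground):

```python
import numpy as np

def diff_drive_step(x, y, theta, d_left, d_right, wheel_base):
    """Dead-reckoning update from left/right wheel travel (meters).
    Slipping wheels violate this model, which is why the result is
    usually fused with vSLAM/IMU estimates."""
    d = (d_left + d_right) / 2.0            # distance traveled by center
    d_theta = (d_right - d_left) / wheel_base
    x += d * np.cos(theta + d_theta / 2.0)  # midpoint integration
    y += d * np.sin(theta + d_theta / 2.0)
    theta += d_theta
    return x, y, theta
```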
-Host
Currently, common hosts are GPU- or x86-based systems. We primarily use NVIDIA's embedded platforms, optimizing our algorithms for ARM + GPU. Where low cost is required, we also port and optimize our algorithms for other platforms.
END