Edge Computing and Its Application in Video Structuring

1

Introduction

In recent years, with the continuous increase in security needs, the number of surveillance cameras has exploded. According to industry forecasts, video streams would account for 80% of total internet traffic by 2023. Video surveillance will therefore continue to play an important role in the security field, and expectations for the performance of surveillance systems keep rising. As a branch of computer vision, intelligent surveillance technology has become a research hotspot in recent years. Traditional manual viewing can hardly cope with massive surveillance video streams and incurs huge labor costs. There is thus a need for intelligent video analysis technology that automatically extracts key information from video data, analyzes and understands the content generated by one or more cameras in real time, and performs tasks such as target detection and recognition while video data is still being generated and transmitted.
Intelligent video analysis has been a hot topic in academia and industry since the 1970s. Because surveillance cameras generate video streams continuously, intelligent video analysis must operate in real time. Real-time video stream analysis can be applied in fields such as intelligent security, smart cities, and autonomous driving; it can not only replace traditional manual monitoring but also expand the application boundaries of video analysis tasks.
Faced with the explosive growth of video data generated by surveillance cameras, the traditional cloud computing model faces severe challenges. Data is backhauled over the operator's access network and processed through the core network, placing great strain on current operator networks. Given present network capacity, the congestion and transmission delays caused by massive data will greatly limit the real-time performance of intelligent video analysis. Security and energy consumption during large-scale data transfer must also be addressed. There is therefore an urgent need for a more reasonable way to solve these problems.
In 2014, the European Telecommunications Standards Institute (ETSI) proposed the new concept of Edge Computing (EC): providing internet technology and cloud computing capabilities through wireless access networks close to mobile users. By moving the functions of traditional cloud data centers toward the network edge, edge computing enables data collection, analysis, and processing at the data source, meeting the system's real-time processing requirements while greatly improving network resource utilization. Video surveillance systems built on edge computing use technologies such as computer data structures, digital signal processing, and computer vision to automatically detect various targets in the scenes captured by cameras, saving significant manpower and resources while improving monitoring efficiency.

2

Edge Computing

01
Overview of Edge Computing

Edge computing concentrates storage and computing resources at the network edge, close to people, objects, mobile devices, sensors, and other data sources, providing services nearby through a platform that integrates networking, computing, storage, and applications.

In 2016, Shi Weisong et al. observed that in the era of the Internet of Everything, cloud computing alone could not efficiently handle the massive data generated by the growing number of devices deployed at the network edge. They argued that edge computing, as a new computing paradigm, could address problems such as limited computing power, response-time requirements, bandwidth cost, data security and privacy, and battery life by processing data at the network edge. They defined the "edge" as any computing and network resource node along the path between the data source and the cloud, and discussed the possibilities of edge computing in scenarios such as smart cities, video analysis, and smart homes. In the same year, Sun Xiang et al. proposed EdgeIoT, a new mobile edge computing approach that processes the raw data streams generated by massive numbers of IoT devices, overcoming the scalability limits of traditional IoT. EdgeIoT can significantly reduce the traffic load on the core network and the end-to-end latency between IoT devices and computing resources, facilitating the rollout of IoT services.

In 2017, Satyanarayanan et al. discussed the importance of edge computing and its proximity to data sources: edge computing enables fast and accurate responses, improves application scalability, allows finer-grained control over data for privacy protection, and reduces the impact of network or cloud failures on services. In 2019, Wang Rixin et al. proposed a video surveillance system based on permissioned blockchain and edge computing. The system uses edge computing for information collection and data processing from large-scale wireless sensors, distributed storage services for massive video data, and convolutional neural networks for video analysis and real-time monitoring, representing a successful application of edge computing in video surveillance. In 2020, Qi Hui et al. designed a warehouse video surveillance system based on edge computing that offloads video tasks to edge computing nodes, preprocessing the massive real-time data generated by surveillance cameras within a short time while ensuring data reliability, effectively reducing processing and transmission delays.

02
Mainstream Edge Computing Vendors and Their Devices

Many technology companies in China and abroad have enhanced the performance of embedded devices, making edge computing a practical reality.

(1) NVIDIA's Jetson platform pairs embedded modules with GPUs that can efficiently run a wide range of deep learning workloads. In November 2021, NVIDIA released the compact yet powerful Jetson AGX Orin, suitable for robotics, autonomous machines, medical devices, and embedded edge computing scenarios.

(2) Huawei launched the Atlas 500 smart station for edge applications, using the Ascend 310 processor. The Ascend 310 is an efficient, flexible, and programmable AI processor that employs a Da Vinci architecture AI core for deep learning acceleration. Huawei also provides the MindSpore AI framework for deep learning to simplify the development process.

(3) Bitmain's SE5 AI computing box is a high-performance, low-power edge computing product built on the company's own chips and modules, featuring Bitmain's self-developed third-generation TPU chip with INT8 computing power reaching 17.6 TOPS. It can process 16 channels of HD video simultaneously, providing computing power for intelligent operations in various industry projects.

(4) Google’s deep learning accelerator Edge TPU can perform real-time deep learning inference for HD video with high energy efficiency.

(5) Cambricon Technologies' NPU (Neural Processing Unit) has been successfully commercialized in Huawei smartphones equipped with the Kirin 970 SoC, supporting real-time intelligent image optimization features in Huawei phones.

03
Video Structuring

Video structuring is a video intelligent analysis technology that analyzes content and extracts information from video data. The structured data is stored in a structured database for further analysis and sharing. Target detection algorithms are one of the key technologies for video structuring, commonly used to extract people, objects, and attributes from video images.

1. Overview of Target Detection Algorithms

The target detection task requires determining the types of objects contained in the input data and locating their positions. Target detection technology is not only the cornerstone of intelligent video surveillance but is also widely applied in other fields, such as image retrieval and medical image analysis. Target detection algorithms based on deep convolutional neural networks are generally divided into Two-Stage and One-Stage target detection algorithms.

Two-Stage target detection algorithms divide the task into two steps: first extract candidate sub-regions of the image where targets may exist, then apply a convolutional neural network to those sub-regions for feature extraction, detection classification, and bounding-box regression. A representative algorithm is Fast R-CNN, which takes the entire image as input for feature extraction and introduces the concept of the Region of Interest (ROI), improving detection efficiency over earlier algorithms. One-Stage target detection algorithms treat bounding-box prediction directly as a regression problem, eliminating the separate candidate-region extraction step: the original image is the input and predictions are output directly, forming an end-to-end single-network framework. YOLO is a representative algorithm; its significant breakthrough lies in a substantial increase in detection speed at the cost of only a slight decrease in accuracy.
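Both families of detectors emit many overlapping candidate boxes for the same target, which are typically filtered by non-maximum suppression (NMS). The following is a minimal illustrative sketch in plain Python, not the implementation of any specific framework; the box format `(x1, y1, x2, y2)` and the 0.5 threshold are assumptions for the example.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep boxes in descending score order,
    dropping any box that overlaps an already-kept box by more than thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate boxes and one distant box: the duplicate is suppressed.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```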

2. Common Lightweight Networks at the Edge

In 2016, the lightweight network SqueezeNet was proposed, achieving results similar to AlexNet while using 50 times fewer parameters on the ImageNet dataset, paving the way for model compression. In 2017, Google introduced MobileNet, which constructs lightweight deep neural networks using depthwise separable convolutions, significantly reducing parameter counts while maintaining model performance, making it suitable for visual tasks on mobile and embedded platforms. Subsequent improvements led to MobileNetV2 and MobileNetV3 based on this algorithm. ShuffleNet, proposed in 2017, shares a similar idea with MobileNet, significantly reducing computational costs while maintaining accuracy, suitable for devices with very limited computing power. GhostNet, introduced in 2020, customizes the number of convolutional kernels based on original convolutions and uses simple linear transformations to extract more features. This algorithm achieves good compression effects and accuracy, deployable on edge computing devices with limited memory and computing resources.
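The parameter savings behind MobileNet's depthwise separable convolutions can be shown with simple arithmetic. The sketch below counts parameters for one layer; the kernel size and channel counts (3×3, 256 in, 256 out) are chosen only for illustration.

```python
def conv_params(k, c_in, c_out):
    # Standard convolution: one k x k x c_in kernel per output channel.
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    # Depthwise step: one k x k kernel per input channel;
    # pointwise step: a 1 x 1 convolution that mixes channels.
    return k * k * c_in + c_in * c_out

std = conv_params(3, 256, 256)          # 589,824 parameters
sep = dw_separable_params(3, 256, 256)  # 2,304 + 65,536 = 67,840 parameters
print(std, sep, round(std / sep, 1))    # roughly an 8.7x reduction
```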

3

Technical Implementation

The application of edge computing in video structuring includes two steps: model development and model deployment. Model development refers to the design and training of neural network models, where the network model typically refers to the target detection model in video structuring tasks. Model deployment refers to transforming the model and deploying it to edge computing devices.

01
Model Development

In video structuring tasks, the categories of people and objects required for structuring are predefined, and there are currently comprehensive large generic datasets available for training target detection models, typically employing supervised learning to learn an optimized model from the given dataset.

Model development generally consists of three stages.

1. Data Preparation: Common datasets for target detection tasks include PASCAL VOC and MS COCO, where the annotation method involves labeling the bounding boxes and category labels of targets in training images. Depending on the application scenario of the target detection task, customized datasets can also be collected and annotated as needed.
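To make the annotation method concrete, here is a simplified, COCO-style record for one training image. The field names follow the MS COCO convention of `bbox = [x, y, width, height]`, but the file name, sizes, and values are hypothetical, and the real schema contains more fields.

```python
# A simplified, COCO-style annotation for a single training image.
# Boxes are [x, y, width, height] in pixels; category ids index the label list.
annotation = {
    "image": {"id": 1, "file_name": "frame_000123.jpg",
              "width": 1920, "height": 1080},
    "categories": [{"id": 1, "name": "person"}, {"id": 2, "name": "car"}],
    "annotations": [
        {"image_id": 1, "category_id": 1, "bbox": [412, 300, 64, 180]},
        {"image_id": 1, "category_id": 2, "bbox": [900, 520, 220, 140]},
    ],
}
```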

2. Network Construction: Build the neural network structure for the task. In the specific code implementation process, relevant modules such as numpy, TensorFlow, and PyTorch usually need to be imported.
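A detection backbone is far larger, but the idea of "building the network structure" can be sketched with a toy stack of layers in numpy. The layer sizes and initialization here are arbitrary illustrative choices, not a real detection architecture.

```python
import numpy as np

# A toy layer stack: what "network construction" means at its simplest,
# before a framework such as TensorFlow or PyTorch takes over.
rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

class Dense:
    def __init__(self, n_in, n_out):
        self.w = rng.standard_normal((n_in, n_out)) * 0.1
        self.b = np.zeros(n_out)
    def __call__(self, x):
        return x @ self.w + self.b

layers = [Dense(8, 16), relu, Dense(16, 4)]

x = rng.standard_normal((2, 8))  # a batch of 2 input vectors
for layer in layers:             # forward pass through the stack
    x = layer(x)
print(x.shape)  # (2, 4)
```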

3. Parameter Optimization: Design reasonable loss functions, learning rates, optimization methods, and iteration counts to train the network and obtain the best parameters, saving the network as a model file for subsequent model deployment work.
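The three knobs above (loss function, learning rate, iteration count) can be illustrated with a minimal gradient-descent loop. The toy one-parameter fitting task and the hyperparameter values are assumptions for the example, not a real detection training recipe.

```python
# Toy task: fit y = w * x to points generated with w = 2,
# using mean-squared-error loss and plain gradient descent.
data = [(x, 2.0 * x) for x in range(1, 6)]
w, lr, epochs = 0.0, 0.01, 200  # initial weight, learning rate, iterations

for _ in range(epochs):
    # L = mean((w*x - y)^2), so dL/dw = mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # gradient-descent update

print(round(w, 3))  # converges toward 2.0
```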

02
Model Deployment

The goal of model deployment is to ensure that the model can run the inference process stably with sufficient accuracy on the target hardware platform, which is a complex system engineering task. The key points of model deployment are elaborated below.

1. Model Conversion and Platform Selection. Currently, there are various edge computing device platforms available for model deployment, and different platforms support different model formats. For example, NVIDIA supports TensorRT models, while Huawei supports OM models, necessitating the conversion of the trained model to the format supported by the target platform for inference. Different tasks and models require corresponding platform-specific operators, leading to strong platform dependency.

2. Model Lightweighting. Deep learning models demand significant computing, memory, and energy, which makes real-time inference difficult on edge devices with limited computing resources. Model lightweighting is the key to addressing this, and it largely determines the speed and accuracy of system inference. Given the limited computing power of edge devices, model compression methods such as pruning and quantization are generally needed to make models lighter. Pruning removes components of the model that contribute little to the output: deep network models often contain many redundant parameters, and many neuron activation values approach zero, so these neurons can be removed while preserving the model's expressive capability. Quantization converts floating-point operations to integer operations by converting the weights and activation values of a trained deep neural network from high precision to low precision, compressing the model at the cost of a small amount of accuracy. Because lightweighting may reduce inference accuracy, a balance must be struck between inference speed and accuracy.
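The two compression steps above can be sketched on a toy weight list. This is an illustrative simplification: the weights, the 0.05 pruning threshold, and the symmetric int8 scaling scheme are all assumptions, and production toolchains (e.g. TensorRT, MindSpore) implement far more sophisticated versions.

```python
# Toy lightweighting: magnitude pruning, then symmetric int8 quantization.
weights = [0.82, -0.03, 0.41, 0.005, -0.77, 0.02]

# Pruning: zero out weights whose magnitude falls below a threshold.
pruned = [w if abs(w) >= 0.05 else 0.0 for w in weights]

# Quantization: map floats in [-m, m] onto integers in [-127, 127].
m = max(abs(w) for w in pruned)
scale = m / 127
quantized = [round(w / scale) for w in pruned]
dequantized = [q * scale for q in quantized]  # approximate recovery

print(pruned)      # small weights removed
print(quantized)   # integers suitable for int8 arithmetic
```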

3. Model Deployment. Currently, model deployment can generally be classified into two forms: pipelined deployment and service-oriented deployment. Pipelined deployment requires constructing a complete pipeline method independently, with many available SDKs such as NVIDIA's DeepStream, Huawei's MindX SDK, and Cambricon's CNStream. Service-oriented deployment treats the model as an inference service, providing inference services to clients, with business logic and other operations managed by the client, such as NVIDIA's Triton Inference Server and Baidu's Paddle Serving. It is essential to note that the choice of deployment form and device platform must consider development difficulty, cost, system performance, as well as future system maintenance and portability.

4

Conclusion

Edge computing has become a research hotspot in academia and industry in recent years. This article has introduced edge computing and its application in video structuring, detailing how target detection models are deployed on edge computing devices. Real-time video stream analysis and video structuring in fields such as intelligent security and smart cities are typical application scenarios of the edge computing paradigm. The successful deployment of edge computing in these fields is a key driving force for its further development and an important support for its application in a wider range of scenarios.

■ Written by Jiang Miao, Zhao Shixian, Ren Junxing, Li Min

Institute of Information Engineering, Chinese Academy of Sciences


China Security

Domestic Industry Authority Magazine

Published by: Editorial Department of “China Security”

Supervised by: China Security Product Industry Association

Publication Date: July 2022
