Edge AI video terminals implement complex video analysis functions in various edge environments (such as factories, power plants, and industrial parks). They do not rely on cloud servers and can process video streams from multiple cameras in real time, performing tasks such as object recognition and behavior analysis. Below, I will walk through the system step by step: hardware architecture, data acquisition, deep learning inference, event detection, and finally data storage and transmission.
1. Hardware Architecture and Components
The hardware architecture of the edge terminal is the foundation of the entire system, typically including the following key modules:
1.1 Central Processing Unit (CPU)
Function: The CPU is the core of the system, responsible for managing the overall operation flow of the device, scheduling inference tasks, and executing routine data processing (such as video preprocessing, data packaging, network communication, etc.).
Selection: Commonly used options are embedded low-power processors (such as the ARM Cortex-A series or Intel Atom series) or higher-performance embedded x86 processors.
Task Allocation: The CPU mainly handles non-AI intensive computational tasks, such as controlling camera interfaces, decision-making for rule engines, and managing communication with the cloud.
1.2 Graphics Processing Unit (GPU) or AI-Specific Acceleration Chip
Function: Executes the inference tasks of deep learning models. Because deep learning is computationally intensive, GPUs (such as those on NVIDIA Jetson modules) or dedicated AI accelerators (such as the Google Edge TPU or Intel Movidius VPU) are commonly used to accelerate inference.
Architecture: The AI inference module of the edge terminal typically uses low-power GPUs or NPUs, such as NVIDIA’s Maxwell or Volta architecture. These GPUs have high parallel computing capabilities, allowing for rapid processing of large-scale data in video streams.
Optimization Features:
(1) Support for low-precision calculations (such as INT8, FP16) to reduce memory and bandwidth usage while improving computation speed.
(2) Use of model partitioning or model compression techniques (such as pruning and quantization) to reduce complexity and improve inference efficiency without significantly lowering accuracy; a quantization sketch follows this list.
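To make the quantization step concrete, here is a minimal sketch using ONNX Runtime's post-training dynamic quantization to convert a detector's weights to INT8. The model file names are placeholders, and the exact tooling (TensorRT, OpenVINO, TFLite, etc.) depends on which accelerator the terminal uses.

```python
# Minimal post-training quantization sketch with ONNX Runtime.
# "detector_fp32.onnx" / "detector_int8.onnx" are placeholder file names.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="detector_fp32.onnx",    # original FP32 model
    model_output="detector_int8.onnx",   # quantized model deployed on the edge box
    weight_type=QuantType.QInt8,         # store weights as 8-bit integers
)
```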
1.3 Video Acquisition Module
Function: Real-time acquisition of video streams from camera inputs and decoding. Supports various interfaces (such as USB, MIPI CSI, and HDMI) and network protocols (such as RTSP and HTTP).
Data Format: Common video formats include H.264, H.265, which are decoded and converted into image frames (such as RGB, YUV formats) for use by deep learning models.
Decoding Engine: An integrated hardware-accelerated video decoder (such as a Video Codec Engine) can process 1080p or 4K video streams in real time at 30 fps or higher and pass them to the image preprocessing module.
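As an illustration of this stage, the sketch below pulls an H.264 RTSP stream through a GStreamer pipeline in OpenCV and yields decoded BGR frames. The URL and pipeline elements are placeholders, and OpenCV must be built with GStreamer support; a real deployment would substitute the platform's hardware decoder element (for example, nvv4l2decoder on Jetson).

```python
# Sketch of RTSP acquisition and H.264 decoding via GStreamer + OpenCV.
# URL and pipeline elements are placeholders; swap in the platform's
# hardware decoder where available.
import cv2

pipeline = (
    "rtspsrc location=rtsp://192.168.1.10:554/stream latency=100 ! "
    "rtph264depay ! h264parse ! avdec_h264 ! "
    "videoconvert ! video/x-raw,format=BGR ! appsink"
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)

while cap.isOpened():
    ok, frame = cap.read()   # frame is one decoded BGR image
    if not ok:
        break
    # pass the frame on to preprocessing and inference ...
cap.release()
```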
1.4 Memory and Data Cache
Function: Used to cache decoded image frames, intermediate computation results of the model, and final inference output results.
Design Requirements: Typically configured with 2GB to 16GB of DDR4 or LPDDR memory to support complex inference tasks, with faster on-chip caches (L2/L3) used to reduce data transfer latency.
1.5 Local Data Storage
Function: Used to store analysis results, model inference logs, video clips, and snapshots of detected events. Common storage devices include eMMC, SSD, or NVMe storage.
Management System: An integrated file management layer (such as an SQLite database) organizes and manages detection results and event logs. A circular buffer mechanism is typically used for video data, overwriting the oldest recordings when storage space runs low; a sketch of such an event log follows below.
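A rough sketch of the event log is shown here; the schema, column names, and paths are assumptions for illustration only.

```python
# SQLite event-log sketch; schema, column names, and paths are illustrative.
import sqlite3
import time

db = sqlite3.connect("/data/events.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id   TEXT PRIMARY KEY,
        timestamp  REAL,
        label      TEXT,      -- e.g. "intrusion"
        confidence REAL,
        clip_path  TEXT       -- saved video clip or snapshot
    )
""")
db.execute(
    "INSERT OR REPLACE INTO events VALUES (?, ?, ?, ?, ?)",
    ("evt-0001", time.time(), "intrusion", 0.95, "/data/clips/evt-0001.mp4"),
)
db.commit()
db.close()
```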
1.6 Communication Module
Function: Uploads analysis results to the cloud or local server via wired or wireless networks. Typical communication methods include:
(1) Ethernet interface: Used for high-speed data transmission in fixed deployments.
(2) Wi-Fi module: Suitable for flexible deployment environments.
(3) 4G/5G module: Used for remote or outdoor environments, ensuring critical data transmission even when the network is unstable.
Communication Protocols: Edge terminals support various data transmission protocols (such as MQTT, HTTP, CoAP, WebSocket) and can interact with third-party platforms (such as industrial IoT platforms, big data platforms) for data exchange.
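As a sketch of the MQTT path specifically (assuming the paho-mqtt client; the broker address, topic, and payload fields are placeholders):

```python
# Sketch: push one detection result to an MQTT broker with paho-mqtt's
# one-shot publish helper. Broker, topic, and payload fields are placeholders.
import json
import paho.mqtt.publish as publish

payload = {"device_id": "edge-box-01", "event": "intrusion", "confidence": 0.95}
publish.single(
    topic="edge/video/events",
    payload=json.dumps(payload),
    hostname="broker.example.com",
    qos=1,
)
```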
2. Detailed Workflow Analysis
2.1 Video Acquisition & Preprocessing
1. Camera Video Acquisition and Transmission:
(1) The camera captures video streams in real time and transmits them to the edge box via USB or network interfaces (such as RTSP).
(2) After receiving the video stream, the edge terminal decodes it in real time using the video decoder, converting it into image frames (usually in RGB format).
2. Image Preprocessing:
The decoded image frames undergo a series of preprocessing operations:
(1) Format Conversion: Converts from YUV to RGB or BGR format to meet the input requirements of the AI model.
(2) Resolution Adjustment: Scales 1080p or 4K image frames down to a resolution suitable for the model (such as 640×480) to balance computational resources.
(3) Image Cropping and ROI Extraction: Only processes the region of interest (ROI), avoiding unnecessary calculations on irrelevant areas. For example, in traffic scenarios, only the road area is detected while pedestrian paths are ignored.
3. Image Enhancement:
Applies image enhancement techniques (such as histogram equalization, noise reduction, edge sharpening) to improve image clarity, thereby increasing the model’s detection accuracy.
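Putting the preprocessing steps above together, a minimal sketch (assuming OpenCV, a BGR input frame, and placeholder ROI coordinates and target size) might look like this:

```python
# Preprocessing sketch: ROI crop, resize, color conversion, and histogram
# equalization. ROI coordinates and target size are placeholders.
import cv2
import numpy as np

def preprocess(frame, roi=(0, 200, 1920, 880), size=(640, 480)):
    x, y, w, h = roi
    crop = frame[y:y + h, x:x + w]                  # keep only the region of interest
    crop = cv2.resize(crop, size)                   # scale down to the model input size
    yuv = cv2.cvtColor(crop, cv2.COLOR_BGR2YUV)
    yuv[:, :, 0] = cv2.equalizeHist(yuv[:, :, 0])   # equalize the luma channel
    rgb = cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB)      # back to RGB for the model
    return rgb.astype(np.float32) / 255.0           # normalize to [0, 1]
```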
2.2 Deep Learning Model Inference
1. Model Loading and Initialization:
(1) Loads different deep learning models based on the target task (such as face recognition, vehicle detection, behavior analysis).
(2) These models typically use architectures based on Convolutional Neural Networks (CNN) or Transformers and are optimized for edge devices (such as model pruning, quantization, etc.).
2. Inference Process:
(1) Image frames are fed into the input layer of the neural network model, which extracts target features through multiple layers of convolution, pooling, and activation functions.
(2) The final layer generates the category, confidence, and location parameters (Bounding Box) for each target.
(3) Non-Maximum Suppression (NMS) is used to remove overlapping detection boxes, ensuring that each target is reported only once.
3. Inference Result Output:
Inference results are typically output in JSON format, including the category of each target (such as “Person”), location (such as [x_min, y_min, x_max, y_max]), and confidence (such as 0.95).
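The post-processing described above can be sketched roughly as follows, using OpenCV's NMS helper and producing the JSON structure just mentioned; the thresholds and the expected input format are assumptions.

```python
# Post-processing sketch: NMS to drop overlapping boxes, then JSON output.
# Thresholds and the expected input format are assumptions for illustration.
import json
import cv2
import numpy as np

def format_detections(boxes, scores, labels, score_thr=0.5, nms_thr=0.45):
    """boxes: [[x, y, w, h], ...]; scores and labels are parallel lists."""
    keep = cv2.dnn.NMSBoxes(boxes, scores, score_thr, nms_thr)
    results = []
    for i in np.array(keep).flatten():                 # indices surviving NMS
        x, y, w, h = boxes[i]
        results.append({
            "class": labels[i],                        # e.g. "Person"
            "bbox": [x, y, x + w, y + h],              # [x_min, y_min, x_max, y_max]
            "confidence": round(float(scores[i]), 2),  # e.g. 0.95
        })
    return json.dumps(results)
```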
2.3 Rule Engine & Event Detection
1. Rule Engine:
(1) Users can define complex rules to detect specific events (such as illegal intrusion, loitering, suspicious behavior, etc.).
(2) The rule engine performs conditional judgments on the model’s detection results; for example, if a person is detected in a restricted area for more than 5 seconds during a specified time period, an alarm is triggered.
2. Event Detection and Response:
When rules are met, the terminal generates event alarms and takes appropriate actions (such as triggering local alarms, sending email or SMS notifications).
3. Action Response:
The system can adopt action response strategies (such as starting recording, generating snapshots, sending alarm signals to security platforms, etc.).
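A minimal version of the "person in a restricted area for more than 5 seconds" rule above might look like the sketch below. It assumes an object tracker supplies a stable track_id for each person; the zone rectangle and dwell limit are placeholders.

```python
# Rule-engine sketch: alarm when a person stays inside a restricted rectangle
# for more than 5 seconds. Assumes each detection carries a stable track_id.
import time

RESTRICTED_ZONE = (100, 100, 500, 400)   # x_min, y_min, x_max, y_max (placeholder)
DWELL_LIMIT = 5.0                        # seconds
first_seen = {}                          # track_id -> time the target entered the zone

def in_zone(bbox, zone):
    cx, cy = (bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2
    return zone[0] <= cx <= zone[2] and zone[1] <= cy <= zone[3]

def check_rule(detections):
    """detections: [{"track_id": ..., "class": ..., "bbox": [...]}, ...]"""
    alarms, now = [], time.time()
    for det in detections:
        if det["class"] != "Person" or not in_zone(det["bbox"], RESTRICTED_ZONE):
            first_seen.pop(det["track_id"], None)    # left the zone: reset the timer
            continue
        entered = first_seen.setdefault(det["track_id"], now)
        if now - entered > DWELL_LIMIT:
            alarms.append({"event": "intrusion", "track_id": det["track_id"]})
    return alarms
```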
2.4 Data Storage & Communication
1. Local Data Storage:
The edge terminal saves short video clips or image snapshots of key events (such as detected intrusion behavior) locally, marked with timestamps or event IDs.
2. Data Upload and Synchronization:
Through 4G/5G or Wi-Fi networks, detection results or alarm data are uploaded to cloud servers.
3. Data Compression and Optimization:
To save bandwidth, the device performs data compression (such as JPEG compression, H.265 encoding) and uses incremental synchronization techniques (only uploading new data) to reduce data transmission overhead.
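For example, a snapshot can be JPEG-compressed before upload, and events older than the last synchronized timestamp can be skipped. The endpoint URL, JPEG quality, and bookkeeping below are placeholders, not a fixed protocol.

```python
# Upload sketch: JPEG-compress a snapshot and send only events newer than the
# last synchronized timestamp. URL, quality, and field names are placeholders.
import cv2
import requests

def upload_event(event, frame, last_synced_ts, url="https://cloud.example.com/events"):
    if event["timestamp"] <= last_synced_ts:
        return last_synced_ts                        # incremental sync: skip old events
    ok, jpeg = cv2.imencode(".jpg", frame, [cv2.IMWRITE_JPEG_QUALITY, 80])
    if not ok:
        return last_synced_ts
    requests.post(
        url,
        data={"event": event["event"], "timestamp": event["timestamp"]},
        files={"snapshot": ("snapshot.jpg", jpeg.tobytes(), "image/jpeg")},
        timeout=10,
    )
    return event["timestamp"]
```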
3. Conclusion
The Edge AI video recognition terminal is an integrated device capable of real-time processing and analysis of video data in edge network environments. Through reasonable hardware design, efficient data processing pipelines, and intelligent decision-making systems, it can complete various complex tasks without relying on the cloud. I hope this helps you better understand the working principles of edge AI video terminals.
Everyone is welcome to join the discussion.