Innovative Applications of Embedded AI Technology in IP Networks

Abstract

This article introduces a new knowledge domain inside routers: an embedded AI (EAI) hardware and software system. Following predefined strategies (algorithms), the system rapidly collects network data for specified services and trains on it continuously. Combined with online inference and intelligent decision-making, it provides real-time perception, precise control, and rapid response for emerging and critical services. This significantly improves the security, reliability, and quality of those services, helps realize high-quality IP networks, and lays a solid intelligent foundation for autonomous driving networks.

1 Background

With the construction of the “East Data, West Computing” project and the rapid development of large AI models worldwide, society’s digital transformation has entered the computing power era. Core infrastructure components such as computing, storage, and networking are incorporating more AI elements. Leading global operators are deepening cloud-network integration and gradually building computing power networks. As the technical foundation of the computing power network, IP networks must provide not only large bandwidth and capacity but also ultra-high reliability, a high degree of intelligence, and ultra-high security. How to use AI technology to upgrade and transform IP networks has become an important topic.

Currently, the industry has not launched routers with built-in AI capabilities, nor are there corresponding products, solutions, or application cases. A provincial Unicom branch has organized experts into an IP network intelligence group and set up a joint task force with Huawei to design the hardware and software systems, research requirements, and validate and implement solutions. The task force plans to launch the related functions and solutions and to test and verify them on the live network; some have already entered commercial use. This practice is significant for both production and academic theory.

2 Design and Implementation of EAI System in Routers

2.1 Problems Encountered by IP Protocols

In today’s rapidly developing digital economy and new information infrastructure, the IP network is one of the most important infrastructures. Originally designed with “openness and interconnectivity” as the primary goal, IP networks have achieved great success and have effectively met the development needs of various services. As the internet economy booms, emerging services such as short video, live streaming, 5G carrying, and computing interconnection place increasingly high demands on the quality and performance of IP networks, and urgent problems remain in carrying these services.

First, there is the problem of network status perception and rapid response. Flow detection technologies such as xFlow can sample network traffic and push it to a collector, but to limit the impact on device performance the sampling rate is generally no higher than 1:1000. Such sampling cannot accurately reflect traffic status, and the pushed data is raw and requires secondary processing and analysis, so network status perception is imprecise and slow. SDN controllers, in turn, bring a global perspective to IP networks. The controller monitors and manages the network based on traffic statistics, link state changes, and anomaly reports collected from network devices, and updates and adjusts network policies according to service demands. SDN-based perception can reach sub-second precision with its fastest mechanism, telemetry, but sustaining high-precision data requires a large volume of reporting, which occupies the network’s outbound bandwidth, and short sampling periods place a significant load on the CPU. A complete control loop takes at least a minute, which makes timely response impossible.
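A small calculation shows why 1:1000 sampling is too coarse for short flows. The sketch below assumes that each packet is sampled independently with probability 1/1000 (real xFlow implementations may sample differently), and the function name is illustrative only.

```python
def miss_probability(packets_in_flow: int, sampling_rate: float = 1 / 1000) -> float:
    """Probability that not a single packet of a flow is sampled, assuming
    every packet is sampled independently with probability `sampling_rate`."""
    return (1.0 - sampling_rate) ** packets_in_flow


for n in (100, 1_000, 10_000):
    print(f"{n:>6}-packet flow missed entirely with p = {miss_probability(n):.4f}")
```

Under this assumption a 100-packet flow goes completely unseen about 90% of the time and a 1,000-packet flow about 37% of the time, which is why short bursts are hard to catch with sampled data.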

Second, there are the refined requirements for IP network quality. With BFD, routing protocols can converge within 100 ms, but on IP backbone links with terabit-scale capacity a fault can still cause millions of packets to be lost, degrading network quality. In addition, some services, such as the low-latency interconnection of computing power clusters, have very strict latency requirements. Deploying them with existing network protocols requires BGP-LS to collect the network-wide topology, TWAMP to measure latency, and an SDN controller to compute low-latency paths and push configurations via NETCONF. These configurations are very complex and slow to adjust dynamically.
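The "millions of packets" figure can be checked with simple arithmetic. The sketch below assumes a fully loaded 1 Tbit/s link, a 100 ms outage, and 1500-byte packets; these values are illustrative assumptions, not measurements from the article.

```python
def packets_lost(link_rate_bps: float, outage_s: float, packet_bytes: int) -> float:
    """Packets dropped while a fully loaded link is unavailable."""
    return link_rate_bps * outage_s / (packet_bytes * 8)


# Assumed scenario: 1 Tbit/s link, 100 ms convergence time, 1500-byte packets.
lost = packets_lost(link_rate_bps=1e12, outage_s=0.100, packet_bytes=1500)
print(f"about {lost / 1e6:.1f} million packets lost")
```

With smaller average packet sizes the count is higher still, since the same number of bits is spread across more packets.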

Third, there is the problem of service perception in IP networks. IP has become the universal carrying technology for all kinds of services, but traditional routers cannot perceive services, and when anomalies or faults occur, troubleshooting across multiple technical domains is complex and time-consuming. DPI can provide partial service perception but is costly and difficult to deploy network-wide. Enabling the network to perceive services conveniently and economically, and to support their normal operation quickly, has therefore become an important topic.

2.2 Why Introduce EAI in Routers

Traditional IP networks can only provide best-effort service. Subsequent architecture and protocol enhancements provide a certain level of perception and assurance for services, but they are clearly inadequate for the high quality requirements of emerging services, and when anomalies occur, quality is often sacrificed to maintain service continuity.

With the vigorous development of hardware chips and AI technology, is it possible to embed AI systems in traditional IP network devices and introduce AI computing capabilities at the network element level?

First, from a hardware perspective, advances in CPU and NP chip technology mean that router boards can generally use multi-core CPUs with 16 or more cores, so the computing power of some cores can be borrowed for lightweight AI modeling, training, and inference. In addition, routers use flexibly programmable NP chips that can identify and mine flows with specific behavior. Multi-core CPUs combined with programmable NP chips therefore provide the hardware capability for lightweight embedded AI. This solution uses a general-purpose router, the NE5000E-20, from a certain manufacturer. It requires a CPU with at least 16 cores and a clock frequency of at least 2 GHz; forwarding chips (NP chips) that support flexible programming and flow mining and have strong packet caching capabilities (cache bandwidth without convergence and a cache time of at least 30 ms); and at least 16 GB of memory per board. In the future, dedicated lightweight AI chips can be embedded on router boards to provide more complete AI capabilities without significantly increasing router power consumption or cost.

Second, from a software perspective, AI algorithm research has not only pursued large-scale models to improve accuracy and generalization. Because growing model size demands so much computing power, it has also produced a new technical trend: AI model lightweighting. With techniques such as model distillation, teacher-student network learning, and reinforcement learning applied to model lightweighting, algorithm efficiency has improved, making it possible to deploy low-power AI applications on the device side more flexibly.

It is evident that a general AI framework embedded in network devices has become feasible. Based on AI algorithms, this system provides model management, data acquisition, and computing functions and sends inference results to the network’s forwarding and control planes (the local control plane or an SDN controller), enabling intelligent control and rapid forwarding of IP packets. It makes full use of the device’s sample data and computing power, with the advantages of low data transmission cost, high data security, and strong real-time inference and decision-making.

2.3 Main Design Ideas and Working Principles

An EAI component system is built on the main control board, the forwarding plane, and the secondary CPU cores and AI chip. Figure 1 shows the system architecture and main workflow.

Figure 1 EAI System Architecture

a) Main Control Board. Enables and disables the EAI capability globally.

b) Line Card CPU Secondary Core. This is the main component of EAI. It identifies, classifies, and ranks flows based on EAI algorithms, interfaces with the line card NP, collects flow data, and outputs flow feature information to the network management system or a third-party platform or software.

c) Line Card NP. Responsible for flow feature statistics and reporting in the forwarding plane.

The main workflow of the system is as follows.

a) Global control of the EAI capability. A human-machine or machine-machine interface is provided so that the EAI capability can be enabled or disabled manually or through network management configuration. The forwarding plane CPU is notified, and it in turn instructs the forwarding plane NP to start the EAI function.

b) After the forwarding plane starts the EAI function, it performs real-time flow feature statistics and reporting.

c) When a packet does not hit the flow table in the forwarding plane, the system reports the first packet and establishes the flow. Once the flow is established, the forwarding plane collects full-feature statistics on it and reports them to the line card CPU secondary core. The forwarding chip ranks all flows in real time, identifies the top-N traffic, collects it in real time, and reports it to the CPU secondary core or AI chip. There, the EAI component uses the IFC intelligent flow identification algorithm to learn and model the flows and builds a flow model library.

d) Lightweight AI models are deployed to perform quasi-real-time inference on the micro-level features collected from the forwarding plane. Depending on the application scenario, supervised classification or regression models or unsupervised clustering models can be selected, and domain expert knowledge can be integrated as the task requires to improve recognition. For scenarios with relatively stable data and hard-to-obtain labels, offline training can be used. For scenarios where labeled data can be obtained through self-supervision or other means, the model can be incrementally updated or trained online as needed to reduce the precision loss caused by concept drift. AI inference maintains new or updated traffic profiles, automatically extracts key information from massive traffic in a data-driven way, and outputs the information that matches predefined strategies or features to the control plane, which can be either the local control plane or an external third-party control system.
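The flow establishment and top-N ranking in steps b) and c) can be pictured with a small software model. This is only a sketch: on the router this work is done in the NP forwarding chip, and the five-tuple key, the class names, and ranking by byte count are assumptions made for illustration.

```python
from dataclasses import dataclass
from heapq import nlargest
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int


@dataclass
class FlowStats:
    packets: int = 0
    byte_count: int = 0


class FlowTable:
    """Hypothetical software stand-in for the NP flow table."""

    def __init__(self) -> None:
        self.flows: dict[FiveTuple, FlowStats] = {}
        self.first_packet_reports: list[FiveTuple] = []

    def on_packet(self, key: FiveTuple, length: int) -> None:
        stats = self.flows.get(key)
        if stats is None:
            # Flow-table miss: report the first packet and establish the flow.
            self.first_packet_reports.append(key)
            stats = self.flows[key] = FlowStats()
        # Per-packet (1:1) statistics for the established flow.
        stats.packets += 1
        stats.byte_count += length

    def top_n(self, n: int) -> list[tuple[FiveTuple, FlowStats]]:
        """Flows with the most bytes; these would be reported to the CPU secondary core."""
        return nlargest(n, self.flows.items(), key=lambda item: item[1].byte_count)


table = FlowTable()
web = FiveTuple("10.0.0.1", "10.0.0.9", 40000, 443, 6)
dns = FiveTuple("10.0.0.2", "10.0.0.9", 40001, 53, 17)
for _ in range(3):
    table.on_packet(web, 1500)
table.on_packet(dns, 80)
top = table.top_n(1)
print(top[0][0], top[0][1])
```

Only the top-N flows are passed on for learning, which keeps the workload within what the borrowed CPU cores can handle.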

Through an integrated software and hardware design, the embedded AI system performs per-packet detection on specified links or flows (selected by five-tuple or other strategies), achieving monitoring at the millisecond or even microsecond level. Normal service models are established through machine learning, and when faults or anomalies occur, EAI acts as a bridge between the forwarding plane and the control plane and raises alerts quickly. When the trigger (policy) conditions are met, EAI can interact with the control plane to respond rapidly to service flows. It can notify the device’s forwarding plane to forward packets over a backup route, reducing packet loss; notify the device’s control plane to achieve rapid route convergence; or notify the centralized control plane to apply network-wide traffic policy control.

Of course, because the computing power of the secondary CPU cores is limited, EAI cannot monitor all service flows. Even with limited computing power, however, it can meet urgent in-network computing needs, in particular by significantly improving the security and reliability of IP networks for critical services, thereby improving the network’s ability to perceive and assure services.

2.4 Features and Advantages

Compared with traditional flow sampling analysis, the EAI system detects and analyzes every packet, which enables accurate modeling, real-time analysis and inference, and intelligent decision-making. The EAI system can therefore detect faults or anomalies quickly and accurately and give rapid feedback to the control layer, optimizing IP network performance both locally and globally.

Based on the EAI framework, corresponding AI network models and application modules are developed to achieve rapid policy response at local nodes and intelligence at the network element level. The following sections describe the ideas and practices for optimizing IP networks with the EAI system.

3 Application Ideas and Practices of EAI System in IP Networks

3.1 Enhancing the Security of IP Networks

Traditional DDoS attack detection relies on sampling and generally takes minutes, so it cannot identify in time the increasingly common “short and fast” attacks, whose traffic often peaks within 10 seconds. With the IFC intelligent flow identification algorithm, DDoS attack traffic is learned and modeled, a DDoS attack model library is built, and the key information of identified attacks is output to the security network management system. This reduces identification time from minutes to seconds and defends effectively against “short and fast” attacks that traditional solutions cannot handle.

According to data from China Unicom’s cloud shield DDoS attack monitoring platform over the past two years, DDoS attacks show new characteristics. First, attacks ramp up faster: large-scale attacks now accelerate on a scale of seconds, and the time needed for attack traffic to reach a peak of 800 Gbit/s to 1 Tbit/s shrank from 50 seconds in 2018 to 10 seconds in 2022. Second, attacks are getting shorter: 57% of attacks in 2022 lasted less than 5 minutes, further challenging the response speed of defense systems. Minute-level attack detection and mitigation therefore cannot meet protection needs, and current solutions based on flow analysis, detection, and traffic diversion for cleaning have become the time bottleneck of the defense response.

The intelligent flow detection solution is as follows.

a) In the device’s data plane, collect flow-level statistics at 1:1 (per-packet) sampling with millisecond-level precision, covering micro-flow characteristics such as flow rate, packet length, and the rate of packets carrying specific message types. These high-frequency micro-flow characteristics support large-scale, second-level monitoring and perception of flow state (each board supports high-speed monitoring of 32K to 64K IPv4 addresses and 16K to 32K IPv6 addresses). Figure 2 compares the packet sampling schemes.

Figure 2 Comparison of Packet Sampling Schemes

b) Use the EAI computing power on the device’s control plane to analyze micro-flow characteristics in real time at millisecond-level precision. Because different types of destination addresses have different traffic models, EAI learns and maintains normal service traffic model parameters at per-IP granularity. It combines multi-dimensional features to detect anomalies such as sudden increases in flow rate and changes in packet composition and aggregate packet length, filters out the occasional micro-bursts of normal traffic, and introduces a time-based forgetting mechanism to handle baseline drift in service traffic. Anomaly identification accuracy exceeds 95%, and alerts can be raised for typical DDoS attack symptoms such as sudden increases in fragmented packets and TCP SYN packets. Figure 3 compares the attack determination thresholds.

Figure 3 Comparison of Attack Determination Thresholds

c) Compatible with existing DDoS cleaning systems. Flow-level DDoS attack alerts are reported to designated devices in the network using the NetStream V9 general template, which makes it easy to adapt to existing flow analysis devices and cleaning resources. The reports include flow rate and other enhanced statistics, so downstream devices can obtain flow information quickly, closing an end-to-end disposal loop within seconds and bringing cleaning feedback close to real time.

d) A provincial Unicom branch has validated the feasibility of this intelligent second-level DDoS defense technology (“flash defense”). Compared with traditional DDoS attack detection, “flash defense” is faster and more accurate. Traditional detection has low sensitivity and achieved defense only 61 seconds after the attack began, prolonging service damage, whereas “flash defense” detected the attack in 2 seconds and completed traffic cleaning in 5 seconds, keeping services stable.
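The per-IP baseline described in step b) can be sketched as follows. This is not the IFC algorithm itself, whose details the article does not give. It is a minimal illustration that assumes an exponentially weighted baseline as the forgetting mechanism, a z-score test per feature, agreement of at least two features, and persistence over several windows to filter micro-bursts. All names and threshold values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class EwmaBaseline:
    """Exponentially weighted mean and variance of one feature for one destination IP."""
    alpha: float = 0.05        # forgetting factor (assumed): larger forgets old traffic faster
    min_rel_std: float = 0.10  # floor on the deviation scale, relative to the mean
    mean: float = 0.0
    var: float = 0.0
    initialized: bool = False

    def zscore(self, x: float) -> float:
        if not self.initialized:
            return 0.0
        scale = max(math.sqrt(self.var), self.min_rel_std * abs(self.mean), 1e-9)
        return (x - self.mean) / scale

    def update(self, x: float) -> None:
        if not self.initialized:
            self.mean, self.initialized = x, True
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)


class DestinationDetector:
    """Flags an attack when several features rise together for several windows."""

    def __init__(self, features, z_threshold=4.0, min_features=2, persistence=3):
        self.baselines = {name: EwmaBaseline() for name in features}
        self.z_threshold = z_threshold
        self.min_features = min_features
        self.persistence = persistence
        self.streak = 0

    def observe(self, sample: dict) -> bool:
        # Only sudden increases are checked in this sketch.
        deviating = {
            name for name, base in self.baselines.items()
            if base.zscore(sample[name]) > self.z_threshold
        }
        if len(deviating) >= self.min_features:
            self.streak += 1   # multi-feature anomaly in this window
        else:
            self.streak = 0    # isolated spike or normal window
        for name, base in self.baselines.items():
            if name not in deviating:
                base.update(sample[name])  # learn only from normal-looking values
        return self.streak >= self.persistence


detector = DestinationDetector(["pps", "syn_ratio"])
for i in range(50):  # normal traffic for one destination IP
    detector.observe({"pps": 1000 + 20 * (i % 2), "syn_ratio": 0.02})
alerts = [detector.observe({"pps": 50_000, "syn_ratio": 0.9}) for _ in range(3)]
print(alerts)
```

In this sketch a single abnormal window never raises an alert, and anomalous values are kept out of the baseline so that an attack does not teach the model that attack traffic is normal.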

3.2 Enhancing the Reliability of IP Networks

Fiber failures readily lead to network incidents and complaints; for one operator, fiber cuts accounted for 61.5% of network incident complaints in 2021. When a fiber is cut, traditional routing protocol convergence takes on the order of minutes, and even with BFD convergence still takes 100 ms, during which a large number of packets can be lost on high-speed links, affecting service experience.

The EAI system uses an innovative bandwidth pooling algorithm to self-detect, self-perceive, and self-adjust data flows in three dimensions: network, network element, and link. The ARK mechanism (real-time perception, module isolation, automatic recovery) monitors in real time the system resources, such as CPU and memory, consumed by each module. It detects abnormal protocol processing modules and isolates them immediately so that other modules and the services of downstream devices are not affected. Once the CPU and memory usage of the isolated module returns to normal, the system automatically lifts the isolation, ensuring zero data loss.
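The perceive, isolate, and recover cycle of the ARK mechanism can be illustrated as a two-state machine per module. The sketch below is an assumption-laden illustration: the threshold values, the use of a lower recovery threshold to avoid flapping, and all names are hypothetical rather than taken from the product.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    NORMAL = "normal"
    ISOLATED = "isolated"


@dataclass
class ModuleGuard:
    name: str
    isolate_above: float = 0.80  # assumed utilization that triggers isolation
    recover_below: float = 0.60  # assumed utilization that counts as "back to normal"
    state: State = State.NORMAL

    def on_sample(self, cpu: float, mem: float) -> State:
        """Feed one real-time resource sample (utilization as a fraction of 1.0)."""
        if self.state is State.NORMAL:
            if cpu > self.isolate_above or mem > self.isolate_above:
                self.state = State.ISOLATED   # isolate the abnormal module
        elif cpu <= self.recover_below and mem <= self.recover_below:
            self.state = State.NORMAL         # automatically lift the isolation
        return self.state


guard = ModuleGuard("bgp")
samples = [(0.30, 0.40), (0.95, 0.40), (0.70, 0.40), (0.50, 0.40)]
states = [guard.on_sample(cpu, mem) for cpu, mem in samples]
print([s.value for s in states])
```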

When a fiber fault occurs, the system automatically perceives the link failure and optimizes load distribution, relieving congestion within milliseconds and allowing the network to withstand a 10-fold BGP resource overload. Network incidents and complaints caused by fiber faults are expected to decrease by more than 80%, keeping services always online in the event of any link failure. A major network incident that occurred overseas in 2022 because of BGP could largely have been avoided with the EAI+ARK mechanism, as illustrated in Figure 4.

Figure 4 Rapid Detection and Protection by EAI in a Simulated Live-Network Load Distribution Scenario

EAI-based microsecond-level automatic traffic switching provides rapid fault perception, quick recovery, and rapid notification within the device, reducing fault convergence time from over 100 ms to below 100 μs and significantly reducing packet loss. The technology perceives only changes in traffic on the IP network device and does not need to perceive changes in the physical state of the fiber link. It switches only service traffic and does not affect the control plane, so the control plane does not switch along with the service flow. The technology also supports a penalty mechanism: if switching occurs frequently within a certain period, the penalty is activated and fast switching is suppressed for a time to avoid repeated service switchovers.
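The penalty mechanism resembles the damping logic used elsewhere in networking to stop flapping. A minimal sketch is shown below; the limit of three switches per 10-second window and the 30-second penalty are assumed values, and the class is hypothetical.

```python
from collections import deque


class SwitchDamper:
    """Suppresses fast switching after too many switchovers in a short period."""

    def __init__(self, max_switches: int = 3, window_s: float = 10.0,
                 penalty_s: float = 30.0) -> None:
        self.max_switches = max_switches
        self.window_s = window_s
        self.penalty_s = penalty_s
        self._history: deque = deque()       # timestamps of recent switchovers
        self._penalty_until = float("-inf")

    def allow_switch(self, now: float) -> bool:
        if now < self._penalty_until:
            return False                     # still serving the penalty
        # Forget switchovers that fell out of the observation window.
        while self._history and now - self._history[0] > self.window_s:
            self._history.popleft()
        if len(self._history) >= self.max_switches:
            # Too many recent switchovers: start the penalty period.
            self._penalty_until = now + self.penalty_s
            self._history.clear()
            return False
        self._history.append(now)
        return True


damper = SwitchDamper()
results = [damper.allow_switch(t) for t in (0, 1, 2, 3, 20, 34)]
print(results)
```

While a switch is suppressed, traffic would simply fall back to normal control-plane convergence, so the penalty trades speed for stability on a flapping link.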

A test environment was built in the laboratory using routers, OTN transmission devices, and test instruments.

Without EAI enabled on the network devices, a simulated fiber fault between the OTN devices resulted in a maximum packet loss time of 18,634.11 μs. This delay arises because the CPU must perceive the port going down and send an interrupt to the control plane, which then notifies the forwarding plane to switch.

With EAI enabled on the network devices, the same simulated fiber fault resulted in a maximum packet loss time of 76.4 μs. This is the time the forwarding plane needs to switch traffic to the working link, which is very short; without EAI, the control plane relies on port interrupt reports to converge, which takes much longer.

In these tests, EAI reduced the switching time caused by a fiber fault from about 18.6 ms to 76.4 μs, a reduction of more than two orders of magnitude, significantly reducing packet loss.

Latency, packet loss, and jitter are the three most important network parameters, and competition among operators’ backbone IP networks currently focuses on latency. Because the network layer does not perceive packet loss and has no countermeasures, packet loss is handled mainly by the transport or application layer, with retransmission as the primary strategy, which inevitably degrades user experience during the retransmission period. The significance of this solution is that deploying it in backbone networks can greatly reduce the probability of hidden packet loss (loss masked by network protocol fault tolerance) and noticeably improve application experience, especially for loss-sensitive services, which is crucial for a truly high-quality IP network.

3.3 Enhancing the Intelligent Perception and Fault Localization Capabilities of IP for Carrying Services

Traditional IP networks cannot perceive the services they carry and lack visualization and fault localization capabilities for critical service flows. 4G/5G mobile services are the most important services on the IP carrying network, yet when problems arise at the service layer, the carrying network can do little. For mobile customers, end-to-end service spans wireless access, transmission, the IP carrying network, the core network, the internet, and cloud pools, and a problem in any segment can degrade or interrupt service. Because so many technical segments are involved, troubleshooting mobile services is complex and inefficient. Past large-scale mobile service failures have borne this out, and how to detect and localize faults quickly has become a major topic.

The carrying network for mobile services typically uses L3 VPN over SR/SRv6. Deploying the EAI system on the P-nodes of the mobile core network allows 5GC signaling to be perceived and identified automatically by AI, which assists rapid fault localization and demarcation (boundary definition), as shown in Figure 5.

Figure 5 EAI-Assisted Localization and Boundary Definition in Mobile Carrying Networks

The main workflow of the solution is as follows.

a) 5GC signaling flow identification. Analyze the important protocols of the 5GC signaling plane, such as Diameter, SS7, SOAP, RESTful, and Nx signaling, and identify and detect traffic within the signaling VPN of the carrying network based on protocol numbers and port numbers.

b) Historical baseline learning. For important signaling flows, learn historical baselines, model historical traffic, establish traffic anomaly thresholds, and analyze communication matrix behavior. Flows can be keyed on source and destination IP. The reported information carries the protocol number and port number, the source and destination IP, and the information source: ingress (area PE) and egress (area PE).

c) Signaling plane anomaly alerting. Based on the learned baselines, perceive the normal tidal variation of traffic for each protocol and each IP, quickly detect abnormal traffic changes, and raise alerts for anomalies in one or more signaling IP flows, such as sudden increases or decreases in the traffic of multiple IPs or abrupt changes in the IP communication matrix. This provides network fault perception and localization. Changes in the IP communication matrix reveal flows going up or down collectively and help identify abnormal network elements.

d) Rapid localization and boundary definition. When an anomaly occurs, an alert is reported to the SDN controller, which initiates IFIT measurement and telemetry reporting for the abnormal IP to determine the anomaly point and assist rapid localization. Currently, the 5GC carrying network is divided into two segments: the backbone network and the local network. Given device capabilities, this solution is planned for deployment on the P-nodes of the backbone network to monitor the 5GC signaling traffic carried by L3 VPN/L3 EVPN, which allows faults to be localized to a backbone PE (city level). Because the backbone and local network VPNs interconnect using Option A, EAI flow detection on the P-node cannot localize devices inside the metropolitan area network. The controller within the metropolitan area, however, can combine the mobile IP address and SMF address allocation information to locate the first hop of the carrying network upstream of the base station, and then initiate IFIT (in-situ flow information telemetry) measurement and telemetry reporting on that link for detailed fault localization (see Figure 6).

If the signaling is fully carried using SRv6, fault localization will be further simplified.

Figure 6 Anomaly Localization within the Metropolitan Area Network
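The baseline learning and communication matrix checks in steps b) and c) can be sketched in a few lines. This is an illustration under stated assumptions rather than the deployed module: the hour-of-day slots, the 50% tolerance, the smoothing factor, and every name below are hypothetical.

```python
from collections import Counter


class TidalBaseline:
    """Learns an expected rate per (flow key, hour of day) and flags large deviations."""

    def __init__(self, alpha: float = 0.2, tolerance: float = 0.5) -> None:
        self.alpha = alpha          # smoothing factor for baseline learning (assumed)
        self.tolerance = tolerance  # allowed relative deviation from baseline (assumed)
        self.expected: dict = {}

    def check_and_learn(self, key, hour: int, rate: float) -> bool:
        slot = (key, hour)
        expected = self.expected.get(slot)
        if expected is None:
            self.expected[slot] = rate       # first observation seeds the baseline
            return False
        abnormal = abs(rate - expected) > self.tolerance * expected
        if not abnormal:
            self.expected[slot] = expected + self.alpha * (rate - expected)
        return abnormal


def suspect_elements(baseline_pairs: set, current_pairs: set, min_lost: int = 2) -> list:
    """IPs that appear in at least `min_lost` vanished (source, destination) pairs."""
    lost = baseline_pairs - current_pairs
    counts = Counter(ip for pair in lost for ip in pair)
    return [ip for ip, n in counts.most_common() if n >= min_lost]


tides = TidalBaseline()
flow = ("10.1.0.1", "10.9.0.1")
flags = [tides.check_and_learn(flow, 9, r) for r in (1000.0, 1100.0, 100.0)]

baseline = {("10.1.0.1", "10.9.0.1"), ("10.1.0.2", "10.9.0.1"),
            ("10.1.0.3", "10.9.0.1"), ("10.1.0.1", "10.9.0.2")}
current = {("10.1.0.1", "10.9.0.2")}
suspects = suspect_elements(baseline, current)
print(flags, suspects)
```

When every flow toward one address disappears at once, that address stands out in the matrix difference, which corresponds to the collective on-off observation described in step c).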

In summary, the EAI system gives the carrying network a mobile service profiling module that identifies 5G signaling flows, profiles mobile services, and learns service models through AI. When anomalies or faults occur, it reports them to the SDN controller, and with the cooperation of the relevant protocols the link or network element where the fault originates can be determined quickly. This provides an important means of assuring the stable and reliable operation of mobile services and is of considerable practical significance.

3.4 Summary

Through the in-network computing capabilities of the EAI system, IP networks become smarter without added network cost or complexity. They can discover changes and anomalies in the services they carry in time, serve business lines and operations better, and provide important monitoring and service tools for future dedicated lines for major customers, computing power dedicated lines, and video services. This significantly improves the security, reliability, and quality of the network and gives customers and services a better experience.

4 Application Prospects of EAI System in IP Networks

The applications that can be developed on the EAI system extend beyond those described in this article. The EAI system gives line cards built-in computing power and access to an endless supply of network data, so (algorithm) modules can be developed for network operations, control, and customer-specific network needs, giving traditional IP networks wings of intelligence. The EAI system also offers inspiration and room for exploration regarding the evolution of router architecture and the completeness of the IP control plane.

Future research needs to go deeper: gradually introducing dedicated AI chips to increase computing power, and upgrading centralized control platforms so that SDN+AI provides rapid and precise network-wide control of IP networks and services. This would realize network-level AI and lay a solid intelligent foundation for ultimately achieving autonomous driving IP networks.

Author Introduction

Xue Qiang, graduated from Sun Yat-sen University, senior engineer, PhD, mainly engaged in planning and construction of carrying networks, cloud pools, etc.;

Wu Meng, graduated from Peking University, engineer, PhD, mainly engaged in real-time status analysis and anomaly detection related R&D of network business traffic;

Yang Shibiao, network designer, mainly engaged in the maintenance and network security of IP networks;

Tu Libiao, graduated from Beijing University of Posts and Telecommunications, bachelor, mainly engaged in planning and construction of IP metropolitan networks and intelligent metropolitan networks;

Li Wei, graduated from Wuhan University of Technology, engineer, bachelor, mainly engaged in planning and design of backbone network solutions;

Liao Jiang, senior engineer, mainly engaged in the overall work of a provincial Unicom network BG.


Toutiao Number | Postal Design Technology

Official Website | http://ydsjjs.paperopen.com

Editor | Li Xingchu Review | Yuan Jiang
