Real-Time Industrial Network Architecture Supporting Control as a Service

This article proposes a real-time industrial network architecture that supports Control as a Service (CaaS), decoupling control tasks from dedicated controllers and virtualizing control functions. Evaluations on a physical test platform indicate that CaaS is feasible and effective, achieving zero packet loss with lower latency and jitter than the baselines.

Keywords: Industrial Internet, Flexible Manufacturing, Time-Sensitive Networking

Background

Computer network technology is currently undergoing a critical transition from the consumer internet to the industrial internet. In this transition, certain industrial scenarios require the use of open, standardized computer network protocols to replace existing closed, proprietary industrial network control protocols. This is not only an ideal but also an inherent demand arising from the development of industrial production.

In an era of relative material scarcity, manufacturing primarily addressed the issue of “availability”: production processes were standardized, the step-by-step transformation of raw materials into finished products was fixed through process control methods, and the resulting programs were loaded into equipment for mass production. The downside of this “output is king” model is that the products are uniform. Manufacturing different products requires either a separate production line for each product or sharing a single production line at different times.

As shown in Figure 1, as material living standards continue to improve, production orders are increasingly customized, multi-category, and small-batch. At the same time, the digitalization and networking rates of industrial equipment are rising rapidly, and the computing capabilities of industrial equipment have improved significantly, allowing for more complex computational tasks. These changes have given rise to a new production concept called flexible manufacturing. Flexible manufacturing requires that each product can be customized while also demanding high productivity. This poses new challenges for current industrial control systems.

Figure 1: Trends in Industrial Internet Development

New Challenges Brought by Flexible Manufacturing

Flexible manufacturing leads to frequent switching of production lines, including the transfer, addition, and upgrading of control tasks. Transfer of control tasks refers to the movement of control tasks from one controller in the network to another; addition of control tasks refers to new control tasks introduced by new equipment or sensors; upgrading of control tasks refers to the transition from simple control tasks (described by ladder diagrams in Programmable Logic Controller (PLC) control algorithms) to complex control tasks (such as vision-based industrial defect inspection).

Taking automotive glass production as an example, we investigated the problems manufacturers face in customized production and their root causes. Automotive glass orders are currently characterized by wide variety and small batches. An order can contain anywhere from a few pieces to tens of thousands, and small-batch orders make up a significant proportion. To produce different types of glass, production lines must switch frequently. Statistics show that each switch requires 10 minutes to change molds and 40 minutes for configuration and trial production, and that a production line averages 6 to 8 switches per day. Assuming the production line runs 24 hours a day, capacity loss due to switching can reach 24%. In addition, as automotive products are upgraded and automakers’ demands increase, the glass production line has recently been extended in three stages. The preliminary processing stage gained three devices, three tasks (surface inspection, edge inspection, dimension inspection), and three types of computation (numerical computation, computer vision, Proportional-Integral-Derivative (PID) control). The printing stage gained one device, one task (printing inspection), and one type of computation (computer vision). The quality control stage gained two devices, two tasks (curvature inspection, appearance inspection), and two types of computation (numerical computation, computer vision).
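The capacity-loss figure can be reproduced from the numbers quoted above. The sketch below assumes that both the mold change and the configuration and trial production count entirely as lost time, which the text implies but does not state; the function and constant names are illustrative.

```python
# Capacity lost to line switching, using the figures quoted in the text.
MOLD_CHANGE_MIN = 10        # mold change per switch
CONFIG_TRIAL_MIN = 40       # configuration and trial production per switch
MINUTES_PER_DAY = 24 * 60   # line assumed to run around the clock


def capacity_loss(switches_per_day: int) -> float:
    """Fraction of daily production time consumed by line switches."""
    downtime = switches_per_day * (MOLD_CHANGE_MIN + CONFIG_TRIAL_MIN)
    return downtime / MINUTES_PER_DAY


for n in (6, 7, 8):
    print(f"{n} switches/day -> {capacity_loss(n):.1%} of capacity lost")
```

Under this assumption, 6 to 8 switches per day cost roughly 21% to 28% of daily capacity, and the quoted 24% corresponds to about seven switches per day.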

To address the changes in production modes mentioned above, production enterprises continuously refine their internal capabilities, work overtime, and improve management to compensate for the losses incurred during production line switches. However, overall, these efforts still fall short of the vision of flexible manufacturing. Specifically, traditional industrial control systems do not meet the requirements of flexible production, primarily due to three major issues: (1) Complicated Configuration Process: Switching production lines requires stopping operations and reconfiguring the physical connections between the PLC and the equipment, leading to decreased production efficiency; (2) Limited Scalability: The physical interface limitations of PLCs restrict the number of devices that can be added to the production line, making it difficult to meet manufacturers’ upgrade demands; (3) High Upgrade Costs: As the complexity of control tasks increases, PLCs with insufficient computing power need to be updated, resulting in additional downtime costs. The root cause of these issues lies in the deep binding between control tasks and the controllers executing them, hindering the flexible scheduling of control tasks in the network.

Controller Virtualization

In recent years, some research efforts have attempted to replace hardware PLCs in industry with virtual PLCs. For instance, Givehchi et al. established virtual PLCs in cloud servers to enhance the agility of industrial automation systems. Hegazy et al. also deployed control tasks to cloud servers and proposed an adaptive delay compensator and distributed fault-tolerant methods to improve timeliness and reliability. However, due to network delays and jitter in cloud servers, even in private clouds, the determinism of virtual PLCs is limited, and they can only meet the requirements of soft real-time applications. Therefore, critical control tasks still rely on hardware PLCs.

To enhance network determinism and reduce network latency, the IEEE 802.1 working group is advancing the development and standardization of Time-Sensitive Networking (TSN) technology. TSN is a collection of IEEE standards designed based on time synchronization, incorporating various traffic shaping methods to ultimately achieve low-latency, deterministic data communication services. Additionally, to further increase reliability, TSN provides multiple technologies, including frame replication and elimination, to enhance redundancy. TSN also supports unified configuration protocols for standardized network management. However, IEEE TSN is a local area network technology developed based on Ethernet protocols and cannot achieve real-time and reliable networking and control systems on a larger scale.

Given the limitations of cloud-based PLCs and TSN, we take a different path: we move controller virtualization from cloud servers down into the network, so that control functions can be deployed flexibly on network switches and become a service inherent to the network.

CaaS Network Architecture

Overall Architecture

We propose Control as a Service (CaaS), a technology that decouples control tasks from dedicated controllers, collaboratively schedules control tasks and network traffic, and flexibly deploys control tasks to any switch in the network. As shown in Figure 2, CaaS virtualizes the entire industrial control network into a generic controller, achieving virtualization of control functions and supporting flexible switching and upgrading of production lines.

Figure 2: Control as a Service Network Architecture

Specifically, the real-time industrial network architecture supporting Control as a Service is primarily divided into four layers from bottom to top:

1. Network terminal device plane, which includes a large number of sensors and actuators.

2. Computing forwarding plane, responsible for completing both computing and forwarding tasks. In traditional networks, switches are only responsible for forwarding tasks, while the new CaaS switches can also perform computing tasks—as virtual controllers—to complete industrial control tasks.

3. Control plane, which centers on the “task-network collaborative scheduler”. It coordinates control task allocation (when and on which switch each control task is computed) with data traffic transmission (when and via which path each data flow is transmitted).

4. Application plane, which includes various industrial control tasks. As a whole, the CaaS network schedules control tasks to be executed on any switch in the network based on the specific needs of industrial control tasks and network capabilities, and sends the computation results (i.e., control commands) in real time to the execution devices.

The core of CaaS is the computing forwarding plane and the control plane. The core of the computing forwarding plane is the CaaS switch, responsible for executing control tasks and forwarding tasks. The core of the control plane is the task-network collaborative scheduler, which uniformly coordinates and schedules control tasks and data forwarding.

Computing Forwarding Plane—CaaS Switch

The CaaS switch employs a collaborative design of hardware and software, utilizing programmable logic (PL) in the hardware part for high-speed transmission and precise timing; in the processing system (PS) part, software is used to implement complex computing tasks, including PLC runtime, configuration clients, and time synchronization. The design challenge of the CaaS switch is to ensure end-to-end determinism of data from input devices (such as sensors) to output devices (such as actuators). As shown in Figure 3, we subdivide this determinism into three aspects: network transmission determinism, hardware-software data exchange determinism, and computation determinism.

Figure 3: CaaS Switch Design

We use Time-Sensitive Networking to ensure network transmission determinism, design a dual Direct Memory Access (DMA) mechanism to ensure hardware-software data exchange determinism, and employ core isolation and global time-aware task computation to ensure computation determinism. Among these, the dual DMA mechanism transmits CaaS metadata and time-sensitive control data through different channels, avoiding interference from non-time-sensitive data during critical data exchanges between hardware and software. The core isolation mechanism isolates the configuration client, time synchronization, other switch functions, and PLC processes, ensuring determinism in task execution duration through dedicated CPU cores. The global time-aware task computation obtains high-precision global synchronized time from the real-time clock (RTC) module of the PL, controlling the start time of tasks according to specified times in each cycle, avoiding the impact of unsynchronized CPU times between different devices.
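The idea behind global time-aware task computation can be illustrated with a small calculation: every device derives the task start instant from the shared synchronized clock rather than from its own CPU time. This is a minimal sketch, not the Ziggo implementation. The function name, the cycle length, and the offset are assumptions, and the synchronized time is passed in as an argument instead of being read from the PL real-time clock.

```python
def next_release_ns(now_ns: int, cycle_ns: int, offset_ns: int) -> int:
    """Earliest instant >= now_ns of the form k * cycle_ns + offset_ns."""
    if not 0 <= offset_ns < cycle_ns:
        raise ValueError("offset must lie inside one cycle")
    k = -(-(now_ns - offset_ns) // cycle_ns)  # ceiling division on integers
    return k * cycle_ns + offset_ns


CYCLE_NS = 1_000_000   # 1 ms control cycle (hypothetical value)
OFFSET_NS = 300_000    # start offset assigned by the scheduler (hypothetical value)

# Two switches sample the shared clock at slightly different instants
# within the same cycle, yet both arrive at the same release time.
print(next_release_ns(5_000_100_000, CYCLE_NS, OFFSET_NS))  # 5000300000
print(next_release_ns(5_000_250_000, CYCLE_NS, OFFSET_NS))  # 5000300000
```

Because the release instant depends only on the synchronized time and the scheduled offset, drift between local CPU clocks does not shift when a task starts.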

Control Plane—CaaS Task and Traffic Scheduling

Task-Network Collaborative Scheduler

In the control plane, the task-network collaborative scheduler needs to reasonably schedule network traffic and tasks to ensure the determinism of data transmission and computation.

CaaS’s collaborative scheduling differs from TSN’s traffic scheduling. TSN only schedules network traffic, while CaaS collaborates to schedule control tasks and network traffic, making the scheduling objectives more complex and involving more scheduling parameters.

The scheduling algorithm must specify the switch on which each computing task executes, the start and end times of each task execution, and the time at which each data flow traverses each network link. We model the collaborative scheduling of tasks and traffic as a Satisfiability Modulo Theories (SMT) problem and solve it with SMT solvers.

Figure 4 provides an example of task scheduling, where there are two computing tasks (Task 1 & Task 2). Task 1 uses D1 as the input device and D4 and D5 as output devices. The right side shows the specific scheduling results for Task 1, which is assigned to execute on switch SW2. The input data flow for Task 1, which is from D1 to SW2, passes through the links (D1, SW1) and (SW1, SW2). The time slots reserved for this data flow on these two links are indicated in Figure 4.

Figure 4: Task-Network Collaborative Scheduling
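The kind of constraints a joint task and traffic schedule must satisfy can be sketched as a feasibility check. This is a minimal illustration, not the authors' SMT encoding. The `Slot` and `TaskPlan` types, the link names on the output paths, and all slot times are assumptions chosen for the example. Only the placement of Task 1 on SW2 and its input path through (D1, SW1) and (SW1, SW2) come from the text.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Slot:
    resource: str  # a directed link such as "D1->SW1", or a switch CPU such as "SW2"
    start: int     # microseconds from the start of the cycle
    end: int


@dataclass(frozen=True)
class TaskPlan:
    inputs: tuple   # input flows; each flow is a non-empty tuple of Slots in path order
    compute: Slot   # execution window on the chosen switch
    outputs: tuple  # output flows, same layout as inputs


def hops_in_order(flow) -> bool:
    """Each hop may start only after the previous hop has finished."""
    return all(a.end <= b.start for a, b in zip(flow, flow[1:]))


def feasible(plans, cycle: int) -> bool:
    slots = []
    for p in plans:
        for flow in p.inputs:    # inputs must arrive before the task starts
            if not hops_in_order(flow) or flow[-1].end > p.compute.start:
                return False
        for flow in p.outputs:   # outputs may leave only after the task ends
            if not hops_in_order(flow) or flow[0].start < p.compute.end:
                return False
        slots += [s for flow in p.inputs + p.outputs for s in flow] + [p.compute]
    if any(not 0 <= s.start < s.end <= cycle for s in slots):
        return False
    # No two reservations may overlap on the same link or switch.
    return not any(
        a.resource == b.resource and a.start < b.end and b.start < a.end
        for a, b in combinations(slots, 2)
    )


# Task 1 on SW2 with hypothetical slot times and output links.
task1 = TaskPlan(
    inputs=((Slot("D1->SW1", 0, 10), Slot("SW1->SW2", 10, 20)),),
    compute=Slot("SW2", 20, 60),
    outputs=((Slot("SW2->D4", 60, 70),), (Slot("SW2->D5", 70, 80),)),
)
# A hypothetical second task whose execution window collides with Task 1 on SW2.
clash = TaskPlan(
    inputs=((Slot("D2->SW2", 0, 10),),),
    compute=Slot("SW2", 40, 90),
    outputs=((Slot("SW2->D6", 90, 100),),),
)
print(feasible([task1], cycle=1000))         # True
print(feasible([task1, clash], cycle=1000))  # False
```

An SMT formulation would treat the slot times and the switch assignment as unknowns and let the solver search for values that satisfy constraints of this kind, whereas the check above only validates a schedule that is already given.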

Traffic Scheduling Based on Deep Reinforcement Learning

A fast and efficient global scheduling algorithm can enhance CaaS’s capability to support critical traffic and tasks. Although TSN/CaaS collaborative scheduling can be formulated as an SMT problem, this problem is NP-hard, and general solvers such as Z3 and Gurobi can only handle small-scale data flow scheduling problems. When the number of data flows exceeds 1,000, general solvers may take hours or even days, which is unacceptable for industrial scenarios that prioritize efficiency. Some research has proposed heuristic scheduling schemes that use expert knowledge to find feasible solutions to TSN scheduling problems within a fixed time budget. While these methods speed up solving, they rely on many simple hand-designed rules, which may not find the most effective search strategies and generalize poorly. We observe that although scheduling problems are difficult, confirming the correctness or satisfiability of a candidate schedule is much simpler. This makes it possible to use a Reinforcement Learning (RL) framework that learns by trial and error from the implicit data distribution and derives better search strategies.

We propose DeepScheduler, a fast and scalable TSN/CaaS scheduling method based on deep reinforcement learning. Unlike previous methods that heavily rely on expert knowledge or problem-specific assumptions, DeepScheduler can automatically learn effective scheduling strategies from the complex relationships between data flows. As shown in Figure 5, DeepScheduler treats TSN/CaaS scheduling as a multi-step decision problem, where a neural network-based agent takes the current state of the problem as input and outputs scheduling actions. The environment model maintains the current state based on the agent’s output and provides feedback rewards to guide model training. To address the challenges faced by deep reinforcement learning in TSN/CaaS scheduling, we propose a scalable state information encoder based on Graph Neural Networks (GNN), adaptively modeling the complex relationships between network topology and link states; we design a path-based flow-aware encoder that combines the current network state and traffic demands from the perspective of possible routing; we decompose the scheduling decision space into two interdependent sub-tasks to reduce the problem size for each RL agent, enabling effective exploration of the entire search space. Additionally, we introduce hard sample mining and incremental training techniques to stabilize the model training process.

Figure 5: Network Scheduler Based on Deep Reinforcement Learning
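The multi-step decision loop between agent and environment can be illustrated with a toy example. This is not DeepScheduler: the GNN encoder, the decomposition into two sub-tasks, and the training procedure are all omitted, and a random policy stands in for the neural agent. The `SlotEnv` class, the single shared link, and the reward values of +1 and -1 are assumptions made for the sketch.

```python
import random


class SlotEnv:
    """Toy environment: place flows one at a time on a single shared link."""

    def __init__(self, flow_lengths, cycle):
        self.flow_lengths = list(flow_lengths)  # slots needed by each flow (non-empty)
        self.cycle = cycle                      # number of slots per cycle
        self.reset()

    def reset(self):
        self.busy = [False] * self.cycle        # state: slot occupancy on the link
        self.next_flow = 0
        return self.state()

    def state(self):
        return tuple(self.busy), self.next_flow

    def step(self, start):
        """Try to place the next flow at slot `start`; return (state, reward, done)."""
        length = self.flow_lengths[self.next_flow]
        window = range(start, start + length)
        if start < 0 or start + length > self.cycle or any(self.busy[t] for t in window):
            return self.state(), -1.0, True     # infeasible action ends the episode
        for t in window:
            self.busy[t] = True
        self.next_flow += 1
        done = self.next_flow == len(self.flow_lengths)
        return self.state(), 1.0, done


def random_episode(env, rng):
    """Run one episode with a random policy standing in for the learned agent."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done = env.step(rng.randrange(env.cycle))
        total += reward
    return total


env = SlotEnv(flow_lengths=[2, 3, 1], cycle=8)
rng = random.Random(0)
best = max(random_episode(env, rng) for _ in range(200))
print("best return over 200 random episodes:", best)
```

The environment's check on each action is cheap, which is the property noted earlier: verifying a candidate placement is far easier than finding a full schedule. A learned policy would replace the random choice of `start` with one conditioned on the state.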

We evaluated DeepScheduler against various existing scheduling methods on simulation and physical test platforms. The experimental results show that, compared with existing heuristic search-based methods (tabu search, genetic algorithm) and expert knowledge-based methods (data flow priority, time slot allocation), DeepScheduler is over 150 times and 5 times faster, respectively, solving all test cases within seconds, and improves schedulability by 36% and 39%, respectively. This efficiency and high success rate mean that traffic scheduling is no longer a barrier to flexible manufacturing.

Design and Implementation of CaaS Switch

To realize the CaaS network architecture, we independently developed the “Ziggo” CaaS industrial switch and a test and analysis tool (http://tns.thss.tsinghua.edu.cn/ziggo/). The switch adopts an FPGA-based hardware-software co-design and fully supports TSN protocols including IEEE 802.1AS, 802.1Qav, 802.1Qbv, and 802.1Qcc. It carries IT and OT traffic on the same network, provides deterministic forwarding and ultra-low-latency transmission of critical data traffic, and achieves nanosecond-level time synchronization and microsecond-level per-hop delay jitter.

Figure 6: “Ziggo” CaaS Industrial Switch

The “Ziggo” CaaS industrial switch is implemented based on the Xilinx ZYNQ-7000 System on Chip (SoC). As shown in Figure 6, this SoC combines the hardware programmability of FPGA with the software programmability of ARM-based processors, aligning with the design philosophy of CaaS. The switching logic structure is implemented in hardware using Verilog, with each port in the switch structure having three PS-PL interaction channels, used for control task I/O data, CaaS metadata, and background traffic, respectively. The traffic shaper is implemented using BRAM, controlling the sending status of each queue based on the scheduling results issued by the control plane. We implemented the time synchronization protocol based on IEEE 802.1AS, with timestamps and high-precision clocks running in PL, while synchronization-related algorithms run in the PS part. To support deterministic computation, we replaced the hardware I/O layer of the open-source project OpenPLC with our packetized PLC I/O layer. We also connected the real-time clock module of PL to PS via an AXI-lite interface, enabling a Linux driver to access globally synchronized time. Meanwhile, we performed core isolation in the CaaS Linux system to ensure the execution of computing tasks.
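How a traffic shaper turns scheduling results into per-queue send states can be sketched with a gate control list in the style of IEEE 802.1Qbv: a cyclic table of time intervals, each with a mask of queues whose gates are open. The sketch below is a software illustration only. The table contents, the assignment of queue 7 to control data, and the function name are assumptions, and it says nothing about how Ziggo lays the table out in BRAM.

```python
from bisect import bisect_right
from itertools import accumulate

# (duration_ns, open-queue bitmask); bit i set means queue i may transmit.
# All entries are hypothetical values for illustration.
GCL = [
    (200_000, 0b1000_0000),  # queue 7 only: time-sensitive control data
    (700_000, 0b0111_1111),  # queues 0-6: background traffic
    (100_000, 0b0000_0000),  # guard band: all gates closed
]
CYCLE_NS = sum(duration for duration, _ in GCL)
_ENDS = list(accumulate(duration for duration, _ in GCL))  # end time of each entry


def open_queues(t_ns: int) -> list:
    """Queues whose gate is open at synchronized time t_ns."""
    phase = t_ns % CYCLE_NS                      # position inside the current cycle
    _, mask = GCL[bisect_right(_ENDS, phase)]    # entry whose interval contains phase
    return [q for q in range(8) if mask >> q & 1]


print(open_queues(0))          # [7]
print(open_queues(1_250_000))  # [0, 1, 2, 3, 4, 5, 6]
print(open_queues(950_000))    # []
```

Because the lookup is driven by the synchronized time, every switch opens and closes its gates in step with the global schedule issued by the control plane.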

CaaS System Performance Testing

We conducted network configuration and system performance testing on the CaaS switch under different network topologies and control task requirements. The goal of CaaS is to achieve a flexible industrial control system that meets the deterministic requirements of critical control tasks. Since the systems of general PLCs and industrial switches are closed and proprietary, they cannot be directly compared with CaaS. Therefore, we constructed two baseline methods based on TSN and OpenPLC with a single DMA: Baseline w/o GTA and Baseline w/ GTA. The latter also incorporates global time-aware (GTA) task computation.

As shown in Figure 7, we conducted 50 experiments and collected more than ten thousand samples under the A380 and ring (Ring6) topologies. Overall, CaaS achieved zero packet loss and reduced latency by 42% to 45%, while network transmission jitter decreased by three orders of magnitude.

Figure 7: CaaS System Performance Testing

Figures 7(b) and 7(f) show the packet loss rate results. In both topologies, the packet loss rate of Baseline w/o GTA is 1 (that is, 100%), meaning that no packets are transmitted as planned, because the PLC cannot obtain globally synchronized time. We therefore introduced the second baseline, Baseline w/ GTA. The cumulative distribution function (CDF) shows that CaaS achieved zero packet loss on both testbeds, while the 99th-percentile packet loss rate for Baseline w/ GTA was 3.18% (A380) and 2.47% (Ring6).

The distribution of average latency is shown in Figures 7(c) and 7(g). CaaS exhibits a more concentrated latency distribution and lower average latency than the baseline. For example, in Figure 7(c), the latency of most tasks falls within a narrow range around 1.31 ms, while the latency of Baseline w/ GTA is spread between 1.46 ms and 3.24 ms. On average, CaaS’s latency is 42% to 45% lower than the baseline. This improvement has two causes. First, the task-network joint scheduling algorithm can optimize globally by adjusting task scheduling and traffic scheduling simultaneously, while the two-step scheduling method can only find local minima. Second, the computing forwarding plane ensures the determinism of PL-PS communication and task execution. As a result, CaaS follows the planned data flow schedule precisely, while the baseline methods consistently deviate from it.

As shown in Figures 7(d) and 7(h), the jitter of CaaS is almost negligible compared with the baseline methods. Specifically, the median jitter of CaaS is 4.6 μs on A380 and 3.7 μs on Ring6, while that of the baseline methods is 5.4 ms and 5.6 ms, a difference of three orders of magnitude. The main reason is the lack of core isolation in the baselines, which leads to uncertainty in task execution times.
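The "three orders of magnitude" statement can be checked directly from the median values quoted above. The short calculation below uses only those four numbers; the function and variable names are illustrative.

```python
import math

# Median jitter in microseconds, taken from the text.
CAAS_JITTER_US = {"A380": 4.6, "Ring6": 3.7}
BASELINE_JITTER_US = {"A380": 5400.0, "Ring6": 5600.0}  # 5.4 ms and 5.6 ms


def jitter_ratio(topology: str) -> float:
    """How many times larger the baseline jitter is than the CaaS jitter."""
    return BASELINE_JITTER_US[topology] / CAAS_JITTER_US[topology]


for topo in CAAS_JITTER_US:
    ratio = jitter_ratio(topo)
    print(f"{topo}: {ratio:.0f}x, about {math.log10(ratio):.2f} orders of magnitude")
```

Both ratios fall between 1,000 and 10,000, which is consistent with the stated three orders of magnitude.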

Conclusion

This article presents a real-time industrial network architecture supporting CaaS. The architecture decouples control tasks from dedicated controllers and deploys them flexibly to any switch in the network through the collaborative scheduling of control tasks and network traffic, virtualizing the entire industrial control network into a generic controller. Using a hardware-software co-design approach, we developed network switches and protocol stacks that support CaaS and designed a series of mechanisms that ensure network determinism, computation determinism, and hardware-software data exchange determinism, ultimately guaranteeing end-to-end determinism of industrial control tasks from input to output. CaaS will drive industrial control networks toward decentralization, virtualization, and service orientation.

Yang Zheng

Senior Member of CCF. Associate Professor and PhD Supervisor at Tsinghua University. IEEE Fellow. Main research directions include IoT and Industrial Internet.

Zhao Yi

Student Member of CCF. PhD student at the School of Software, Tsinghua University. Main research directions include IoT, edge computing, and time-sensitive networking.

He Xiaowu

Student Member of CCF. PhD student at the School of Software, Tsinghua University. Main research directions include Industrial Internet and edge computing.

Other authors: Dang Fan

Special Statement: The China Computer Federation (CCF) holds all copyrights to the content published in the “Communications of the China Computer Federation” (CCCF). Without CCF’s permission, it is prohibited to reproduce any text or photos from this publication; otherwise, it will be considered an infringement. CCF will pursue legal responsibility for any infringement behavior.
