Designing Communication Protocols in Embedded Systems

In a company's projects, embedded devices of every size are everywhere. Because they usually form parts of one larger system, they have to communicate with each other, and that is where protocol problems begin.

Many engineers regard protocol design as relatively simple and mostly a matter of laying out messages. In most cases the application scenario is simple, with no complex interactions, and this approach causes no significant problems. Even in simple scenarios, though, some protocols run into unexpected trouble in practice. The root cause is usually an incomplete understanding of the rules behind protocol design, which are briefly discussed below.

Problems faced in protocol design:

1. Designers often start from the application perspective, considering only the basic requirements without taking into account the need for extensibility;

2. Seen through the OSI seven-layer model, protocols are often designed from the upper layers down, neglecting the characteristics of the physical layer in use (RS485/RS232, I2C, CAN, Ethernet). The resulting design is not tailored to the particular link and hides potential problems;

3. Insufficient consideration for fault tolerance and efficiency.
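
The extensibility gap in item 1 is commonly addressed with a version field exchanged during the handshake. The following is a minimal sketch in C, assuming a hypothetical two-byte hello message. The names `hello_msg_t` and `check_peer_version`, the version constants, and the major/minor compatibility rule are all illustrative assumptions, not part of any protocol described here.

```c
#include <stdint.h>

/* Hypothetical handshake payload; both fields are illustrative. */
typedef struct {
    uint8_t version_major;  /* assumed: bumped on incompatible changes */
    uint8_t version_minor;  /* assumed: bumped on compatible additions */
} hello_msg_t;

/* Version implemented by this node (example values only). */
#define MY_VERSION_MAJOR 2u
#define MY_VERSION_MINOR 1u

typedef enum { PEER_REJECT, PEER_OK, PEER_OK_LIMITED } peer_compat_t;

/* Decide how to talk to a peer based on the version it announced. */
peer_compat_t check_peer_version(const hello_msg_t *peer)
{
    if (peer->version_major != MY_VERSION_MAJOR) {
        return PEER_REJECT;      /* incompatible message set */
    }
    if (peer->version_minor < MY_VERSION_MINOR) {
        return PEER_OK_LIMITED;  /* only use features the older peer knows */
    }
    return PEER_OK;
}
```

Under this assumed rule, a newer node can keep talking to an older one by restricting itself to the common feature set instead of failing outright.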

Basic requirements are of course necessary for the system to perform its fundamental functions. But requirement definitions are often incomplete, and system designers cannot foresee everything. When a protocol carries no version number and is never tested for compatibility, old and new products end up speaking incompatible protocols, and this cannot easily be patched in software afterwards. This is a common problem. The simplest remedy is to include a protocol version number in the handshake, so that each side can determine what the other supports and future software can stay compatible.

A protocol may look like nothing more than message design, but like a person it plays different roles in different contexts: forever a child in the eyes of their parents, a parent to their own children, and a friend among peers. All of these aspects together make up the complete individual. UML diagrams provide different views of one system, and protocols are multifaceted in the same way. The messages are only the static side of a protocol; its dynamic behavior is equally important. For example: What happens when an error occurs? Should the message be resent, and how many times? How are faulty nodes removed from the network? What constitutes one complete communication exchange, and how long does it take in the best and worst cases? Who initiates communication? For important messages that must be reliable, how do we confirm that the receiver has completely and correctly received and executed the message? These questions often reach beyond the messages themselves and into the design of the overall system.

Here is a small example. In half-duplex RS485 communication, the master sends data to a slave and wants the slave to store it reliably. The slave receives and validates the data, writes it to its storage device, and then reports success or failure to the master. The difficulty is that RS485 here is a master/slave bus on which only one node can transmit at a time, so a slave can respond only when the master addresses it. If the write takes a long time, the slave's response is delayed by the same amount while the master waits, and with many slaves the total time becomes critical. One possible modification is for the slave to acknowledge receipt immediately, before writing. The master can then go on distributing data to the other slaves and, once distribution is finished, query each slave to check whether its write succeeded. A system-level solution is also possible: if the slave can guarantee that data is written successfully once it has been received, the master no longer needs to check write results at all, and the software design becomes considerably simpler.
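
The "acknowledge first, report the write result later" modification can be sketched from the slave's side as a small state machine. This is only an illustration under stated assumptions: the type `slave_t`, the status values, the 32-byte buffer, and the three handler names are hypothetical, and the actual storage write is assumed to happen in a background task that is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { WRITE_IDLE, WRITE_PENDING, WRITE_OK, WRITE_FAILED } write_status_t;

/* Hypothetical slave-side state; the buffer size is an example value. */
typedef struct {
    write_status_t status;
    uint8_t        pending[32];
    size_t         pending_len;
} slave_t;

/* Called when a validated data frame arrives.
 * Returns 1 (ACK) at once; the slow storage write is deferred.
 * Returns 0 (NAK) if the data does not fit or a write is still running. */
int slave_on_data(slave_t *s, const uint8_t *buf, size_t len)
{
    if (len > sizeof s->pending || s->status == WRITE_PENDING) {
        return 0;
    }
    memcpy(s->pending, buf, len);
    s->pending_len = len;
    s->status = WRITE_PENDING;
    return 1;
}

/* Called by the (assumed) background task when the storage write ends. */
void slave_on_write_done(slave_t *s, int success)
{
    if (s->status == WRITE_PENDING) {
        s->status = success ? WRITE_OK : WRITE_FAILED;
    }
}

/* Called when the master later polls for the write result. */
write_status_t slave_on_query(const slave_t *s)
{
    return s->status;
}
```

The master then needs two passes over its slave list: one to send the data and collect the immediate acknowledgements, and one to call the query on each slave until it stops reporting `WRITE_PENDING`.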

RS485/RS232 links can also operate in full-duplex mode, but this is less common in practice. Apart from saving wiring, the main reason is that RS485/RS232 has no collision detection: either the right to act as master is passed around in turn, or a single master polls the others. Communication therefore usually follows a question-and-answer pattern, which fits half-duplex operation well. Some full-duplex applications do exist, but half-duplex remains predominant. Protocol design here mainly targets point-to-point exchanges. With multicast and broadcast it is hard to confirm that every slave has received the data, and having to query each slave afterwards complicates the software considerably.

RS485/RS232 communication has its own error detection in the form of the parity bit. Parity is a simple mechanism, but it cannot detect every error: an even number of flipped bits in one character passes unnoticed. A protocol that must be reliable may therefore still need its own checksum or CRC. Although a CRC can be computed with lookup tables, it still takes more time than parity or a simple checksum, and in some real-time or low-end applications this overhead can be too much. So if messages are short, parity plus a checksum is still a workable choice; as messages grow, consider CRC8 first, then CRC16, then CRC32, rather than applying one scheme everywhere.
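
To make the checksum/CRC trade-off concrete, here is a minimal sketch in C of three options of increasing strength. The parameter choices are assumptions made for illustration: an 8-bit additive checksum, CRC-8 with polynomial 0x07 and initial value 0x00, and CRC-16 with polynomial 0x1021 and initial value 0xFFFF (often called CCITT-FALSE). The source does not say which polynomials the author uses. The bitwise loops favor clarity; a table-driven version trades memory for speed.

```c
#include <stddef.h>
#include <stdint.h>

/* 8-bit additive checksum: cheapest, but blind to reordered bytes. */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++) {
        sum = (uint8_t)(sum + data[i]);
    }
    return sum;
}

/* CRC-8, polynomial 0x07, init 0x00, MSB first (assumed parameters). */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0x00;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x07u)
                                : (uint8_t)(crc << 1);
        }
    }
    return crc;
}

/* CRC-16, polynomial 0x1021, init 0xFFFF, MSB first (assumed parameters). */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int b = 0; b < 8; b++) {
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

With these parameters, the standard test string "123456789" should give 0xF4 for the CRC-8 and 0x29B1 for the CRC-16, which is a quick way to check an implementation on a new target.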

I2C is generally used only for board-level communication, although there is now a trend towards using it as a field bus. I2C was designed from the start to support multiple masters and multiple slaves: two masters may begin transmitting at the same time, and the one that wins arbitration continues. Arbitration does not mean that data can flow in both directions at once, however. The roles of master and slave remain distinct, with the master addressing a slave and the slave responding. Modern CPUs with I2C hardware support both master and slave modes, but not at the same moment: during any one transfer a node is either master or slave, every transfer is initiated by the master, and the slave only reacts. In this respect I2C is essentially no different from master/slave communication over RS232. Moreover, the I2C physical layer makes it less flexible than RS232, because it can only operate in half-duplex mode. It is still very useful for exchanging simple information between two CPUs when no spare RS232 port is available. Because the communication stays on the board, errors are unlikely once signal integrity is ensured, so additional verification is usually unnecessary.
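
The arbitration rule mentioned above can be illustrated with a small simulation. I2C lines are open-drain, so a 0 from any master pulls the data line low, and a master that sends 1 but reads back 0 has lost and must stop. The sketch below models one byte sent by two masters at once. The function name `i2c_arbitrate_byte` and its return convention are assumptions made for this example; real hardware performs this check bit by bit in the peripheral.

```c
#include <stdint.h>

/* Simulates two masters driving the same byte slot on a wired-AND line.
 * Returns 0 if master A keeps the bus, 1 if master B keeps it,
 * and -1 if both sent identical bytes (no loser yet). */
int i2c_arbitrate_byte(uint8_t a, uint8_t b)
{
    for (int bit = 7; bit >= 0; bit--) {       /* I2C sends MSB first */
        unsigned a_bit = (a >> bit) & 1u;
        unsigned b_bit = (b >> bit) & 1u;
        unsigned sda   = a_bit & b_bit;        /* any 0 pulls the line low */

        if (a_bit != sda) {
            return 1;   /* A released the line but reads low: A loses */
        }
        if (b_bit != sda) {
            return 0;   /* B released the line but reads low: B loses */
        }
    }
    return -1;
}
```

The winner's byte is never corrupted by the contest, which is why the winning master can simply continue its transfer.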

Ethernet and the CAN bus are similar in that all nodes are peers: there is no master/slave distinction, and any node can initiate communication. Contention for the bus is resolved without software involvement. CAN uses bitwise arbitration, so a node that loses simply retries later, and shared-medium Ethernet uses collision detection with retransmission. This makes protocol design much easier. Take the earlier broadcast problem: there is no need to poll for confirmations as on RS232/RS485, because a device that has a problem can report it directly. On RS232/RS485 a node can only report a problem when the master polls it. Ethernet and CAN have no such limitation, so problems and emergencies can be reported and handled promptly.

If communication over Ethernet uses TCP, efficiency may be lower, but TCP guarantees properties such as in-order, reliable delivery. Some systems use UDP or raw MAC-layer frames for efficiency, and these give no ordering guarantee: packets that travel different paths can arrive out of order. In that case it is best to design conservatively, so that the arrival order of messages cannot cause errors. Ethernet frames already carry a CRC32 frame check sequence, so an additional integrity check in the protocol is usually unnecessary.
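
One conservative strategy for UDP or MAC-layer messaging is to tag each message with a sequence number and drop anything stale or duplicated. The sketch below assumes a hypothetical 16-bit counter that wraps around, and assumes that late messages may be discarded rather than reordered, which suits status or command updates where only the newest value matters. The names `seq_filter_t` and `seq_accept` are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Receiver-side state for one sender (hypothetical layout). */
typedef struct {
    uint16_t last_seq;   /* newest sequence number accepted so far */
    bool     have_seq;   /* false until the first message arrives */
} seq_filter_t;

/* Returns true if the message is newer than anything seen so far.
 * Returns false for duplicates and for messages that arrive late.
 * The modulo-65536 difference keeps the test correct across wraparound,
 * as long as messages are less than 32768 sequence numbers apart. */
bool seq_accept(seq_filter_t *f, uint16_t seq)
{
    if (f->have_seq) {
        uint16_t diff = (uint16_t)(seq - f->last_seq);
        if (diff == 0u || diff >= 0x8000u) {
            return false;   /* duplicate or older than last accepted */
        }
    }
    f->last_seq = seq;
    f->have_seq = true;
    return true;
}
```

A receiver would keep one such filter per sender and call `seq_accept` before acting on each message.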

The efficiency of a protocol is a complex topic. Take RS232 as an example: with one start bit, eight data bits, one stop bit, and no parity, each byte takes 10 bits on the wire. At 9600 bps, at most 960 bytes can be transmitted per second, or roughly one byte every 1 ms. Longer stop bits reduce the number of useful bytes further. The protocol's carrying capacity is the amount of useful information divided by the total number of bytes transported on the bus. Multicast and broadcast can clearly improve efficiency a great deal, but for RS232 broadcasting may not be a good choice, especially when confirmation is required.

Raising the baud rate is another option, but it brings its own problem. At 1 Mbps with one start bit, one stop bit, and no parity, a byte takes only 10 us, which is too fast for a typical CPU to service byte by byte, so DMA may be required for reception. Using DMA in turn raises the question of variable-length versus fixed-length messages. With a variable-length protocol the receiver must determine dynamically whether a complete packet has arrived, whereas a fixed-length protocol has a clear advantage for high-speed RS232 communication because it greatly reduces that processing.

A fixed-length protocol requires choosing the length. In general, the length of the most frequently occurring message is used for all messages, and very long messages that occur rarely are split into several fixed-length messages. In our system, for example, the length of the system control command is used for every message: about 80% of messages are control commands, and the remaining 20% are less frequent messages such as firmware upgrades, which are inherently large and do not fit in a single message. These are broken into pieces the size of a control command message.
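
Splitting a long message into control-command-sized pieces can be sketched as follows. Everything concrete here is an assumption for illustration: the 8-byte payload size, the header fields of `frame_t`, and the function name `fragment`. The source does not describe the author's actual frame layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FRAME_PAYLOAD 8u   /* hypothetical control-command payload size */

/* Every frame on the wire has this same size (assumed layout). */
typedef struct {
    uint8_t type;                    /* message type, e.g. firmware data */
    uint8_t index;                   /* position of this fragment */
    uint8_t total;                   /* number of fragments in the message */
    uint8_t used;                    /* valid payload bytes in this frame */
    uint8_t payload[FRAME_PAYLOAD];  /* zero-padded to the fixed size */
} frame_t;

/* Splits msg into fixed-length frames.
 * Returns the number of frames written, or 0 if they do not fit in out
 * or the message would need more than 255 fragments. */
size_t fragment(uint8_t type, const uint8_t *msg, size_t len,
                frame_t *out, size_t max_frames)
{
    size_t total = (len + FRAME_PAYLOAD - 1u) / FRAME_PAYLOAD;
    if (total == 0u) {
        total = 1u;   /* an empty message still occupies one frame */
    }
    if (total > max_frames || total > 255u) {
        return 0;
    }
    for (size_t i = 0; i < total; i++) {
        size_t off = i * FRAME_PAYLOAD;
        size_t n = len - off;
        if (n > FRAME_PAYLOAD) {
            n = FRAME_PAYLOAD;
        }
        memset(&out[i], 0, sizeof out[i]);
        out[i].type  = type;
        out[i].index = (uint8_t)i;
        out[i].total = (uint8_t)total;
        out[i].used  = (uint8_t)n;
        if (n > 0u) {
            memcpy(out[i].payload, msg + off, n);
        }
    }
    return total;
}
```

Because each frame carries its own index and total, the receiver can place a fragment without knowing anything about the frames before or after it.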
Although the pieces may look scattered, each packet remains independent and can be sent on its own, which minimizes coupling. If one packet of a large message is corrupted in transmission, only that packet needs to be resent, without restarting the whole communication sequence. With sensible design of this kind, the efficiency of the protocol improves naturally.
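
The byte-time and carrying-capacity figures used in this section can be reproduced with a few helper functions. This is a sketch for checking the arithmetic, assuming 8 data bits per character; the function names are made up for this example.

```c
#include <stdint.h>

/* Bits on the wire per data byte: 1 start + 8 data + parity + stop. */
uint32_t uart_bits_per_byte(uint32_t parity_bits, uint32_t stop_bits)
{
    return 1u + 8u + parity_bits + stop_bits;
}

/* Upper bound on bytes per second at a given baud rate. */
uint32_t uart_bytes_per_second(uint32_t baud, uint32_t bits_per_byte)
{
    if (bits_per_byte == 0u) {
        return 0u;
    }
    return baud / bits_per_byte;
}

/* Time to send one byte, in nanoseconds (rounded down). */
uint32_t uart_byte_time_ns(uint32_t baud, uint32_t bits_per_byte)
{
    if (baud == 0u) {
        return 0u;
    }
    return (uint32_t)((uint64_t)bits_per_byte * 1000000000u / baud);
}

/* Carrying capacity in percent: useful bytes / total bytes in a frame. */
uint32_t frame_efficiency_percent(uint32_t payload_bytes,
                                  uint32_t overhead_bytes)
{
    uint32_t total = payload_bytes + overhead_bytes;
    if (total == 0u) {
        return 0u;
    }
    return payload_bytes * 100u / total;
}
```

For the two cases in the text, 10 bits per byte gives 960 bytes per second and about 1.04 ms per byte at 9600 bps, and 10 us per byte at 1 Mbps. A frame with 8 payload bytes and 4 bytes of header and check data would have a carrying capacity of about 66%.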

Ethernet is more forgiving to design for, because the underlying layers are robust and take care of many tasks, so packet length matters less. The key is to get the communication model right. An embedded server must make sure that half-open TCP connections do not tie up too many resources. With UDP or MAC-layer protocols, the protocol must not depend on the order in which messages arrive. If a large message has to be split into three packets, it should make no difference in which order the three arrive, and if any one of them is corrupted, it can simply be resent by itself. Ethernet's high speed gives a protocol plenty of carrying capacity. Ethernet also supports multicast and broadcast well, but excessive broadcasting in practice can lead to broadcast storms that severely degrade network performance. Virtual local area networks (VLANs) are used to contain broadcast traffic and improve efficiency. In real systems, broadcast protocols should be used sparingly so that they do not slow down the whole network.
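
Order-independent reassembly of a split message can be sketched with a bitmap of received pieces. The sizes (4-byte chunks, at most 32 chunks) and the names `reassembly_t`, `reasm_init`, `reasm_add`, `reasm_complete`, and `reasm_first_missing` are assumptions chosen to keep the example short; they are not taken from any particular stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 4u    /* hypothetical fixed chunk size */
#define MAX_CHUNKS 32u   /* one bit per chunk in a 32-bit map */

typedef struct {
    uint8_t  data[CHUNK_SIZE * MAX_CHUNKS];
    uint32_t received;   /* bit i is set once chunk i has arrived */
    uint8_t  total;      /* number of chunks expected */
} reassembly_t;

void reasm_init(reassembly_t *r, uint8_t total)
{
    memset(r, 0, sizeof *r);
    r->total = total;
}

/* Stores a chunk at the position given by its index, in any order.
 * Duplicates simply overwrite the same slot. Returns false if the
 * index is out of range. */
bool reasm_add(reassembly_t *r, uint8_t index,
               const uint8_t chunk[CHUNK_SIZE])
{
    if (r->total > MAX_CHUNKS || index >= r->total) {
        return false;
    }
    memcpy(&r->data[(size_t)index * CHUNK_SIZE], chunk, CHUNK_SIZE);
    r->received |= (uint32_t)1u << index;
    return true;
}

/* True once every expected chunk has arrived. */
bool reasm_complete(const reassembly_t *r)
{
    uint32_t want;
    if (r->total == 0u || r->total > MAX_CHUNKS) {
        return false;
    }
    want = (r->total == 32u) ? 0xFFFFFFFFu
                             : (((uint32_t)1u << r->total) - 1u);
    return r->received == want;
}

/* Index of the first missing chunk, for a selective resend; -1 if none. */
int reasm_first_missing(const reassembly_t *r)
{
    for (uint8_t i = 0; i < r->total && i < MAX_CHUNKS; i++) {
        if ((r->received & ((uint32_t)1u << i)) == 0u) {
            return (int)i;
        }
    }
    return -1;
}
```

A receiver using this approach can request only the chunk reported by `reasm_first_missing` instead of asking the sender to restart the whole transfer.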
