Several Undesirable Structures/Strategies in Railway Safety Embedded Software

In today's climate of empowering and innovating railway products, safety products included, discussing such fundamental issues can feel like crawling out of a pile of old papers. Yet the designs we occasionally see are a reminder that, however impressive it is to run smoothly, ensuring stability is always the most basic and most important thing. Technically, the major domestic signaling suppliers probably achieved stable safety more than a decade ago. For various reasons, however (perhaps the long absence of significant accidents, the need to survive in a competitive environment, the lack of knowledge dissemination, or the fearlessness of newcomers), the actual solutions, the handling of human error, the level of design, the quality of implementation, personal initiative, and staff experience in safety products have likely declined in recent years. So let the old papers be what they are; there are several undesirable structures and strategies in safety software that are worth pointing out. Note that a non-recommended structure is not an unworkable one; it simply leaves more room for systematic failures. The issues described here come mainly from major domestic manufacturers, which suggests they may be common across the industry. (But let us not sell ourselves short: I have seen many designs from Japan and Europe that are far from ideal, even though they hold so-called safety certifications.) I originally meant to sum this up in a few sentences, but once I had written the introduction I found I had more to say. This discussion is not specifically about BTM, but BTM's characteristics make it particularly prone to such issues. I have previously tried to summarize the key safety points of BTM design; today I revisit them briefly from a different angle, in the hope that this is still useful.

Furthermore, using BTM as an example does not imply that BTM necessarily has these issues; rather, it suggests that BTM is likely to encounter such problems. All examples are merely for illustrative purposes; please do not take them personally.

Argument:

Railway safety software should not use event-triggered logic as the main driver of the software or of its core safety logic. A system built this way lacks predictability: timeout protection cannot be designed around predicted timing, the maximum interval between external events cannot be bounded, and therefore the SDT of fault diagnosis (whether by self-check or by cross-check) cannot be determined.
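As a rough illustration of the argument, the sketch below keeps the external event (a frame arriving from the onboard core) out of the safety logic: the receive side only buffers, and a fixed cycle does all the processing and checks for silence. It is a minimal simulation in Python, not target code; the class `TimeTriggeredBInterface`, the frame name, and the values `CYCLE_MS` and `REQUEST_TIMEOUT_MS` are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass, field

CYCLE_MS = 10            # assumed fixed cycle period
REQUEST_TIMEOUT_MS = 50  # assumed maximum tolerated silence from the onboard core


@dataclass
class TimeTriggeredBInterface:
    """Hypothetical B interface front end driven by a fixed cycle."""
    mailbox: deque = field(default_factory=deque)  # filled by the receive driver
    now_ms: int = 0
    last_request_ms: int = 0
    safe_state: bool = False
    handled: list = field(default_factory=list)

    def on_receive(self, frame):
        """Event side: only buffers the frame, never runs safety logic."""
        self.mailbox.append(frame)

    def cycle(self):
        """Time side: runs once per CYCLE_MS whether or not anything arrived."""
        self.now_ms += CYCLE_MS
        if self.safe_state:
            return  # latched: nothing is processed after the safe reaction
        while self.mailbox:
            self.handled.append(self.mailbox.popleft())
            self.last_request_ms = self.now_ms
        if self.now_ms - self.last_request_ms > REQUEST_TIMEOUT_MS:
            self.safe_state = True  # silence detected within a bounded time
```

Because `cycle()` runs regardless of traffic, silence from the onboard core is noticed at the latest one cycle after the timeout expires (here 60 ms), a bound that a purely event-driven handler cannot give.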

Taking BTM as an example, BTM's background:

The BTM (balise transmission module), the onboard unit that reads the responders (balises), is essentially part of the onboard automatic train protection (ATP) system. In China, BTM is developed as a standalone product that tries to fit a variety of ATP/LKJ systems (the onboard core) by continually adding protocols to its B interface logic. BTM software falls roughly into two parts: (a) the B interface part, responsible for communication and data processing with the onboard core; (b) the A interface part, responsible for decoding the messages received from the responder. BTM also retains commands and mechanisms that are "nominally" controlled by the onboard core, such as switching the amplifier, triggering self-checks, and requesting status. This can open a gap between the onboard core developers and the BTM developers, particularly over the commands ATP issues and the feedback it receives. ATP treats these as routine information because, from its point of view, BTM is merely a subordinate black box that does what it is told. Many BTM systems I have seen, however, have their own ideas about the commands they receive and the statuses they report: they often give priority to their own view and sometimes only pretend to comply with the onboard core. Given that BTM is an independently developed safety component and has addressed its own hidden risks, this behavior is necessary. A BTM that must adapt to different ATPs, as a subordinate device with limited authority, cannot rely on ATP's overall protection to mitigate residual risks; it has to handle every potential hidden issue itself.

Taking BTM as an example, BTM's characteristics:

It follows that: (a) all B interface protocols that BTM supports are essentially defined by the onboard core side, and BTM products have very little say over the protocol definitions, including their protection fields and data fields; (b) because BTM is essentially a front-end subsystem of ATP, the B interface software architecture of most BTMs is protocol-driven. The protocol type (broadly two: a client/server (C/S) request-response mode with the onboard core, and an active sending mode) tends to become the deciding factor in the software architecture, so the B interface side is often designed to act on external events, that is, on calls from the onboard core.

Moreover, although the A interface software can keep one consistent processing method, its external inputs are also event-triggered: whether a responder is being passed, and whether a message is received.

Most of the logic behind the B interface protocols, and the A interface logic, are therefore event-triggered. This makes BTM very unusual; I personally think that, among railway signaling safety devices, only the responder itself is more unusual. The reasons are the event triggering itself and the need to meet the safety requirements of the responder transmission system (antenna self-check). Compared with general railway signaling safety products, the safety certification of the responder and of BTM is quite out of the ordinary. Our safety principles always prefer time-triggered to event-triggered designs, because event triggering brings a series of problems. In BTM, for example, a common one is that at least part of the 2-out-of-2 timing (even the self-check cycle) is no longer fixed, and in particular the maximum SDT cannot be determined.

The real issue discussed today:

This is in fact a core problem of BTM software architecture: how to design the architecture as a whole so that it adapts to the requirements of different B interface protocols while keeping the core safety processing, such as the 2-out-of-2 comparison and the self-checks, on a time-triggered structure as far as possible, that is, without using external events of uncertain timing as trigger conditions. Generalizing, this is the point I want to make today: railway safety software should not use event-triggered logic as the main driver of the software or of its core safety logic. A system built this way lacks predictability: timeout protection cannot be designed around predicted timing, the maximum interval between external events cannot be bounded, and therefore the SDT of fault diagnosis (whether by self-check or by cross-check) cannot be determined.

I have noticed that many colleagues in the railway sector seem fond of event-triggered structures in their software logic designs. Depending on the product's characteristics and on supplementary protective measures, we have accepted some of these designs case by case, but we definitely do not recommend the practice. BTM serves as the example because it appears to be the most "innocent" case, where the product itself pushes toward event triggering; many other software structures could avoid event-triggered forms entirely. Note also that some developers favor a main loop with a non-fixed cycle. That is not ideal either: it sacrifices the predictability of software execution and makes it impossible to build in the necessary timeout protection.
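The remark about the main loop can be made concrete with a small sketch of a fixed-cycle loop that checks its own deadline. This is only an illustration under stated assumptions, not code from any product: `run_fixed_cycle` is a hypothetical helper, and the clock and sleep functions are injectable so that the behavior can be checked without real waiting.

```python
import time


def run_fixed_cycle(tasks, period_s, cycles, clock=time.monotonic, sleep=time.sleep):
    """Run every task once per period.

    Returns the index of the first cycle that overran its period,
    or None if all cycles finished within their budget.
    """
    next_deadline = clock() + period_s
    for n in range(cycles):
        for task in tasks:
            task()
        now = clock()
        if now > next_deadline:
            # The cycle took longer than planned. A free-running loop would
            # silently stretch; here the caller is told so it can react safely.
            return n
        sleep(next_deadline - now)   # idle until the planned cycle boundary
        next_deadline += period_s    # deadlines advance on a fixed grid
    return None
```

With a fixed period, every periodic check (self-check, comparison, communication timeout) has a known worst-case latency expressed in cycles. In a loop whose cycle time varies with the workload, the same checks still run, but no upper bound on their spacing can be claimed.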

Other safety critical points in BTM design (not directly related to today’s discussion, just as supplementary information):

BTM's other safety considerations include:

a. Clock domains. Under most B interface protocols, the key information for locating a responder is the time window in which its message was decoded. This requires BTM's time to be consistent with ATP's, and the decoding instant at the A interface entrance to be stamped with the time sent by ATP. For this part of the logic, how BTM's own time runs and whether it is accurate do not matter; what matters is that it stays consistent with ATP's time. BTM nevertheless needs its own clock domain as the baseline for its operation and protection, and sometimes it also has to verify the time sent by ATP. How BTM's internal clock domains are divided, maintained, and linked is therefore a key safety issue that BTM must address.
b. Handling of sidelobes.
c. Safe decoding, that is, the safety strategies of the decoding part.
d. Antenna self-check. This is far from trivial: its strategy and implementation bear on the overall safety functions of the responder transmission system, namely the unavailability of responder crosstalk O1, O2 and the failure rate of responder group detection. The latter often leads to an SRAC stating that the antenna self-check result must not be casually disabled.
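One way to picture the clock-domain point is a small bridge between BTM's local clock and the time reference sent by ATP. The sketch below is an assumption-laden illustration, not a description of any real BTM: the class `ClockBridge`, the limits `MAX_SYNC_AGE_MS` and `MAX_OFFSET_JUMP_MS`, and the millisecond units are all invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SYNC_AGE_MS = 500    # assumed: how long an ATP time reference stays usable
MAX_OFFSET_JUMP_MS = 20  # assumed: plausible change between two ATP time updates


@dataclass
class ClockBridge:
    """Hypothetical link between the local clock domain and ATP time."""
    offset_ms: Optional[int] = None          # ATP time minus local time
    synced_at_local_ms: Optional[int] = None

    def on_atp_time(self, atp_ms: int, local_ms: int) -> bool:
        """Record an ATP time reference; reject it if it jumps implausibly."""
        new_offset = atp_ms - local_ms
        if (self.offset_ms is not None
                and abs(new_offset - self.offset_ms) > MAX_OFFSET_JUMP_MS):
            return False  # keep the previous reference and report the anomaly
        self.offset_ms = new_offset
        self.synced_at_local_ms = local_ms
        return True

    def to_atp_time(self, local_ms: int) -> Optional[int]:
        """Express a local decoding instant in ATP time, if the reference is fresh."""
        if (self.offset_ms is None
                or local_ms - self.synced_at_local_ms > MAX_SYNC_AGE_MS):
            return None  # no fresh reference: the timestamp must not be used
        return local_ms + self.offset_ms
```

The local clock never has to be accurate in absolute terms; it only measures intervals. The decoded message is stamped in ATP time through the offset, while the local domain is what detects a stale or implausible ATP reference.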

Several general SRACs applicable to BTM have been proposed to define the scenarios in which BTM may be used and its functional position within ATP. The second one below is in fact meant to compensate for the uncertainty, noted above, in the maximum interval between 2-out-of-2 comparisons in BTM's A interface part, by forcibly assigning a maximum SDT value (if no valid message is decoded, the corresponding 2-out-of-2 comparison can hardly be run).

1. BTM products communicate and perform their functions according to the B interface protocol defined by the onboard host unit, so the onboard host unit must ensure that the information it sends is trustworthy. (This again stresses the primary position of ATP and its role as protocol author, and hence the processing responsibilities it must bear. BTM can only process in accordance with the existing protocol.)

2. BTM products must meet the maximum time interval between consecutively passing adjacent responders specified for the mission profile in SUBSET-091. This assumption implicitly covers the requirement on the longest time the BTM product may remain in operation while the train is stopped on the line. (In fact, this compensates for the uncertainty of the maximum SDT of the 2-out-of-2 comparison in the A interface logic.)

3. The product is applicable to the responder transmission system defined by UNISIG SUBSET-036 and SUBSET-091 and by TB/T 3485-2017. If it is applied in other scenarios whose technical requirements are inconsistent with these specifications and standards, a thorough and complete safety analysis must be carried out before application. The SIL 4 function list of the responder transmission system is given in TB/T 3485-2017 Appendix F.1, and the operational and environmental scenarios are defined in SUBSET-036 Section 4.4.2.
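The second condition above can be pictured as a monitor that turns an assumed maximum interval between responders into an explicit bound on the A interface comparison. This is a sketch under assumptions, not a statement of how any BTM implements it: `AInterfaceSdtMonitor` is hypothetical, and `max_interval_ms` is a placeholder whose real value would have to come from the mission profile.

```python
from dataclasses import dataclass


@dataclass
class AInterfaceSdtMonitor:
    """Hypothetical monitor bounding the time between 2-out-of-2 comparisons."""
    max_interval_ms: int          # placeholder for the mission-profile bound
    last_compare_ms: int = 0
    fault: bool = False

    def on_telegram(self, channel_a: bytes, channel_b: bytes, now_ms: int) -> bool:
        """2-out-of-2: accept a message only if both channels decoded the same bits."""
        self.last_compare_ms = now_ms
        if channel_a != channel_b:
            self.fault = True     # disagreement between channels is latched
        return not self.fault

    def cycle(self, now_ms: int) -> bool:
        """Called every fixed cycle; returns True while no fault is latched."""
        if now_ms - self.last_compare_ms > self.max_interval_ms:
            self.fault = True     # no comparison within the assumed bound
        return not self.fault
```

The comparison itself still happens only when a message is decoded, so it remains event-triggered. What the application condition adds is a known upper limit on the gap between comparisons, and the cyclic check is one possible way to notice when that assumption no longer holds.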
