Several Poor Structures and Strategies in Railway Safety Embedded Software

With all the current talk of empowering and innovating railway products, safety products included, discussing such fundamental issues can feel like digging through a pile of old papers. Yet the designs we occasionally come across are a reminder that, impressive as a smoothly running product is, stability is always the most basic and most important thing. Technically, the major domestic signalling suppliers have been able to deliver stable safety for about a decade. For various reasons, though (perhaps too long without a serious accident, the pressure to survive in a competitive market, experience that was never passed on, or the fearlessness of newcomers), the quality of actual solutions, the control of human error, the level of design, the implementation results, and the initiative and experience of the people working on safety products have likely declined in recent years. So, old papers or not, there are several unrecommended structures and strategies in safety software that are worth spelling out. Note that "unrecommended" does not mean unworkable; it means the structure leaves more room for systematic failures. The issues described here come mainly from major domestic manufacturers, which points to a potential industry-wide problem. (That said, we should not sell ourselves short; I have seen many designs from Japan and Europe that are far from ideal, even though they hold so-called safety certifications.)

I originally meant to sum this up in a few sentences, but once I started on the introduction I found I had more to say. The discussion is not specifically about BTM, but BTM’s characteristics make these issues especially easy to run into in its design. I have previously tried to summarize the key safety points of BTM design; today I revisit them briefly from a different angle, in the hope that it is still useful.

Furthermore, using BTM as an example does not imply that BTM necessarily has these issues; rather, it suggests that BTM is likely to encounter such problems. All examples are merely for illustrative purposes; please do not take them personally.

Argument:

Railway safety software should not use event-triggered logic as its main software logic or as its core safety logic. A system built on such a structure is not predictable, so it cannot provide timeout protection based on expected timing. Nor can the maximum period of the external events be established, which makes it impossible to determine the SDT of fault diagnosis (self-diagnosis or mutual diagnosis).

Taking BTM as an example, BTM’s background:

The Balise Transmission Module (BTM), which reads the trackside responders (balises), is essentially part of the onboard automatic train protection (ATP) system. In China, BTM is developed as an independent system that tries to fit various ATP/LKJ systems (the onboard core) by continually adding protocols to its B interface logic. BTM software divides roughly into two parts: a. the B interface part, responsible for communication and data processing with the onboard core; b. the A interface part, responsible for decoding messages from the responder. In addition, BTM still “nominally” keeps commands and mechanisms controlled by the onboard core, such as switching amplifiers, triggering self-checks, and requesting status. A gap can therefore open between the onboard core developers and the BTM developers, particularly over the commands ATP issues and the feedback it receives. ATP treats these as routine information, because to ATP the BTM is just a subordinate black box that does what it is told. Most BTM systems I have seen, however, form their own view of the commands received and the statuses reported, giving priority to their own judgment and complying with the upper level only superficially. Given that BTM is an independently developed safety component that has to deal with its own hazards, this approach is necessary: BTM must adapt to different ATPs and is a subordinate device with limited authority, so it cannot rely on ATP’s overall protection to mitigate residual risk. Every potential hazard has to be handled inside BTM itself.

Taking BTM as an example, BTM’s characteristics:

Two characteristics follow from this background: a. BTM has to be compatible with every B interface protocol defined by the onboard core, and the BTM product has very little say over protocol definitions, including protection fields and data fields; b. because BTM is essentially a front-end subsystem of ATP, most B interface software architectures are protocol-driven. The protocols fall broadly into two types (a client/server response mode with the onboard core, and an active sending mode), and the protocol type can become the deciding factor in the software architecture. As a result, the B interface side is often designed to act on external events, that is, on calls from the onboard core.

Moreover, although the A interface software can keep one consistent processing method, its external inputs are also event-triggered: whether a responder has been passed, and whether a message has been received.

The logic under most B interface protocols and the logic of the A interface are therefore both event-triggered. This is a very distinctive aspect of BTM; among railway signalling safety devices, I personally rank it second only to the responder itself in this respect. It stems both from the event-triggered nature of its inputs and from the need to meet the safety requirements of the responder transmission system (antenna self-check). Compared with ordinary railway signalling safety products, safety certification of the responder and of BTM is somewhat out of the ordinary. Our safety principles always prefer time-triggered over event-triggered designs, because event triggering brings a series of problems. In BTM, a common one is that at least part of the 2-out-of-2 timing (even the self-check cycle) is no longer fixed, and in particular the maximum SDT cannot be determined.
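To make the point about an undeterminable maximum SDT concrete, here is a minimal sketch, written in Python purely as a logic model and not as target code. Everything in it is hypothetical and not taken from any real BTM: the `Channel` class, the per-cycle vote word, and the injected "stuck channel" fault. It measures how long a frozen channel goes unnoticed when the 2-out-of-2 comparison runs every fixed cycle, versus only when a telegram happens to be decoded.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple

Word = Tuple[int, Optional[bytes]]  # (cycle counter, decoded telegram or None)


@dataclass
class Channel:
    """One of two redundant channels (hypothetical model)."""
    cycle: int = 0
    stuck: bool = False              # injected fault: the channel stops executing
    word: Word = (0, None)

    def run_cycle(self, telegram: Optional[bytes]) -> Word:
        if not self.stuck:
            self.cycle += 1
            # A vote word is produced every cycle, telegram or not.
            self.word = (self.cycle, telegram)
        return self.word             # a stuck channel keeps repeating its old word


def detection_latency(fault_at: int, telegram_cycles: Set[int],
                      event_triggered: bool, horizon: int = 1000) -> Optional[int]:
    """Cycles between fault injection and a 2-out-of-2 mismatch, or None if never seen."""
    a, b = Channel(), Channel()
    for n in range(1, horizon + 1):
        if n == fault_at:
            b.stuck = True
        telegram = b"\x01" if n in telegram_cycles else None
        word_a, word_b = a.run_cycle(telegram), b.run_cycle(telegram)
        if event_triggered and telegram is None:
            continue                 # comparison only runs when a telegram arrives
        if word_a != word_b:
            return n - fault_at
    return None


print(detection_latency(10, {5, 400}, event_triggered=False))  # 0
print(detection_latency(10, {5, 400}, event_triggered=True))   # 390
print(detection_latency(10, {5}, event_triggered=True))        # None
```

With the cyclic comparison the fault is caught in the cycle where it occurs. With the event-triggered comparison the latency is whatever the gap to the next telegram happens to be, and with no further telegram it is never caught within the simulated horizon, which is exactly why no maximum SDT can be stated.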

The real issue discussed today:

This is really a core problem of BTM software architecture: how to design the architecture so that it adapts to the requirements of different B interface protocols while keeping the core safety processing, such as 2-out-of-2 voting and self-checks, as time-triggered as possible, that is, not using the uncertain timing of external events as a trigger condition. Generalizing, this is the point I want to make today: railway safety software should not use event-triggered logic as its main software logic or as its core safety logic. A system built on such a structure is not predictable, so it cannot provide timeout protection based on expected timing. Nor can the maximum period of the external events be established, which makes it impossible to determine the SDT of fault diagnosis (self-diagnosis or mutual diagnosis). I have noticed that many colleagues in the railway field seem fond of event-triggered structures in their software logic. Depending on product characteristics and supplementary protective measures, we have accepted some of these designs case by case, but we definitely do not recommend the practice. I bring up BTM because it is the most “innocent” offender here: its inputs really are event-driven. Many other software structures could avoid event-triggered forms entirely. Note also that some developers favour a main loop with a non-fixed cycle. This is not ideal either, because it sacrifices the predictability of software operation and makes it impossible to build in the necessary timeout protection.
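As an illustration of what "time-triggered as far as possible" can look like, the following is a minimal sketch, again a Python logic model rather than target code. It assumes a fixed cycle driven by a timer, and all names and values (`CyclicCore`, `CYCLE_MS`, `CMD_TIMEOUT_CYCLES`, `MAX_MSGS_PER_CYCLE`) are hypothetical. The external event only queues a message; the core samples the queue at cycle boundaries, so the cycle counter becomes a known time base on which timeout protection can be built.

```python
from collections import deque
from dataclasses import dataclass, field

CYCLE_MS = 10             # assumed fixed cycle length
CMD_TIMEOUT_CYCLES = 5    # assumed longest tolerated silence from the onboard core
MAX_MSGS_PER_CYCLE = 4    # assumed bound, keeps the work per cycle predictable


@dataclass
class CyclicCore:
    """Time-triggered core: external events are only sampled at cycle boundaries."""
    inbox: deque = field(default_factory=deque)
    cycle: int = 0
    last_cmd_cycle: int = 0
    safe_state: bool = False
    handled: list = field(default_factory=list)

    def on_external_event(self, msg: str) -> None:
        # Called by the interface layer at any time; it only queues the message.
        self.inbox.append(msg)

    def step(self) -> None:
        """Run exactly one cycle; the caller is the timer, not the message."""
        self.cycle += 1
        if self.safe_state:
            return
        # 1. Sample inputs: a bounded number of queued messages per cycle.
        for _ in range(min(len(self.inbox), MAX_MSGS_PER_CYCLE)):
            self.handled.append((self.cycle, self.inbox.popleft()))
            self.last_cmd_cycle = self.cycle
        # 2. Timeout protection, expressed in cycles of known length.
        if self.cycle - self.last_cmd_cycle > CMD_TIMEOUT_CYCLES:
            self.safe_state = True


core = CyclicCore()
core.on_external_event("STATUS_REQ")
for _ in range(7):
    core.step()
print(core.handled, core.safe_state, core.cycle * CYCLE_MS)
```

In this run the request is handled in cycle 1 and, after six silent cycles, the timeout trips in cycle 7, that is, 70 ms after start. The same check is not expressible when `step` is only called on message arrival, because silence then means the code never runs at all.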

Other safety-critical points in BTM design (not directly related to today’s discussion, just as supplementary information):

BTM’s other safety considerations include the following. First, clock domains. Under most B interface protocols, the key information for judging responder position is the time window of message decoding, so BTM’s time must be consistent with ATP’s, and the decoding instant at the A interface entrance must be stamped with the time issued by ATP. For this logic, how BTM’s own time runs and how accurate it is do not matter; what matters is that it stays consistent with ATP time. BTM nevertheless needs its own clock domain as the baseline for its operation and protection, and sometimes it also has to verify the time issued by ATP. How the internal clock domains are divided, maintained, and linked is therefore a key safety issue that BTM must address. Second, the handling of sidelobes. Third, safe decoding, that is, the safety strategy of the decoding part. Fourth, antenna self-check. This last part is by no means trivial: its strategy and implementation bear on the overall safety functions of the responder transmission system, including the unavailability of responder crosstalk O1 and O2 and the failure rate of responder group detection, the latter often associated with an SRAC requirement not to casually disable the antenna self-check results.
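The clock-domain point can be sketched as follows. This is only a Python logic model under my own assumptions: the `ClockBridge` class, the tolerance `DRIFT_TOL_MS`, and the maximum sync age `SYNC_MAX_AGE_MS` are all hypothetical, and no real B interface protocol is implied. The local clock remains the baseline; ATP time is linked to it by an offset, checked for plausibility against local elapsed time, and used to stamp decoded telegrams.

```python
from dataclasses import dataclass
from typing import Optional

DRIFT_TOL_MS = 20        # assumed tolerance between ATP and local elapsed time
SYNC_MAX_AGE_MS = 1000   # assumed maximum age of the last accepted ATP time


@dataclass
class ClockBridge:
    """Links the ATP time domain to the BTM-local clock domain (hypothetical)."""
    offset_ms: Optional[int] = None          # ATP time minus local time at last sync
    last_sync_local_ms: Optional[int] = None

    def on_atp_time(self, atp_ms: int, local_ms: int) -> bool:
        """Accept a time message from ATP; return False if it is implausible."""
        new_offset = atp_ms - local_ms
        if self.offset_ms is not None and abs(new_offset - self.offset_ms) > DRIFT_TOL_MS:
            return False     # ATP time jumped relative to the local clock: reject
        self.offset_ms = new_offset
        self.last_sync_local_ms = local_ms
        return True

    def stamp(self, local_ms: int) -> Optional[int]:
        """ATP-domain timestamp for a telegram decoded at local_ms, or None if stale."""
        if self.offset_ms is None or local_ms - self.last_sync_local_ms > SYNC_MAX_AGE_MS:
            return None
        return local_ms + self.offset_ms


bridge = ClockBridge()
bridge.on_atp_time(atp_ms=50_000, local_ms=1_000)
print(bridge.stamp(1_250))   # 50250
```

Whether a rejected ATP time should lead to a safe state, a resynchronization procedure, or a report to ATP is a design decision this sketch deliberately leaves open.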

Several general SRACs applicable to BTM have been proposed to define its applicable scenarios and its functional position within ATP. The second item below is actually intended to compensate for the non-fixed maximum interval of 2-out-of-2 voting in the A interface part described above, in effect imposing a maximum SDT value (if no valid message is decoded, the corresponding 2-out-of-2 vote is indeed hard to run).

1. BTM products communicate and implement their functions according to the B interface protocols defined by the onboard host unit. The onboard host unit must therefore ensure that the overall safety of the ATP system is maintained once a BTM product implemented to those protocols is integrated, and that the information it sends is trustworthy. (This again emphasizes the primary position of ATP as the protocol definer and the processing responsibilities that come with it; BTM can only process according to the existing protocols.)

2. BTM products must meet the maximum time interval for continuous communication with adjacent responders as specified in Subset-091 regarding the mission profile. This assumption implicitly covers the requirement for the longest time the BTM product can maintain operational status while the train can stop on the track. (In fact, this is to compensate for the uncertainty of the maximum SDT in the A interface logic).

3. The products apply to the responder transmission system defined by UNISIG SUBSET-036 and SUBSET-091 / TB/T 3485-2017. If they are applied in other scenarios whose technical requirements are inconsistent with these specifications/standards, a thorough and complete safety analysis must be carried out before application. The list of SIL4 functions of the responder transmission system can be found in TB/T 3485-2017 Appendix F.1 and in Section 4.4.2 of SUBSET-036. (Definitions of operational and environmental scenarios.)
