Why Lightweight Kernel Abstractions for RTOS Are a Mistake

As an RTOS pioneer and innovator, the author holds a very clear view. Many engineers assume that all RTOS options are much alike and believe that a lightweight kernel abstraction can meet today’s main demands for safe and reliable systems. This idea is fundamentally flawed. No company or project team should pick an RTOS casually and hop from one RTOS to another; instead, they should select a powerful RTOS and commit to it.

A real-time operating system (RTOS) is not just an application programming interface (API), as some libraries are. Yet advocates of RTOS abstraction seem to think that the API is the most important aspect of an RTOS. I first heard the misconception that “all RTOSes are the same” 20 years ago. Since then, more and more people seem to be making the same mistake and leading developers astray.

An RTOS abstraction is essentially the lowest common denominator of the RTOSes it covers. It can therefore offer only a lightweight RTOS API, and it loses many powerful features of the underlying RTOS. Projects and companies should not flit between RTOSes like butterflies in a flower garden. Being able to switch may seem attractive in theory, but it is not a good plan.

Reasons to Choose an RTOS Carefully

Projects and companies should select a powerful RTOS and stick with it. Some RTOSes are very well implemented, while others are poor. If an RTOS lacks a required feature, the application must supply it with new, unproven code, and a great deal of effort is wasted on redesign, implementation, and testing. Why would anyone want to do that?

For example, some RTOSes provide only non-recursive mutexes, which are the worst choice for real-time embedded systems. If a task tries to get a mutex that it already owns, it blocks on itself and stops running. The extra RTOS code needed to provide recursive mutexes is trivial. Likewise, some RTOSes provide a timeout on every operation that can cause a task to wait, while others do not. In the mutex example, without a timeout the task stops indefinitely, potentially causing system failure; with a timeout, the task resumes and reports the problem in a timely manner.
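The behavior described above can be sketched in C. This is a minimal single-threaded model under stated assumptions, not the API of any particular RTOS: the names rmutex_t, rmutex_get, and rmutex_release are hypothetical, task ids are assumed to be non-negative integers, and because the model has no scheduler, a contended request returns the timeout status immediately instead of actually waiting.

```c
#include <stddef.h>

/* Hypothetical model of a recursive mutex with a timeout status. */
enum { NO_OWNER = -1 };

typedef enum { MTX_OK, MTX_TIMEOUT, MTX_NOT_OWNER } mtx_status_t;

typedef struct {
    int      owner;  /* id of the owning task, or NO_OWNER */
    unsigned nest;   /* how many times the owner has taken the mutex */
} rmutex_t;

void rmutex_init(rmutex_t *m)
{
    m->owner = NO_OWNER;
    m->nest  = 0;
}

/* Take the mutex for `task`. A real kernel would suspend the caller for up
 * to `timeout_ticks`; this model cannot wait, so contention is reported as
 * a timeout right away. */
mtx_status_t rmutex_get(rmutex_t *m, int task, unsigned timeout_ticks)
{
    (void)timeout_ticks;
    if (m->owner == NO_OWNER) {      /* free: take it */
        m->owner = task;
        m->nest  = 1;
        return MTX_OK;
    }
    if (m->owner == task) {          /* recursive get: count it, no block */
        m->nest++;
        return MTX_OK;
    }
    return MTX_TIMEOUT;              /* owned by another task */
}

/* Release one level of nesting; the mutex becomes free at level zero. */
mtx_status_t rmutex_release(rmutex_t *m, int task)
{
    if (m->owner != task)
        return MTX_NOT_OWNER;
    if (--m->nest == 0)
        m->owner = NO_OWNER;
    return MTX_OK;
}
```

The point of the sketch is that the recursive case costs one comparison and one counter, and that a caller who receives MTX_TIMEOUT has a chance to report the problem rather than hang.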

An important goal of an RTOS is to help developers avoid errors. For example, some RTOSes check every parameter of every service call, while others do little or no checking. Parameter checking reduces debugging time and, in delivered systems, helps guard against latent defects, hacking, and soft errors.
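As a rough illustration of such checking, the following C sketch validates the handle passed to a semaphore signal call and routes every failure through one error-reporting function. All names here (sem_cb_t, sem_signal, report_error, SEM_MAGIC) are assumptions made for the example and do not come from any specific RTOS.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ERR_NONE,
    ERR_NULL_HANDLE,
    ERR_BAD_HANDLE,
    ERR_OVERFLOW
} err_t;

/* Tag stored in every live semaphore control block (hypothetical value). */
#define SEM_MAGIC 0x53454D41u

typedef struct {
    uint32_t magic;  /* SEM_MAGIC while the semaphore exists */
    unsigned count;  /* current signal count */
    unsigned limit;  /* maximum allowed count */
} sem_cb_t;

static err_t    last_error  = ERR_NONE;
static unsigned error_count = 0;

/* Central error manager: a single place to record and act on failures. */
static err_t report_error(err_t e)
{
    last_error = e;
    error_count++;
    return e;
}

/* Signal a semaphore, checking the handle before touching it. */
err_t sem_signal(sem_cb_t *s)
{
    if (s == NULL)
        return report_error(ERR_NULL_HANDLE);
    if (s->magic != SEM_MAGIC)          /* stale or corrupted handle */
        return report_error(ERR_BAD_HANDLE);
    if (s->count >= s->limit)
        return report_error(ERR_OVERFLOW);
    s->count++;
    return ERR_NONE;
}
```

A bad handle is rejected before any field of the control block is modified, which is what keeps a wild pointer from silently corrupting kernel state.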

Many RTOSes I have studied exhibit problematic behavior. For example, one well-known RTOS does not requeue a waiting task when its priority changes, which is contrary to what one would logically expect. In practice, a task’s priority is often raised so that it runs sooner when the event it is waiting for occurs. Here, the programmer may be puzzled as to why raising the task’s priority does not improve its performance. In addition, when a queue is deleted, this RTOS does not automatically release the tasks waiting on it. Instead, it advises developers: “If there are tasks waiting in the queue, do not delete this queue.” This is a prime example of failing to help developers avoid errors; who reads such tedious details?
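The requeueing behavior one would expect can be sketched as follows. This is an illustrative model only: the wait queue is a singly linked list kept in descending priority order, a larger number is assumed to mean higher priority, and the names task_t, waitq_t, and task_set_priority are hypothetical.

```c
#include <stddef.h>

typedef struct task {
    int          id;
    int          pri;   /* larger value = higher priority (assumption) */
    struct task *next;
} task_t;

typedef struct {
    task_t *head;       /* highest-priority waiter is first */
} waitq_t;

/* Insert in descending priority order; equal priorities stay FIFO. */
void waitq_insert(waitq_t *q, task_t *t)
{
    task_t **pp = &q->head;
    while (*pp != NULL && (*pp)->pri >= t->pri)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
}

/* Unlink t if it is in the queue; returns 1 if it was removed. */
int waitq_remove(waitq_t *q, task_t *t)
{
    for (task_t **pp = &q->head; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == t) {
            *pp = t->next;
            t->next = NULL;
            return 1;
        }
    }
    return 0;
}

/* Change a task's priority and, if it is waiting on q, move it to the
 * position that matches the new priority. */
void task_set_priority(waitq_t *q, task_t *t, int new_pri)
{
    int was_waiting = (q != NULL) && waitq_remove(q, t);
    t->pri = new_pri;
    if (was_waiting)
        waitq_insert(q, t);
}
```

Without the remove-and-reinsert step, the task would keep its old place in the queue, which is exactly the surprising behavior described above.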

In another well-known RTOS, the call that frees a block takes a pointer to the block but does not specify which heap the block belongs to. This is a subtle issue: what happens if the pointer is wrong? A more rigorous implementation would require the heap to be specified so that the pointer can be range-checked before it is used. This is another way to keep programmers from making mistakes. The same free function also cannot detect double-free errors, a common programming mistake. This RTOS also implements a message queue flush service that returns a success flag to tasks waiting to put their messages in the queue. This seems counterintuitive. Shouldn’t these tasks be told that their messages were not delivered so they can take corrective action?
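A free call that names its heap can be sketched like this. To keep the example short, the heap is assumed to consist of fixed-size blocks with a separate in-use table; a real heap with variable-size chunks would need more bookkeeping. The names heap_t, heap_alloc, and heap_free are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLK_SIZE  32u
#define BLK_COUNT 8u

typedef enum { HEAP_OK, HEAP_BAD_PTR, HEAP_DOUBLE_FREE } heap_status_t;

typedef struct {
    uint8_t mem[BLK_COUNT][BLK_SIZE];  /* block storage */
    bool    in_use[BLK_COUNT];         /* allocation state per block */
} heap_t;

/* Return the first free block of heap h, or NULL if none is left. */
void *heap_alloc(heap_t *h)
{
    for (size_t i = 0; i < BLK_COUNT; i++) {
        if (!h->in_use[i]) {
            h->in_use[i] = true;
            return h->mem[i];
        }
    }
    return NULL;
}

/* Free p back to heap h. Because the heap is named, the pointer can be
 * range-checked and alignment-checked before anything is modified. */
heap_status_t heap_free(heap_t *h, void *p)
{
    uintptr_t lo = (uintptr_t)&h->mem[0][0];
    uintptr_t a  = (uintptr_t)p;

    if (a < lo || a >= lo + sizeof h->mem)
        return HEAP_BAD_PTR;           /* not inside this heap */
    if ((a - lo) % BLK_SIZE != 0)
        return HEAP_BAD_PTR;           /* not the start of a block */

    size_t i = (size_t)((a - lo) / BLK_SIZE);
    if (!h->in_use[i])
        return HEAP_DOUBLE_FREE;       /* block is already free */
    h->in_use[i] = false;
    return HEAP_OK;
}
```

With only a pointer and no heap argument, neither the range check nor the double-free check could be made this cheaply.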

Another concern is how interrupt service routines (ISRs) are handled. Most RTOSes allow certain kernel services, or modified versions of them, to be called directly from ISRs. This contradicts good programming practice, which is to keep ISRs as short as possible so that they complete promptly even under heavy load. Failing to do so can result in missed interrupts and erroneous system behavior.

Furthermore, allowing ISRs to call kernel services greatly increases the attack surface available to hackers. A small ISR may be free of vulnerabilities, but kernel services cannot be guaranteed to be. In addition, deferred interrupt processing is usually performed by tasks, which are typically signaled from ISRs. The problem with this approach is that the task may be blocked by other tasks, which delays completion of interrupt processing. A better approach is for the ISR to invoke a link service routine (LSR), which runs after all ISRs and before all tasks. This not only gives dependable performance but also moves most of the ISR code into the LSR.
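The ISR-to-LSR handoff can be modeled roughly as below. This is a host-side sketch, not the smx implementation: lsr_invoke, lsr_run_all, and the UART routines are hypothetical names, the queue length is arbitrary, and the interrupt masking that real hardware would need around the queue updates is omitted.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*lsr_fn_t)(uint32_t parm);

#define LSRQ_LEN 8u

typedef struct {
    lsr_fn_t fn;
    uint32_t parm;
} lsr_entry_t;

static lsr_entry_t lsrq[LSRQ_LEN];   /* ring buffer of pending LSRs */
static unsigned lsrq_head, lsrq_tail, lsrq_count;

/* Called from an ISR: constant-time enqueue. Returns 0 if rejected. */
int lsr_invoke(lsr_fn_t fn, uint32_t parm)
{
    if (fn == NULL || lsrq_count == LSRQ_LEN)
        return 0;
    lsrq[lsrq_tail].fn   = fn;
    lsrq[lsrq_tail].parm = parm;
    lsrq_tail = (lsrq_tail + 1u) % LSRQ_LEN;
    lsrq_count++;
    return 1;
}

/* Called by the scheduler after the last ISR has returned and before any
 * task resumes. Runs pending LSRs in FIFO order; returns how many ran. */
unsigned lsr_run_all(void)
{
    unsigned ran = 0;
    while (lsrq_count > 0) {
        lsr_entry_t e = lsrq[lsrq_head];
        lsrq_head = (lsrq_head + 1u) % LSRQ_LEN;
        lsrq_count--;
        e.fn(e.parm);
        ran++;
    }
    return ran;
}

/* Example: the ISR only hands the byte over; the LSR does the work. */
static uint32_t rx_sum;

static void uart_rx_lsr(uint32_t byte)
{
    rx_sum += byte;
}

void uart_rx_isr(uint32_t byte)
{
    (void)lsr_invoke(uart_rx_lsr, byte);
}
```

The ISR body shrinks to a single enqueue, and since LSRs run ahead of every task, no task can delay the deferred work.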

Many RTOSes allow task control blocks (TCBs) and other control blocks to grow unchecked. Control blocks should be as compact as possible; otherwise, the number of tasks or other RTOS objects may have to be limited, leading to suboptimal performance.

Security

Believing that secure systems can be built on a lightweight kernel abstraction is wishful thinking. Our high-security RTOS, SecureSMX, uses almost every feature of the feature-rich smx kernel. A highly secure system needs at least the following kernel features:

● Comprehensive checks of service call parameters, reporting errors to a central error manager for appropriate actions (such as partition restarts and reporting to the security center).

● Recording all operations, or a selected subset of them, in an event buffer that is regularly uploaded to the security center, where it is checked for illegal intrusions or unusual activity indicating potential defects.

● Support for multiple heaps, so that isolated partitions needing heap services can have their own local heaps. (A heap shared between partitions gives hackers an excellent path from one partition into another.)

● Heap allocation of protected data, protected blocks, and task stacks as Cortex v7M and v8M MPU regions.

● A message exchange system for multi-entry sites that can support:

➤ Variable size messages

➤ Message priorities (to ensure higher priority messages are served first)

➤ Passing message priorities to server tasks (controlling their runtime)

➤ Response exchange specifications (for returning results)

● Protected messages that carry their own region information. This information is loaded into the MPU and the server’s MPA so the server can process the message. Such messages are the foundation for strongly isolated ports between partitions.

● An effective runtime limit system, not just time slicing.

● Recursive mutexes with priority inheritance, ceiling priority, or both.

● One-shot tasks that share stacks from a common pool. These tasks are very useful because partitioning always leads to more tasks, most of which perform a simple function (such as handling messages from ports) and then wait for the next request without needing a stack.

● An effective method to minimize ISR code and perform delayed interrupt handling within isolated partitions.
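The message exchange items in the list above (message priorities, passing the priority to the server task, and a reply exchange) can be sketched as follows. This is a simplified model under stated assumptions, not the smx or SecureSMX API: msg_t, xchg_t, server_task_t, xchg_send, and xchg_receive are hypothetical names, and a larger number is assumed to mean higher priority.

```c
#include <stddef.h>

struct xchg;                        /* forward declaration */

typedef struct msg {
    int          pri;               /* larger value = served first */
    size_t       len;               /* variable message size */
    const void  *data;
    struct xchg *reply;             /* where the server returns results */
    struct msg  *next;
} msg_t;

typedef struct xchg {
    msg_t *head;                    /* highest-priority message first */
} xchg_t;

typedef struct {
    int pri;                        /* current priority of the server task */
} server_task_t;

/* Queue a message in descending priority order, FIFO among equals. */
void xchg_send(xchg_t *x, msg_t *m)
{
    msg_t **pp = &x->head;
    while (*pp != NULL && (*pp)->pri >= m->pri)
        pp = &(*pp)->next;
    m->next = *pp;
    *pp = m;
}

/* Take the highest-priority message and let the server run at that
 * message's priority. Returns NULL, leaving the server's priority
 * unchanged, when the exchange is empty. */
msg_t *xchg_receive(xchg_t *x, server_task_t *srv)
{
    msg_t *m = x->head;
    if (m != NULL) {
        x->head = m->next;
        m->next = NULL;
        srv->pri = m->pri;
    }
    return m;
}
```

Because the server inherits the priority of the message it is handling, an urgent client request is not processed at the pace of a low-priority server task.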

It should be noted that a kernel that helps programmers avoid mistakes also helps reduce vulnerabilities. Unfortunately, neither static code analyzers nor other code analysis tools can detect RTOS usage vulnerabilities.

Adding all of the above features on top of a lightweight kernel abstraction would produce a large amount of new, unproven code, which would require redesign and testing and could take years!

Conclusion

The idea that a lightweight kernel abstraction can meet today’s demands for safe and reliable systems is fundamentally flawed. Projects need to examine the engine beneath the RTOS API that drives their applications and determine whether its capabilities are robust enough; otherwise, the result could be budget overruns, project cancellations, or, worse, lawsuits. Security issues should never be taken lightly.

Note: The terms “kernel” and “RTOS” can be used interchangeably in this article.

Author: Ralph Moore, Source: Electronic Engineering Magazine

Original reference: rtos-abstractions-are-wrong.
