The ‘No Return’ Path of Self-Developed Bootloader for STM32: An OTA Push Turns 100,000 Online Devices into Bricks, Rescued by the Reserved ‘Dual Partition’ Scheme

We were working on a smart streetlight control system with more than 100,000 terminals deployed nationwide, each built around an STM32F429. The terminals periodically receive OTA firmware updates over a 4G network and reboot automatically after the upgrade, without users noticing anything.

The initial OTA tests went smoothly, and everyone gradually got used to the automatic update process. Then one day we pushed a new main program update, and 100,000 devices “lost connection” almost simultaneously within minutes.

At first, we thought the server had crashed, but it had not. Upon checking the device logs, we discovered: The main program was successfully written, but did not start.

At that moment, we realized that there might be an issue with the Bootloader.

Self-Developed Bootloader: Easier Said Than Done…

We wrote the Bootloader for this project ourselves rather than starting from ST’s official IAP example, mainly to support differential updates and encryption verification. The tasks of the Bootloader are:

  • Determine if an upgrade is needed after power-up
  • If needed, pull the new program from external Flash or the server and write it into the main program area
  • Then jump to the main program to run

The logic is not complicated. We used single Bank Flash, with the Bootloader in the first few KB and the main program in the later area. Before jumping to the program, we modify the offset address of the interrupt vector table:

SCB->VTOR = APP_START_ADDRESS;

This line of code points the interrupt vector table at the main program area instead of the Bootloader area; otherwise, interrupts raised after entering the main program would still be dispatched through the Bootloader’s vector table, and the main program’s own handlers would never run.
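Before VTOR is touched, it can help to confirm that the target area actually contains an image. The following is a minimal sketch, not the code from our Bootloader: it checks that the first two words of the application's vector table (initial stack pointer and reset handler) are plausible. The memory ranges and the name `app_vectors_plausible` are assumptions for illustration; the real limits belong in the linker script of the part in use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed memory map for illustration only; take the real ranges from
 * the linker script of the actual device. */
#define RAM_START    0x20000000u
#define RAM_END      0x20030000u  /* one past the last byte, 192 KB SRAM assumed */
#define FLASH_START  0x08000000u
#define FLASH_END    0x08100000u  /* one past the last byte, 1 MB Flash assumed */

/* Returns true if the first two vector table words look like a real image.
 * Word 0 is the initial stack pointer, word 1 is the reset handler. */
bool app_vectors_plausible(uint32_t initial_sp, uint32_t reset_handler)
{
    /* The stack pointer must be word aligned and lie inside RAM.
     * It may equal RAM_END because the stack grows downward. */
    bool sp_ok = (initial_sp > RAM_START) && (initial_sp <= RAM_END) &&
                 ((initial_sp & 0x3u) == 0u);

    /* Cortex-M executes Thumb code only, so bit 0 of a handler address is set. */
    bool pc_ok = ((reset_handler & 0x1u) == 1u) &&
                 (reset_handler >= FLASH_START) && (reset_handler < FLASH_END);

    return sp_ok && pc_ok;
}
```

On the target, the two words would be read from `APP_START_ADDRESS` and `APP_START_ADDRESS + 4`. An erased area reads as 0xFFFFFFFF in both words and is rejected, so the Bootloader never jumps into empty Flash.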

In our previous tests, this part always worked fine, with smooth jumps and normal interrupts. But after this OTA, all devices got stuck in the Bootloader, and after setting VTOR, the MCU did not respond.

The OTA Incident Brought a “Silent Disaster”

The root of the problem was that we had changed some startup-related initialization logic in the new version of the main program, accidentally clearing a segment of RAM used by the Bootloader.

This segment of RAM stored the upgrade status flag, which the Bootloader relied on to determine whether to jump. But it was cleared, so the Bootloader thought “the upgrade was not completed” every time it restarted, thus repeatedly entering the upgrade process.

Worse still, there was a check in the upgrade logic: if the upgrade failed more than three times, it would stop in the upgrade state, waiting for manual intervention.

So guess what? All 100,000 devices were stuck in the Bootloader’s upgrade waiting state and could not be remotely awakened. This meant they had all “turned into bricks”.
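One way to make a status flag robust against this kind of accidental clearing is to store it with a marker and a checksum, so that a wiped record is recognized as invalid instead of being read as "upgrade not completed". The sketch below only illustrates that idea and is not what our Bootloader did at the time. All names (`boot_state_t`, `boot_decide`, `BOOT_STATE_MAGIC`, `BOOT_MAX_FAILS`) are hypothetical, and falling back after repeated failures instead of waiting for manual intervention is an assumed policy.

```c
#include <stdbool.h>
#include <stdint.h>

#define BOOT_STATE_MAGIC 0xB007F1A6u  /* arbitrary marker (hypothetical) */
#define BOOT_MAX_FAILS   3u           /* retry limit (hypothetical) */

typedef enum {
    BOOT_ACTION_RUN_APP,     /* nothing pending, or the record is unreadable */
    BOOT_ACTION_DO_UPGRADE,  /* a valid record requests an upgrade */
    BOOT_ACTION_FALLBACK     /* too many failures: boot the last good image */
} boot_action_t;

typedef struct {
    uint32_t magic;
    uint32_t upgrade_pending;
    uint32_t fail_count;
    uint32_t check;          /* inverted XOR of the three fields above */
} boot_state_t;

static uint32_t boot_state_checksum(const boot_state_t *s)
{
    return ~(s->magic ^ s->upgrade_pending ^ s->fail_count);
}

/* Stamp the record; call this after changing any field. */
void boot_state_seal(boot_state_t *s)
{
    s->magic = BOOT_STATE_MAGIC;
    s->check = boot_state_checksum(s);
}

/* Decide what to do at power-up. A zeroed or corrupted record is treated
 * as "no upgrade requested" rather than "upgrade incomplete". */
boot_action_t boot_decide(const boot_state_t *s)
{
    bool valid = (s->magic == BOOT_STATE_MAGIC) &&
                 (s->check == boot_state_checksum(s));
    if (!valid || s->upgrade_pending == 0u) {
        return BOOT_ACTION_RUN_APP;
    }
    if (s->fail_count >= BOOT_MAX_FAILS) {
        return BOOT_ACTION_FALLBACK;
    }
    return BOOT_ACTION_DO_UPGRADE;
}
```

With this layout, a record that the main program zeroes by mistake fails both the marker and the checksum test, so the Bootloader would simply start the application.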

Dual Bank Storage: Our Last Lifeline

In fact, we had reserved a “dual partition” architecture early in the project, but due to tight device resources in the early stages, it was never utilized.

This scheme is:

  • Bootloader retains a section to store a “backup program”
  • OTA firmware is written to the backup area, not directly overwriting the currently running main program
  • The running pointer is switched to the new area only after verification passes
  • If startup fails, the Bootloader can “roll back” to the old version

This is like a dual-boot system on a computer, where if one fails, you can switch back.
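The switch-and-rollback logic described above can be sketched as a small state machine. This is a minimal illustration under assumptions, not our production code: the metadata layout, the trial boot limit, and the names `slot_meta_t`, `slot_select_for_boot`, and `slot_confirm_trial` are all hypothetical, and the metadata is assumed to live somewhere that survives resets.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TRIAL_BOOTS 3u   /* hypothetical limit on unconfirmed boots */

typedef struct {
    uint8_t confirmed_slot;  /* slot known to boot correctly: 0 or 1 */
    uint8_t trial_pending;   /* 1 while a new image awaits confirmation */
    uint8_t trial_boots;     /* boots attempted on the trial image */
} slot_meta_t;

static uint8_t other_slot(uint8_t slot)
{
    return (uint8_t)(slot ^ 1u);
}

/* OTA data always goes to the slot that is NOT confirmed. */
uint8_t slot_for_download(const slot_meta_t *m)
{
    return other_slot(m->confirmed_slot);
}

/* Called by the bootloader on every reset; returns the slot to boot.
 * trial_image_ok is the result of the CRC or signature check. */
uint8_t slot_select_for_boot(slot_meta_t *m, bool trial_image_ok)
{
    if (m->trial_pending && trial_image_ok &&
        m->trial_boots < MAX_TRIAL_BOOTS) {
        m->trial_boots++;
        return other_slot(m->confirmed_slot);
    }
    /* Rollback: bad image, or it never confirmed itself in time. */
    m->trial_pending = 0u;
    m->trial_boots = 0u;
    return m->confirmed_slot;
}

/* Called by the new application once it is up and has reached the server. */
void slot_confirm_trial(slot_meta_t *m)
{
    if (m->trial_pending) {
        m->confirmed_slot = other_slot(m->confirmed_slot);
        m->trial_pending = 0u;
        m->trial_boots = 0u;
    }
}
```

The key design choice in this sketch is that the new image has to confirm itself. If it crashes or cannot come online, the trial counter runs out and the Bootloader returns to the old slot without any remote command.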

At that time, the STM32F429 we were using only had single Bank Flash, so we artificially divided the Flash into two areas:

#define APP_SLOT_1_ADDR  0x08020000
#define APP_SLOT_2_ADDR  0x08060000

The Bootloader records which area is currently running, and during the upgrade, it writes to the other area. Before jumping, it verifies the CRC or digital signature of the target area and only then switches.
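As an illustration of the CRC step, here is a minimal software CRC-32 (the common reflected variant with polynomial 0xEDB88320) and a check against an expected value. It is a sketch and not our `verify_app`: the function names are hypothetical, and where the expected CRC and image length are stored (for example in an image header) is left as an assumption.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). Slow but small, which suits a bootloader. */
uint32_t crc32_compute(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1u) ? ((crc >> 1) ^ 0xEDB88320u) : (crc >> 1);
        }
    }
    return ~crc;
}

/* Compare the CRC of an image against the value shipped with it. */
bool image_crc_ok(const uint8_t *image, size_t len, uint32_t expected_crc)
{
    return crc32_compute(image, len) == expected_crc;
}
```

A CRC only catches accidental corruption, such as an interrupted download or a bad Flash write. It does not prove who produced the image, which is why a digital signature is the stronger of the two options. Note also that the STM32 hardware CRC unit computes a different CRC-32 variant by default, so its result will not match this function without adapting the server-side calculation.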

Although the main program area was damaged in this incident, the backup area was still intact. We quickly updated the Bootloader to force it to jump to the backup partition:

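/* Forced recovery path: boot the backup slot only if its image verifies. */
if (verify_app(APP_SLOT_2_ADDR)) {
    SCB->VTOR = APP_SLOT_2_ADDR;   /* use the backup image's vector table */
    jump_to_app(APP_SLOT_2_ADDR);
}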

Then we arranged a “rescue OTA”: using low-level commands sent through the 4G module, we rebooted the devices so that they started from the backup partition. This operation also carried significant risks, but at that point we had no choice.

Fortunately, all devices had a reserved low-level communication channel, which, although slow, was relatively stable. It took us two full days and nights to rescue most of the devices.

RDP and WRP Almost Caused Us More Trouble

When we initially designed the Bootloader, we enabled RDP read protection and partial Flash write protection (WRP) to prevent firmware tampering.

But this also brought new problems:

  • RDP was set too high, so it could not be unlocked during the OTA process
  • The WRP configuration was inflexible, so new firmware could not be written to the new area

Fortunately, we had implemented logic to unlock and rewrite the WRP before the OTA; otherwise, we wouldn’t have been able to write to the backup partition this time.
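To know which write protection bits have to be cleared before an OTA, the Bootloader needs to map a partition's address range onto Flash sectors. The sketch below assumes the 1 MB single-bank layout (four 16 KB sectors, one 64 KB sector, then seven 128 KB sectors); other densities or dual-bank configurations differ, so check the reference manual for the actual part. The function names are hypothetical, and the resulting mask has one bit per sector, which still has to be compared against the sector constants expected by the Flash driver in use.

```c
#include <stdint.h>

#define FLASH_BASE_ADDR   0x08000000u
#define FLASH_SIZE_BYTES  0x00100000u  /* 1 MB single bank assumed */

/* Returns the sector index (0..11) containing addr, or -1 if outside Flash. */
int flash_sector_of(uint32_t addr)
{
    if (addr < FLASH_BASE_ADDR || addr >= FLASH_BASE_ADDR + FLASH_SIZE_BYTES) {
        return -1;
    }
    uint32_t off = addr - FLASH_BASE_ADDR;
    if (off < 0x10000u) {
        return (int)(off / 0x4000u);      /* sectors 0-3, 16 KB each */
    }
    if (off < 0x20000u) {
        return 4;                         /* sector 4, 64 KB */
    }
    return (int)(off / 0x20000u) + 4;     /* sectors 5-11, 128 KB each */
}

/* Returns a mask with one bit set per sector touched by [start, start+len).
 * Returns 0 if the range is empty or leaves the Flash area. */
uint32_t flash_sector_mask(uint32_t start, uint32_t len)
{
    if (len == 0u) {
        return 0u;
    }
    int first = flash_sector_of(start);
    int last = flash_sector_of(start + len - 1u);
    if (first < 0 || last < 0) {
        return 0u;
    }
    uint32_t mask = 0u;
    for (int s = first; s <= last; s++) {
        mask |= (1u << s);
    }
    return mask;
}
```

Under this assumed layout, APP_SLOT_1_ADDR (0x08020000) starts at sector 5 and APP_SLOT_2_ADDR (0x08060000) at sector 7, so a 256 KB image in slot 2 covers sectors 7 and 8.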

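In the future, we will set the RDP to Level 1, which still blocks Flash reads through the debug interface but lets the firmware itself reprogram Flash for OTA upgrades. Level 2, by contrast, is permanent and cannot be lowered again.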

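FLASH_OBProgramInitTypeDef OBInit = {0};
OBInit.OptionType = OPTIONBYTE_RDP;
OBInit.RDPLevel = OB_RDP_LEVEL_1;
HAL_FLASH_OB_Unlock();            /* option bytes are locked after reset */
HAL_FLASHEx_OBProgram(&OBInit);
HAL_FLASH_OB_Launch();            /* commit the new option byte values */
HAL_FLASH_OB_Lock();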

This piece of code needs to run before every OTA to ensure that the device can enter the upgrade process normally.

Having Walked Through the “Brick Hell”, We Now Understand What a Bootloader Is

Previously, we always thought of the Bootloader as a small module that could be burned in and forgotten. Now we can no longer afford to think that way.

After this incident, our requirements for the Bootloader are even higher than for the main program:

  • OTA must be “dual partition” and cannot directly overwrite the running area
  • All jumps must verify CRC or signature
  • Startup failures must have a rollback mechanism to automatically revert to the previous version
  • All OTA operations must check the RDP/WRP status

We even set up dedicated version control and a staged (gray) release process for the Bootloader itself, so that it can always act as the last line of defense when something goes wrong.
