Debugging Issues More Challenging than Audio Stuttering

For programmers writing application software, memory corruption (one variable accidentally overwriting another) is among the hardest problems to debug.
For operating system developers, application-layer issues are comparatively simple, because a problem in an application stays within a single process and does not affect the entire system.
This year, while preparing the system image for Ulan, I ran into problems with the audio subsystem. After spending a lot of time resolving them, I wrote an article titled “Debugging Issues More Challenging than Memory Corruption”.

In recent months, I have been working on a project codenamed “Ziyou”. Its goal is to Enable UEFI on Ulan (EUU). Ulan originally used U-Boot as its kernel loader. U-Boot is popular in the ARM ecosystem because it is compact, but compared with UEFI it lacks many advanced features, such as dynamic module loading and an interactive GUI.
The Ziyou project is based on the open-source edk2-rk3588 project:
https://github.com/edk2-porting/edk2-rk3588
This open-source project aims to develop EDK2-based UEFI firmware for the RK3588 platform. EDK2 is an open-source UEFI implementation originally released by Intel and used primarily in the x86 ecosystem.
The first obstacle in the Ziyou project was slow access to GitHub. Because the project contains many sub-projects, I lost some time filling in the missing sub-project files.
The development environment for the Ziyou project is the Ulan codebase, with the GNU toolchain for compilation and VS Code for editing. For a project of EDK2’s size, building on Ulan is quick and straightforward: after the build command is issued, compilation, linking, and image generation complete in just a few seconds.
5824+1 records in
5824+1 records out
5964288 bytes (6.0 MB, 5.7 MiB) copied, 0.0728761 s, 81.8 MB/s
+ cp /gewu/edk2-rk3588/workspace/RK3588_NOR_FLASH.img /gewu/edk2-rk3588/
+ set +x
Build done: RK3588_NOR_FLASH.img
Initially, we used Rockchip’s factory tool (rkdevtool) to flash the compiled image onto Ulan’s NOR flash, but it was slow and occasionally failed. We therefore switched to writing the image to an SD card and booting from that, which improved efficiency significantly.
The edk2-rk3588 documentation recommends balenaEtcher for writing the SD card. However, that tool is quite large and is built with Node.js, which seems like overkill for such a simple interface.
So I tried the lightweight Rufus instead, which worked well and was more stable than balenaEtcher. This improved efficiency further.
Because we load the UEFI image from the SD card for testing, we cannot use the debugging tool at the same time, so serial port output became our primary debugging method.
We tested with a debug build, which produces a lot of debug information. The serial port runs at 1,500,000 baud, which is very fast: as soon as the power button is pressed, debug information starts pouring out of the serial port.
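To put that rate in perspective, here is a back-of-the-envelope calculation. It assumes the common 8N1 framing (1 start bit, 8 data bits, 1 stop bit, so 10 bits on the wire per character), which this article does not state; the function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Characters per second a UART can carry at a given baud rate.
 * bits_per_frame is 10 for 8N1 framing (an assumption, not from the article). */
static uint32_t uart_bytes_per_second(uint32_t baud, uint32_t bits_per_frame) {
  assert(bits_per_frame != 0);
  return baud / bits_per_frame;
}

/* Time on the wire for one character, in whole nanoseconds (rounded down). */
static uint64_t uart_ns_per_byte(uint32_t baud, uint32_t bits_per_frame) {
  assert(baud != 0);
  return (1000000000ULL * bits_per_frame) / baud;
}
```

Under that assumption, uart_bytes_per_second(1500000, 10) gives 150000 characters per second, or a little under 7 microseconds per character.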
Skipping over the tasks that went smoothly: a strange bug appeared this Tuesday. When booting with the new image, the debug output on the serial port stopped abruptly after a few seconds.

The debug output often stopped just after the RkSdmmcDxe driver was loaded.
BaseClkFreq 52000KHz Divisor 65 ClockFreq 400Khz
DwSdExecTrb: Command error. CmdIndex=1, IntStatus=104
EmmcIdentification: Executing Cmd1 fails with Device Error
Going by the error message, the problem seems related to the SD card, but in hindsight this SD card error was not the root cause.
Over several attempts, the stopping point was not always the same.
Sometimes the output stopped in the middle of a sentence, like a person interrupted while speaking.
On my commute home yesterday, I kept pondering why the point at which the debug output stops is unstable.
One possibility was that the debug output is buffered, so that when the system crashes, some of it is still in the buffer and has not been sent out.
After dinner, I checked the EDK2 source code and traced the call chain down to the function that writes to the serial port, SerialPortWrite.
UINTN
EFIAPI
SerialPortWrite (
  IN UINT8     *Buffer,
  IN UINTN     NumberOfBytes
  )
{
  UINTN  Result;
  UINT8  Data;

  if (Buffer == NULL) {
    return 0;
  }

  Result = NumberOfBytes;

  while ((NumberOfBytes--) != 0) {
    //
    // Wait for the serial port to be ready.
    //
    do {
      Data = IoRead8 ((UINT16) gUartBase + LSR_OFFSET);
    } while ((Data & LSR_TXRDY) == 0);
    IoWrite8 ((UINT16) gUartBase, *Buffer++);
  }

  return Result;
}
According to the logic of this function, once it starts sending a message it must finish sending it; otherwise the function does not return. This ruled out the explanation I had suspected on the way home.
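The polling behavior can be tried out on a desktop machine with a small simulation. This is only a sketch: SimUart, sim_read_lsr, sim_write_thr, and polled_write are hypothetical stand-ins invented here, not EDK2 code, and the value 0x20 for LSR_TXRDY is assumed from 16550-style UARTs. It shows that a polled write returns only after every byte has been handed to the transmitter, so no software buffer is left holding unsent text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LSR_TXRDY     0x20u  /* transmitter-ready bit, assumed 16550-style */
#define WIRE_CAPACITY 256u

/* Hypothetical stand-in for the UART hardware; not part of EDK2. */
typedef struct {
  unsigned busy_polls;           /* status reads left until the transmitter is free */
  char     wire[WIRE_CAPACITY];  /* bytes that have been handed to the transmitter */
  size_t   wire_len;
} SimUart;

/* Simulated status read: reports "not ready" while a byte is still going out. */
static uint8_t sim_read_lsr(SimUart *uart) {
  if (uart->busy_polls > 0) {
    uart->busy_polls--;
    return 0;
  }
  return LSR_TXRDY;
}

/* Simulated data write: the byte leaves the CPU and the transmitter goes busy. */
static void sim_write_thr(SimUart *uart, uint8_t byte) {
  if (uart->wire_len < WIRE_CAPACITY) {
    uart->wire[uart->wire_len++] = (char)byte;
  }
  uart->busy_polls = 3;
}

/* Same shape as SerialPortWrite: poll for ready, write one byte, repeat. */
static size_t polled_write(SimUart *uart, const uint8_t *buffer, size_t count) {
  size_t result = count;

  assert(uart != NULL);
  if (buffer == NULL) {
    return 0;
  }
  while (count-- != 0) {
    while ((sim_read_lsr(uart) & LSR_TXRDY) == 0) {
      /* busy-wait until the transmitter can take another byte */
    }
    sim_write_thr(uart, *buffer++);
  }
  return result;
}
```

The simulation leaves out any hardware transmit FIFO; it models only the software side that the function above controls.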
Today, the colleague who had been debugging this issue with me took a day off, so I continued to fight this strange problem alone.
On one hand, I kept thinking about why the debug output does not stop at a fixed point. During the firmware stage, usually only one CPU is working, and it executes in a relatively simple sequential manner, without complex concurrency. Given this, if the code had a defect such as a null pointer dereference, the CPU should crash at a fixed point, and the printed debug output should be the same every time.
On the other hand, I started to pay attention to another key detail: after each failure, the power button became unresponsive for a while, and not briefly; it could last for dozens of seconds.
Following these two clues, I made a bold hypothesis: the problem lies in the power supply logic. If the CPU is unexpectedly powered off, both its execution position and the debug output printed so far would vary randomly.
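The contrast between the two failure modes can be made concrete with a toy model. Everything in it is an assumption for illustration: the three boot messages are made up, each printed character counts as one "tick", and a software fault is modeled as stopping the CPU on entry to a given step. It is a simplification of the reasoning above, not a model of the real firmware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Made-up boot messages, printed strictly in this order on every boot. */
static const char *const kBootLog[] = {
  "RK806Init\n",
  "buck_set_voltage\n",
  "Loading RkSdmmcDxe\n",
};
#define BOOT_STEPS (sizeof(kBootLog) / sizeof(kBootLog[0]))

/* Characters printed if a software fault stops the CPU on entering failing_step.
 * The result depends only on the code path, so it is identical on every run. */
static size_t cutoff_for_fault(size_t failing_step) {
  size_t printed = 0;

  assert(failing_step <= BOOT_STEPS);
  for (size_t i = 0; i < failing_step; i++) {
    printed += strlen(kBootLog[i]);
  }
  return printed;
}

/* Characters printed if power disappears after power_cut_tick characters.
 * The tick is set by the power hardware, not by the code, so it can differ
 * from one run to the next. */
static size_t cutoff_for_power_cut(size_t power_cut_tick) {
  size_t total = cutoff_for_fault(BOOT_STEPS);
  return power_cut_tick < total ? power_cut_tick : total;
}

/* True if the output ends exactly at the end of a message, false if mid-message. */
static bool ends_on_message_boundary(size_t cutoff) {
  for (size_t step = 0; step <= BOOT_STEPS; step++) {
    if (cutoff_for_fault(step) == cutoff) {
      return true;
    }
  }
  return false;
}
```

In this model a fault at step 2 always cuts the output after the second message, while a power cut at, say, tick 13 leaves the second message half printed, and a different tick leaves a different fragment.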
Mobile devices such as tablets, laptops, and phones usually have a dedicated chip responsible for power logic, called a Power Management Integrated Circuit (PMIC). Years ago, when I was in Taipei supporting Intel’s tablet project, I dealt with this chip and came to understand its importance.
For Rockchip’s RK3588 platform, there are two PMIC solutions: one uses an RK806 plus an RK860 (a small glass chip less than 2 mm square), while the other uses two RK806 chips. The former costs less and is more common on small development boards such as OrangePi; the latter is more expensive. Ulan uses the latter.

(Ulan motherboard, with one RK806 on each side of the SoC)
The hardware that edk2-rk3588 currently supports consists mainly of boards such as the ROCK 5B and OrangePi. These development boards generally use a single RK806, which explains why the image we compiled boots successfully to the UEFI interface on small development boards.
Following this hypothesis, I asked the hardware engineers for help while analyzing the power-related code in depth, especially the parts that support the RK806 and RK860.
I had previously removed the RK860 code; today I tried adding it back, but the issue persisted.
In the afternoon, I continued analyzing the RK806 execution logic and its debug output.
RK3588InitPeripherals: Entry
RK806Init(605): base: FEB20000
SpiCongig(565): 0: 0
buck_set_voltage: volt=750000, buck=1, reg=0x1A, mask=0xFF, val=0x28
buck_set_voltage: volt=750000, buck=3, reg=0x1C, mask=0xFF, val=0x28
buck_set_voltage: volt=750000, buck=4, reg=0x1D, mask=0xFF, val=0x28
buck_set_voltage: volt=850000, buck=5, reg=0x1E, mask=0xFF, val=0x38
buck_set_voltage: volt=2000000, buck=7, reg=0x20, mask=0xFF, val=0xB5
buck_set_voltage: volt=3300000, buck=8, reg=0x21, mask=0xFF, val=0xE9
buck_set_voltage: volt=1800000, buck=10, reg=0x23, mask=0xFF, val=0xAD
nldo_set_voltage: volt=750000, ldo=1, reg=0x43, mask=0xFF, val=0x14
nldo_set_voltage: volt=850000, ldo=2, reg=0x44, mask=0xFF, val=0x1C
nldo_set_voltage: volt=750000, ldo=3, reg=0x45, mask=0xFF, val=0x14
nldo_set_voltage: volt=850000, ldo=4, reg=0x46, mask=0xFF, val=0x1C
nldo_set_voltage: volt=750000, ldo=5, reg=0x47, mask=0xFF, val=0x14
pd_e_otg: volt=1800000, ldo=1, reg=0x4E, mask=0xFF, val=0x68
pd_e_otg: volt=1800000, ldo=2, reg=0x4F, mask=0xFF, val=0x68
pd_e_otg: volt=1200000, ldo=3, reg=0x50, mask=0xFF, val=0x38
pd_e_otg: volt=3300000, ldo=4, reg=0x51, mask=0xFF, val=0xE0
pd_e_otg: volt=3300000, ldo=5, reg=0x52, mask=0xFF, val=0xE0
pd_e_otg: volt=1800000, ldo=6, reg=0x53, mask=0xFF, val=0x68
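The volt and val columns in this log follow a regular pattern, which the sketch below tries to capture. The formulas are fitted only to the pairs printed above; they are not taken from the RK806 datasheet or from the edk2-rk3588 source, and the function names are invented here. Treat it as a way to read the log, not as a description of the chip's register map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fitted to the buck_set_voltage lines above; NOT from the datasheet. */
static uint8_t buck_val_guess(uint32_t microvolts) {
  assert(microvolts >= 500000u && microvolts <= 3400000u);
  if (microvolts < 1500000u) {
    return (uint8_t)((microvolts - 500000u) / 6250u);         /* 6.25 mV steps */
  }
  return (uint8_t)(161u + (microvolts - 1500000u) / 25000u);  /* 25 mV steps */
}

/* Fitted to the nldo_set_voltage and pd_e_otg lines above; NOT from the datasheet. */
static uint8_t ldo_val_guess(uint32_t microvolts) {
  assert(microvolts >= 500000u && microvolts <= 3400000u);
  return (uint8_t)((microvolts - 500000u) / 12500u);          /* 12.5 mV steps */
}

typedef struct {
  uint32_t microvolts;
  uint8_t  val;
} LoggedPair;

/* (volt, val) pairs copied from the buck_set_voltage lines. */
static const LoggedPair kBuckPairs[] = {
  {750000u, 0x28}, {850000u, 0x38}, {2000000u, 0xB5},
  {3300000u, 0xE9}, {1800000u, 0xAD},
};

/* (volt, val) pairs copied from the nldo_set_voltage and pd_e_otg lines. */
static const LoggedPair kLdoPairs[] = {
  {750000u, 0x14}, {850000u, 0x1C}, {1800000u, 0x68},
  {1200000u, 0x38}, {3300000u, 0xE0},
};

/* Counts logged pairs that the fitted formulas fail to reproduce. */
static size_t count_mismatches(void) {
  size_t mismatches = 0;

  for (size_t i = 0; i < sizeof(kBuckPairs) / sizeof(kBuckPairs[0]); i++) {
    if (buck_val_guess(kBuckPairs[i].microvolts) != kBuckPairs[i].val) {
      mismatches++;
    }
  }
  for (size_t i = 0; i < sizeof(kLdoPairs) / sizeof(kLdoPairs[0]); i++) {
    if (ldo_val_guess(kLdoPairs[i].microvolts) != kLdoPairs[i].val) {
      mismatches++;
    }
  }
  return mismatches;
}
```

For example, 750000 µV on a buck gives (750000 - 500000) / 6250 = 40 = 0x28, matching the first buck_set_voltage line.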
Finally, I discovered the critical flaw: the current code configures only one RK806 and leaves the other unconfigured. So I referred to the kernel code and added the configuration for the second RK806.
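The shape of the flaw and of the fix can be sketched in a few lines. This is a hypothetical illustration, not the actual edk2-rk3588 or kernel code: the Pmic type and every function name are invented, and pmic_configure only sets a flag where the real code would program the chip's voltage rails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one PMIC on the board. */
typedef struct {
  const char *name;
  bool        configured;
} Pmic;

/* Placeholder for the real rail programming; here it only records the call. */
static void pmic_configure(Pmic *pmic) {
  assert(pmic != NULL);
  pmic->configured = true;
}

/* The flaw as described: only the first PMIC is ever configured. */
static void init_first_only(Pmic *pmics, size_t count) {
  if (count > 0) {
    pmic_configure(&pmics[0]);
  }
}

/* The fix as described: every PMIC present on the board is configured. */
static void init_all(Pmic *pmics, size_t count) {
  for (size_t i = 0; i < count; i++) {
    pmic_configure(&pmics[i]);
  }
}

/* Number of PMICs still left in their unconfigured state. */
static size_t count_unconfigured(const Pmic *pmics, size_t count) {
  size_t remaining = 0;
  for (size_t i = 0; i < count; i++) {
    if (!pmics[i].configured) {
      remaining++;
    }
  }
  return remaining;
}
```

On a board described with a single entry, init_first_only and init_all behave identically, which is consistent with the same image booting fine on single-RK806 development boards.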
After making this change, rebuilding, and flashing the image for a test, the debug output kept pouring out, passing every position where it had stalled over the past few days and reaching the expected stage of trying the various boot options. Then the familiar UEFI interface appeared. The major bug was gone.

In the division of labor within a computer system, firmware plays a crucial role in initializing hardware and handling hardware differences, including configuring PMIC chips and initializing the power supply. The importance of firmware code is therefore self-evident.
From a development perspective, firmware development is even lower-level than driver development, and those who do it must have the necessary hardware knowledge. For this reason, many of my old Intel colleagues are dedicated firmware engineers.
Bugs at different levels differ in severity. A problem in firmware-layer code can lead to power supply failures like the one described in this article, with the CPU liable to be powered off at any moment.

In summary, the debugging issue more challenging than audio stuttering is unexpected power loss.