Understanding the Underlying Framework of Display Technology

  • 1. The coupling of DPU and GPU is a historical artifact, and the two can be completely independent.

  • 2. Prototype design of DPU

    • 2.1 Four main components of DPU

    • 2.2 KMS and DPU

  • 3. Latest design of DPU

    • 3.1 Source Surface Pipes or Overlays

    • 3.2 Blender

    • 3.3 Destination surface post-processor

    • 3.4 Display Interface

  • 4. Summary

The DPU on a PC is embedded in the graphics card, whether discrete or integrated. Because GPUs have grown so powerful, the DPU is now essentially a bonus feature, but historically the GPU is the newer entity: the earliest hardware had only a DPU. This can be seen in the earliest Framebuffer mechanism, and the earliest versions of the DRM framework contained no GPU code at all.

The simplest function of the DPU is to output framebuffer data to the display device; in those early systems the framebuffer was produced by CPU software rendering, not by a GPU.
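That division of labor can be sketched in a few lines (a purely illustrative toy; the names `cpu_render` and `dpu_scanout` are invented for this example):

```python
# Toy model: the CPU software-renders into a framebuffer, and a "DPU"
# scans it out line by line to the panel. No GPU is involved anywhere.
WIDTH, HEIGHT = 8, 4  # tiny 8x4 RGB framebuffer for illustration

def cpu_render(width, height):
    """Software rendering: fill the framebuffer with a solid red color."""
    red = (255, 0, 0)
    return [[red for _ in range(width)] for _ in range(height)]

def dpu_scanout(framebuffer):
    """The display controller reads the buffer row by row to the panel."""
    return [row[:] for row in framebuffer]  # copy each scanline out

fb = cpu_render(WIDTH, HEIGHT)
panel = dpu_scanout(fb)
```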

[Figure]

The figure above offers little inspiration, as it is far removed from modern DPU design.

1. The coupling of DPU and GPU is a historical artifact, and the two can be completely independent.

DPU is used for control, and GPU is used for content.

[Figure]

The Linux DRI display framework also shows the relative independence of the two: KMS corresponds to the system-side compositor, while DRM rendering sits on the content-producing application side. The same holds for Android, where the GPU renders through DRM (although Qualcomm and Mali do not follow this open-source DRM framework) in application-side processes, while the DPU corresponds to KMS and runs on the server side: it is initialized at boot inside SurfaceFlinger (the compositor) and then remains unchanged, further separating the two.

Differences between Linux on PC and Android on mobile devices.

On PC, the coupling is still very strong: the DPU and GPU share video memory, and their code lives in one place. Buffer management (GEM/TTM) is naturally interoperable, and the default Linux code is merged into one block, which is a historical legacy. Android is different: the two are inherently separate, and ION is the standard buffer allocator on Android.

On the Linux platform: looking at Qualcomm Adreno's open-source Linux code, DPU and GPU are merged into one folder, drivers/gpu/drm/msm, with functions mostly separated: GPU-related pieces include adreno, msm_gpu.c, and msm_ringbuffer.c, while DPU-related pieces are disp, edp, hdmi, and so on. Some code remains coupled, however, such as msm_gem.c and msm_drv.c. GPU commands still go through standard or custom DRM interfaces.

For GPUs, the UMD (user-mode driver) is Mesa (Qualcomm provides no official Linux support).

On the Android platform: Qualcomm's official code lives in two completely different repositories with no shared code. The GPU driver sits in drivers/gpu/msm, configured as KGSL, while the DPU driver is a closed-source private library (accessible to OEM manufacturers). This also shows there is no tight logical connection between the two; they just pass a framebuffer.

For GPUs, the UMD is libGLES_xx.so (covering both GL and EGL), with no GEM or DRM; it is completely closed source, and OEMs cannot access it.

GPUs and DPUs can come from different manufacturers, but they are usually from the same company. Why?

Buffer sharing is more efficient: although buffers are shared through ION, they are often compressed to save DDR bandwidth, for example with Arm's AFBC or Qualcomm's UBWC.

Mixing manufacturers is still achievable: for Arm, Mali GPUs remain widely used today while Mali DPUs are now rare, so an AFBC decode module can be attached, as shown below. (Qualcomm has not lifted this restriction for UBWC.)
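A toy run-length codec makes the bandwidth argument concrete (illustrative only; real AFBC/UBWC are tile-based lossless codecs and far more sophisticated, but even this shows why flat UI regions cost little bandwidth):

```python
# Toy lossless compression: run-length encode a scanline of pixels.
def rle_encode(pixels):
    """Collapse consecutive identical pixel values into (count, value) runs."""
    out = []
    for p in pixels:
        if out and out[-1][1] == p:
            out[-1][0] += 1
        else:
            out.append([1, p])
    return out

def rle_decode(runs):
    """Expand (count, value) runs back into the original scanline."""
    out = []
    for count, p in runs:
        out.extend([p] * count)
    return out

flat = [0xFF0000] * 64          # a solid-color UI scanline compresses well
encoded = rle_encode(flat)      # 64 pixels collapse into a single run
```

A mostly-flat wallpaper or status bar shrinks dramatically, which is exactly the DDR traffic a framebuffer compressor saves.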

[Figure]

What basic functions should the DPU have?

The design of the DPU is simpler than that of the GPU because its functionality is fixed rather than programmable. It has essentially two basic functions.

  • 1) 2D acceleration (scaling, composition).

Traces of this remain in the earliest Linux code. Initially, 2D acceleration was done on the CPU; later it moved to the GPU. By the Android era, the GPU implemented a dedicated 2D module (some configurations even used dual GPUs, one reserved for 2D acceleration), and then a dedicated DPU appeared to replace the GPU's 2D module. GPUs subsequently dropped the dedicated 2D module, since 2D is a subset of 3D; a specially designed 2D block is somewhat more efficient, but not as effective as the DPU, so it was gradually phased out.
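The two classic 2D operations can be sketched as follows (a hypothetical single-channel toy; `scale_nearest` and `blend_over` are invented names, and real DPUs use far better filters than nearest-neighbor):

```python
# Toy 2D acceleration: nearest-neighbor scaling plus source-over alpha blend.
def scale_nearest(src, dst_w, dst_h):
    """Scale a 2D single-channel image by nearest-neighbor sampling."""
    src_h, src_w = len(src), len(src[0])
    return [[src[y * src_h // dst_h][x * src_w // dst_w]
             for x in range(dst_w)] for y in range(dst_h)]

def blend_over(dst_px, src_px, alpha):
    """Source-over blend of one channel: out = src*a + dst*(1-a)."""
    return round(src_px * alpha + dst_px * (1.0 - alpha))

layer = [[10, 20], [30, 40]]            # 2x2 single-channel layer
scaled = scale_nearest(layer, 4, 4)     # scaled up 2x (within the 1/4..4 range)
pixel = blend_over(100, 200, 0.5)       # blend a layer pixel over a background
```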

  • 2) Management of vout (connecting to LCD, HDMI, and other devices).

Below is a basic design prototype of the DPU, which includes four parts.

2. Prototype design of DPU

2.1 Four main components of DPU

This is a DPU design diagram from 2013, the year Android shipped its largest upgrade, 4.4 (perhaps the most successful generation). From the diagram below, the DPU design divides roughly into four parts:

[Figure]

1) Source Surface Pipes (also known as overlays; the terms are used interchangeably below): supports four overlay channels (V1-V4), multiple formats such as RGBX and YUV, scaling ratios from 1/4 to 4, and an alpha channel on every layer.

  • C1 and C2 are mouse layers, which are very important for PCs but are rarely used on mobile phones.
  • At that time, rotation was not supported;
  • Supports alpha blending of four layers, which was quite luxurious at the time. A surveillance device, for example, would not need such a design; some designs had 16 layers that looked impressive but only one with alpha support, which was of little use. Android systems use a particularly large number of alpha layers.

2) Blender: supports two blenders, corresponding to two paths (besides the LCD, a second path for DP or HDMI projection);

3) Destination surface post-processor: supports dithering and gamma adjustment; this part is becoming increasingly important.
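Gamma adjustment at this stage is typically a fixed lookup table applied to the final blended frame; a minimal sketch, assuming a 256-entry 8-bit LUT and a gamma of 2.2 chosen purely for illustration:

```python
# Toy destination post-processing: apply a per-channel gamma LUT to a scanline.
GAMMA = 2.2

# Precompute a 256-entry lookup table, as fixed-function hardware would.
lut = [round(255 * ((v / 255) ** (1 / GAMMA))) for v in range(256)]

def postprocess(scanline):
    """Run every 8-bit value of the final frame through the gamma LUT."""
    return [lut[v] for v in scanline]

out = postprocess([0, 128, 255])  # black and white map to themselves
```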

4) Display Interface: supports up to two output devices simultaneously (physical display devices; a virtual display requires no actual output device); supports LVDS, DSI, CVBS, HDMI, and other display devices;

A more detailed diagram of the DPU is shown below:

[Figure]

If placed in the Android system, we can look at the playback process of an HDR video to better see these four parts.

[Figure]

2.2 KMS and DPU

This diagram also matches the common DRM KMS framework diagram closely, indicating that KMS and DPU functionality are almost identical:

[Figure]
  • Source Surface Pipes: each overlay corresponds to a DRM Plane, and each overlay carries a DRM Framebuffer; in dumpsys SurfaceFlinger output, each layer is one overlay, one DRM Framebuffer:
```
-----------------------------------------------------------------------------------------------------------------------------------------------
 Layer name
           Z |  Window Type |  Layer Class |  Comp Type |  Transform |   Disp Frame (LTRB) |          Source Crop (LTRB) |     Frame Rate (Explicit) [Focused]
-----------------------------------------------------------------------------------------------------------------------------------------------
 com.android.systemui.ImageWallpaper#0
  rel      0 |         2013 |            0 |     DEVICE |          0 |    0    0 1080 2400 |    0.0    0.0 1080.0 2400.0 |                              [ ]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 com.miui.home/com.miui.home.launcher.Launcher#0
  rel      0 |            1 |            0 |     DEVICE |          0 |    0    0 1080 2400 |    0.0    0.0 1080.0 2400.0 |                              [*]
```
  • CRTC: corresponds to a Path; each Path has at least one Blender, and the DSPP also sits here (as shown in the diagram below). All layers are alpha blended and output to one display path, usually in RGB24 or RGB30 (10-bit) format; this is the content displayed on the screen.
  • Display Interface: the Encoder and Connector relate to the display device; ultimately this RGB data is transmitted to the display over MIPI or DP, and the implementation of these protocols is also completed within the DPU module.
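The RGB30 format mentioned above can be illustrated by packing three 10-bit channels into one 32-bit word (the bit layout below is an assumption for illustration, not any hardware's actual spec):

```python
# Toy RGB30 packing: 10 bits per channel in a 32-bit word (R high, B low).
def pack_rgb30(r, g, b):
    """Pack three 10-bit channel values into a single 32-bit word."""
    assert all(0 <= c < 1024 for c in (r, g, b))  # 10-bit range check
    return (r << 20) | (g << 10) | b

def unpack_rgb30(word):
    """Recover the three 10-bit channels from a packed word."""
    return (word >> 20) & 0x3FF, (word >> 10) & 0x3FF, word & 0x3FF

word = pack_rgb30(1023, 512, 0)  # full red, half green, no blue
```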
[Figure]

3. Latest design of DPU

3.1 Source Surface Pipes or Overlays

1) Pipes (also called overlays) generally come in two types (this is hardly a trend; Qualcomm has had both from the beginning):

  • One supports complex functions such as scaling, rotation, and sharpening, and is known as a Video overlay (a video overlay can be applied to any layer; "video" is just a name, best suited to games and videos). Scaling and sharpening here are per-layer, distinct from scaling the entire screen.
  • The other is the simpler Graphic overlay, which supports format conversion and alpha;

2) Supports larger input resolutions, such as 4K input, which requires a higher DPU clock frequency;

3) With the emergence of XR (AR, VR) devices, single-eye 4K has already appeared (so the DPU must support 8K input), which creates great bandwidth pressure. The usual approach is therefore not to feed 4K in directly but to split it into two 2K streams (which, of course, halves the number of available layers); this is the Split function. (It is not a new feature, since 4K video appeared many years ago.)
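The Split idea reduces to dividing each frame between two pipes; a trivial sketch with invented names, one short scanline standing in for a full frame:

```python
# Toy Split function: divide a wide scanline between two pipes, then reassemble.
def split_scanline(line):
    """Left half goes to one pipe, right half to the other."""
    mid = len(line) // 2
    return line[:mid], line[mid:]

def merge_halves(left, right):
    """Recombine the two independently processed halves."""
    return left + right

line = list(range(8))               # stand-in for a 4K-wide scanline
left, right = split_scanline(line)  # each pipe now handles half the bandwidth
```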

4) Supports rotation, mainly used for video playback; it is generally unused in other scenarios, where the GPU pre-rotates instead. (Mali DP650 supported this, but the bandwidth impact is too large; judging from the open-source kernel code, Mali's display processors have not been updated since DP650.)

5) The number of pipes keeps increasing, to 8 or even 16 (it is unlikely to grow beyond that).

  • For mobile phones, at least 6 are needed: 1. the main activity (2 layers); 2. the status and navigation bars (2 layers); 3. rounded corners (2 layers; Qualcomm has optimizations for corner areas, which never change);
  • For TV-box applications, scaling must be considered: every layer gets scaled, so destination scaling is needed, not just source scaling.

6) Supports compressed formats (UBWC or AFBC), reducing memory bandwidth, especially the interchange bandwidth with the GPU.

[Figure]

Summary: these technologies have been around for many years, and few future trends are visible except on the third point, since XR's pursuit of resolution is not over. Single-eye 8K will come, requiring the DPU to support 16K input, and that bandwidth pressure is enormous (especially when scaling down); even split into two 8K streams the pressure remains significant, so whether future designs will move to two DPUs is an open question.

3.2 Blender

1) The number of layers for composition is increasing, such as supporting composition of 10 layers (most layers do not actually overlap);
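Since most layers do not overlap, a compositor can skip blending for layers whose display frames never intersect; a minimal sketch using (L, T, R, B) rectangles, with hypothetical layer geometry for a 1080x2400 phone:

```python
# Toy overlap test: only intersecting display frames need alpha blending.
def rects_overlap(a, b):
    """True if two (left, top, right, bottom) rectangles intersect."""
    al, at, ar, abot = a
    bl, bt, br, bbot = b
    return al < br and bl < ar and at < bbot and bt < abot

# Hypothetical layer geometry on a 1080x2400 screen.
status_bar = (0, 0, 1080, 80)
nav_bar = (0, 2320, 1080, 2400)
wallpaper = (0, 0, 1080, 2400)

disjoint = rects_overlap(status_bar, nav_bar)   # these never overlap
covered = rects_overlap(status_bar, wallpaper)  # these do overlap
```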

2) The number of composition paths is increasing, such as supporting 4 (using 3 simultaneously is already very rare).

  • WFD (virtual display device) also counts as a path; for XR, every 2D application is implemented through WFD, and WFD relies on the DPU's writeback function, which generally supports only one path. With multiple WFDs, the extra ones can only be implemented on the GPU.
  • If there is future development here, it will be whether to add writeback paths; if that is not cost-effective, a single virtual display device will have to host all 2D applications.

3) Supports 3D (distinguishing left and right eyes); 3D functionality has been popular for many years, so this is not new technology.

4) Dim layer: a common scene on Android; a layer of uniform color in which only the gray value changes, nothing else;

  • If you understand DC dimming on OLED screens, you will recognize that Oppo's initial solution added a dim layer and adjusted its gray value to make the screen appear less bright, thereby avoiding PWM dimming.
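The dim-layer trick can be sketched as blending a black layer with adjustable alpha over the frame (a toy per-pixel model; `apply_dim` is an invented name):

```python
# Toy dim layer: blend black over each pixel, so perceived brightness drops
# without changing the panel's backlight or PWM duty cycle.
def apply_dim(pixel, dim_alpha):
    """Blend a black layer over one channel: out = pixel * (1 - dim_alpha)."""
    return round(pixel * (1.0 - dim_alpha))

dimmed = apply_dim(200, 0.25)    # 25% dim darkens a pixel from 200 to 150
untouched = apply_dim(200, 0.0)  # zero alpha leaves the image unchanged
```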

5) Background color: for monochrome images, there are also some optimization schemes.

Summary: items 4 and 5 are purely scenario-driven optimizations added to save power, and the savings are small. Future XR development may further optimize the writeback function.

3.3 Destination surface post-processor

Initially, post-processing covered only dithering, gamma correction, and brightness, contrast, and saturation adjustments, making it the least important of the four modules; in recent years it has become the fastest-growing one. Many flagship phones now add an independent display chip such as PixelWorks (hereafter PW), advertising MEMC, HDR, sunlight visibility (CABL), eye-protection mode, Demura, and super-resolution; Qualcomm has all of these in its own post-processing.

1) Super-resolution and sharpening.

Here, super-resolution refers to the Destination Scaler, which operates on the entire screen, unlike the per-layer super-resolution in the source pipes, although the algorithms are the same.

Platforms now rarely use simple bilinear interpolation, preferring their own algorithms; MTK claims to support AI super-resolution, but the results have not been impressive.
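For reference, the bilinear baseline can be sketched as follows (single-channel, edge-clamped; a toy implementation, not any platform's actual scaler):

```python
# Toy bilinear upscaler: the simple baseline that newer destination scalers
# improve on. Single-channel, with destination coords mapped back to source.
def bilinear_upscale(src, dst_w, dst_h):
    src_h, src_w = len(src), len(src[0])
    out = []
    for j in range(dst_h):
        fy = j * (src_h - 1) / (dst_h - 1) if dst_h > 1 else 0.0
        y0 = int(fy); y1 = min(y0 + 1, src_h - 1); wy = fy - y0
        row = []
        for i in range(dst_w):
            fx = i * (src_w - 1) / (dst_w - 1) if dst_w > 1 else 0.0
            x0 = int(fx); x1 = min(x0 + 1, src_w - 1); wx = fx - x0
            top = src[y0][x0] * (1 - wx) + src[y0][x1] * wx
            bot = src[y1][x0] * (1 - wx) + src[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out

tile = [[0, 10], [10, 20]]
up = bilinear_upscale(tile, 3, 3)  # corners preserved, midpoints averaged
```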

On PC there are NVIDIA's AI-based DLSS and AMD's conventional FSR, both well received online. On mobile, however, either the power consumption is too high or, in high-PPI scenarios, the benefit of super-resolution is hard to see (FSR, which performs well on PC, really does not perform well on mobile).

As XR's demand for resolution continues to rise, this capability will keep developing and is a clear future direction.

2) HDR support, and SDR-to-HDR conversion, are basic operations.

3) Brightness adjustment: unlike Android's ambient-light-based adjustment, this is content-based backlight adjustment, divided into indoor and outdoor cases. Indoors, where light is not strong, CABL and FOSS are used, targeting LCD and OLED screens respectively; outdoors, Arm's sunlight-visibility technology is used. Qualcomm later replaced the latter with its own Local Tone Mapper strategy (usable both indoors and outdoors), which mainly enhances detail in dark areas without letting highlights become oversaturated.
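A minimal stand-in for such a tone mapper, assuming a simple global power curve rather than Qualcomm's actual (local and far more elaborate) algorithm:

```python
# Toy global tone curve: lift dark regions while leaving highlights
# nearly unchanged. Values are normalized to [0, 1].
def lift_shadows(v, strength=0.5):
    """Apply a power curve with exponent below 1 to brighten shadows."""
    return v ** (1.0 - strength)

dark, bright = 0.1, 0.9
lifted_dark = lift_shadows(dark)      # shadows gain the most
lifted_bright = lift_shadows(bright)  # highlights barely move
```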

4) MEMC: a standard feature on TVs, now appearing on mobile phones, mainly for video. It was the most important reason PW was initially brought into phones: interpolating 30 fps video up to a smooth 60 fps.
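A naive sketch of frame interpolation, assuming simple average blending; real MEMC performs motion estimation and compensation rather than blending:

```python
# Toy frame interpolation: insert an averaged frame between each pair,
# doubling 30 fps to 60 fps. Frames are flat lists of pixel values.
def interpolate_frames(frames):
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        out.append([(x + y) / 2 for x, y in zip(a, b)])  # inserted frame
    out.append(frames[-1])
    return out

video_30fps = [[0, 0], [10, 20], [20, 40]]
video_60fps = interpolate_frames(video_30fps)
```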

5) Demura: a must-have process on OLED.

Summary: doing the same work inside the DPU also lowers power consumption. PW sits after the post-processing interface module, so it consumes more power; if the DDIC does the work, consumption is higher still: the closer to the front of the pipeline, the lower the power. It is a matter of manufacturing process as well as pipeline position, so PW's value lies in whether its algorithm capabilities can exceed Qualcomm's or MTK's.

3.4 Display Interface

There are many aspects here that I will not list in full (MIPI will get its own discussion later), but it is clear that future development again lies in XR.

[Figure]

4. Summary

The DPU divides into four parts, and its functions have become quite stable; among them, display post-processing is the key area for future upgrades (with super-resolution and sharpening the main optimization targets);

XR will greatly influence the development of the DPU: whether it is the bandwidth pressure brought by rising resolution or newer techniques such as foveated (gaze-point) transmission, the DPU will need to change significantly.
