Arm’s Bifrost Architecture and Mali-G52: The Design Philosophy of Low-Power GPUs

Table of Contents

1. Introduction: The Unique Positioning of Arm Mali GPUs
2. Software API: The Evolution from VLIW to Scalar Dual-Dispatch
3. GPU Architecture: Highly Configurable Modular Design
4. Shader Frontend: Clause-Based Execution and Thread Scheduling
5. Register Access and Execution Pipeline: Flexible Data Processing Capabilities
6. Memory Pipeline: Independent Paths for Textures and Load/Store
7. L2 Cache and System Hierarchy: Challenges of Chip Integration
8. Rasterization and Tiled Rendering: Bandwidth Optimization Tools for Mobile GPUs
9. Computational Performance: FluidX3D Real-World Application Testing
10. Power Management: Fine-Grained Power Control Strategies
11. Conclusion: Market Value and Technical Features of the Bifrost Architecture

1. Introduction

Arm is renowned for its Cortex CPU series. However, Arm has now expanded its business to various licensed IP modules, covering interconnect architectures, IOMMUs, and even GPUs. The reason GPUs have become an interesting topic of discussion is that they have evolved into highly programmable complex components, comparable in complexity to CPUs. Similar to CPU performance, GPU performance is highly visible to users and constitutes an important part of device specifications.

Arm Mali GPUs target low power and embedded devices, a characteristic that is consistent with Arm Cortex CPUs. As a GPU, Mali needs to address the same fundamental issues faced by high-performance discrete GPUs familiar to gamers and PC enthusiasts. Graphics processing itself is highly parallel, making it well-suited for hardware architectures capable of tracking a large number of parallel tasks and mapping them to large-scale execution units. However, power consumption and chip area constraints force architects to finely tune their parallelization strategies. The power budget for low-end laptop GPUs is at most a few dozen watts, and exceeding 6W of power consumption becomes untenable in mobile phones or tablets. Chip area constraints are equally stringent, as integrated GPUs must share limited chip space with CPUs and numerous accelerator modules.

Amlogic S922X Architecture

Another significant difference from AMD, Intel, and Nvidia GPUs is that Mali is licensed as an independent IP module, and Arm does not directly participate in the chip design process. In practice, chip manufacturers purchase Arm IP and integrate it with various third-party IP modules to meet chip-level goals. This business model makes Mali unique in the GPU field. Mali is solely responsible for 3D rendering and parallel computing: it neither provides hardware acceleration for video encoding/decoding nor can drive displays on its own, even though PC users expect both from a GPU. However, leaving such functionality out of the Mali package lets chip manufacturers choose their own video and display engines. In theory, manufacturers could even omit the video and display blocks entirely and use Mali purely as a compute accelerator. This lack of control over chip design presents additional challenges: Mali must perform well across the broadest range of application scenarios to expand its customer base, yet Arm cannot control the chip-level memory subsystem that is critical to achieving this.

Arm Mali Graphics Architecture

2. Software API: The Evolution from VLIW to Scalar Dual-Dispatch

Bifrost is Arm’s second-generation unified shader architecture, introduced around 2016. Its predecessor, Midgard, adopted unified shaders well after AMD, Intel, and Nvidia had done so in their GPU product lines. This article discusses data from the Mali-G52 as implemented in the Amlogic S922X. Because the Mali-G52 is a very small GPU, Qualcomm’s Adreno 615, as implemented in the Snapdragon 670, serves as a point of comparison.

The GPU programming interface underwent significant changes from the late 2000s to the early 2010s. Graphics APIs fully transitioned to a unified shader model, allowing different shading stages to run on the same execution pipeline. As the flexibility of execution pipelines increased, GPU computing technology rapidly emerged. Arm’s Midgard rode the wave of programmability by supporting OpenGL ES 3.0, Vulkan, and OpenCL 1.1. Although Midgard is compatible with modern APIs, its VLIW4 architecture has performance vulnerabilities—compilers often struggle to extract sufficient instruction-level parallelism to fill the VLIW4 instruction bundle, especially when processing computational code. Even in graphical code, Arm noted that three-component vectors are extremely common, which can lead to one component of the VLIW4 being idle.

Four-Way Parallel Execution

Bifrost adopts a four-way parallel execution model

  • Four scalar threads execute in lockstep as “quads”
  • Each pipeline stage executes one quad simultaneously
  • Each thread occupies a 32-bit computation channel in hardware
  • Four threads executing vec3 FP32 addition require three cycles

Effectively improving utilization

  • Four-way vectorization is compiler-friendly
  • Each thread only processes scalar operation streams
  • Vector operations can always be decomposed into scalar forms

To overcome the shortcomings of Midgard, Bifrost shifted to a scalar dual-dispatch execution model. From a single thread’s perspective, registers changed from 4×32-bit vectors to 32-bit scalars. Bifrost no longer relies on a single thread to fill four FP32 computation channels; instead, it fills the execution engine’s 4 or 8 channels with multiple API threads. Each Bifrost channel feeds both an FMA and an FADD execution pipeline, so it still benefits from instruction-level parallelism. However, packing two operations into a single instruction is much simpler than packing four, and Arm hopes to achieve more consistent performance with a simpler compiler.
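
The utilization argument can be made concrete with a small calculation. The sketch below is a simplified model, not a description of the actual hardware scheduler: it assumes a VLIW4 thread issues one vector operation per bundle with nothing else to co-issue, and that a Bifrost quad spends one cycle per vector component. The function names are illustrative.

```python
import math


def vliw4_utilization(vector_width: int, channels: int = 4) -> float:
    """Fraction of VLIW channels doing useful work for one vector operation.

    Simplifying assumption: the compiler finds nothing else to co-issue,
    so unused channels in a bundle stay idle.
    """
    bundles = math.ceil(vector_width / channels)
    return vector_width / (bundles * channels)


def quad_cycles(vector_width: int) -> int:
    """Cycles a quad of scalar threads needs for one vector operation.

    Each cycle handles one component for all four threads, so every
    channel is busy in every cycle.
    """
    return vector_width


print(vliw4_utilization(3))  # 0.75
print(quad_cycles(3))        # 3
```

Under these assumptions, a vec3 addition leaves a quarter of the VLIW4 channels idle, while the quad keeps all four channels busy for three cycles.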

The transition from SIMD to scalar execution within each API thread in Bifrost is reminiscent of AMD’s transition from TeraScale to GCN, both aiming to enhance the performance consistency of computational workloads.

3. GPU Architecture: Highly Configurable Modular Design

Although Midgard and Bifrost differ significantly at the execution pipeline level, their overall architectural hierarchy is quite similar. The Bifrost execution engine (EE) contains the execution pipelines and a register file, and replaces Midgard’s arithmetic pipelines. In terms familiar from other vendors, the EE roughly corresponds to Intel’s execution units (EUs) or AMD’s SIMD units.

Multiple EEs combine to form a shader core (SC). A messaging network within the SC connects the EEs to the memory pipelines and other shared fixed-function hardware. The SC’s texture and load/store units have their own level 1 caches, making the SC comparable to Intel’s sub-slices, AMD’s CU/WGP, or Nvidia’s SM. One difference is that Bifrost places the pixel backend at the shader core level, whereas desktop GPU architectures typically place it at a higher partition level; Bifrost has no partition level above the shader core.

A hallmark of the Mali architecture is its extremely rich set of GPU scaling options. In addition to adjusting the number of shader cores, Arm can vary the number of EEs per SC, the cache capacity, and the ROP/TMU throughput.

Arm Mali-G51 Configurable Single EE SC with 1 Group of TMU/ROP, or Up to Three Dual EE SCs with 2 Groups of TMU/ROP

This flexibility extends to the EE units themselves: they can run 4-wide or 8-wide thread bundles and are equipped with corresponding width execution pipelines. As a result, Arm can finely tune GPU scaling across multiple dimensions, accurately hitting performance, power, and area targets. In contrast, AMD and Nvidia typically use a unified WGP/SM structure from integrated GPUs to flagship products over 300W.

Diagram of Mali-G52 Architecture in Amlogic S922X

Bifrost can theoretically scale to 32 shader cores. If all are configured as three-EE SCs, the GPU can provide 1.23 TFLOPS of FP32 FMA throughput at 800MHz, comparable to Intel’s largest Skylake GT4e configuration. While not high by discrete GPU standards, that far exceeds the capabilities of ordinary mobile phones and tablets. The Mali-G52 in the Amlogic S922X is a small Bifrost configuration: two three-EE SCs running at 800MHz, with each EE 8 wide. Qualcomm’s Adreno, by contrast, scales by changing the number of shader processors and the size of each uSPTP, and Adreno 6xx’s execution unit partitions are 64 or 128 wide.
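
The peak figures quoted here follow from multiplying out the configuration. The sketch below assumes 8-wide EEs in the 32-core case (the text gives only the core count, EE count, and clock) and counts each FMA as two floating-point operations; the function name is illustrative.

```python
def peak_fp32_fma_gflops(shader_cores: int, ees_per_core: int,
                         channels_per_ee: int, clock_ghz: float) -> float:
    """Peak FP32 FMA throughput in GFLOPS.

    Each channel retires one FMA per cycle, and one FMA counts as two FLOPs.
    """
    return shader_cores * ees_per_core * channels_per_ee * 2 * clock_ghz


# Hypothetical maximum configuration: 32 SCs, three 8-wide EEs each, 800MHz.
max_config = peak_fp32_fma_gflops(32, 3, 8, 0.8)
# Mali-G52 in the Amlogic S922X: 2 SCs, three 8-wide EEs each, 800MHz.
mali_g52 = peak_fp32_fma_gflops(2, 3, 8, 0.8)

print(f"32-core Bifrost: {max_config / 1000:.2f} TFLOPS")  # 1.23 TFLOPS
print(f"Mali-G52:        {mali_g52:.1f} GFLOPS")           # 76.8 GFLOPS
```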

Diagram of Adreno 615 Architecture in Snapdragon 670 (ROP throughput unknown, temporarily indicated by red blocks)

All shader cores of the Bifrost GPU share an L2 cache, connected to the rest of the system via a standard ACE memory bus—Arm’s technical influence stops here.

4. Shader Frontend: Clause-Based Execution and Thread Scheduling

While the instruction cache capacity of Bifrost has not been disclosed, tests show that instruction throughput peaks when the loop body contains 512 or fewer FP additions, and drops a second time once the loop body exceeds 1280 FP additions. Bifrost uses a 78-bit instruction format, with each instruction containing two operations, one each for the FMA and FADD pipelines. The Arm compiler can issue FP additions to both pipelines simultaneously. Each additional FP addition statement increases the compiled binary size by 6-7 bytes: FMA+FADD packing reduces size, while clause and quadword headers add overhead. Based on how the binary grows relative to its baseline size, the instruction cache capacity is likely around 8KB, comparable to Qualcomm’s Adreno 6xx.
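
As a rough cross-check of the 8KB estimate, the loop sizes and per-statement growth quoted above can be multiplied out. This is only a sketch: it ignores any fixed baseline code around the loop, and the function name is made up for illustration.

```python
def loop_footprint_bytes(fp_adds: int, bytes_per_add: float) -> float:
    """Approximate code size of a loop body made of `fp_adds` FP additions."""
    return fp_adds * bytes_per_add


for adds in (512, 1280):
    low = loop_footprint_bytes(adds, 6)   # lower bound: 6 bytes per addition
    high = loop_footprint_bytes(adds, 7)  # upper bound: 7 bytes per addition
    print(f"{adds} adds: {low / 1024:.2f} to {high / 1024:.2f} KiB")
```

The 1280-addition loop lands at roughly 7.5 to 8.75 KiB, which brackets the 8KB estimate.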

Each execution engine can track the state of up to 16 thread bundles, each consisting of 8 API threads executing in lockstep. The hardware hides latency by switching between thread bundles, similar to SMT on CPUs. Bifrost uses a clause-based instruction set architecture to simplify scheduling: instructions packed into a clause execute atomically, with architectural state well defined only between clauses, and each clause may contain only one instruction that accesses long-latency or variable-latency units (such as the memory pipelines) outside the EE.

Memory dependencies are managed between clauses, requiring instructions that need to access memory data to be placed in separate clauses. Six software-managed scoreboards clearly define cross-clause dependencies. This design significantly reduces the pressure on scheduling hardware, as it only needs to query the scoreboard at clause boundaries rather than judging each instruction individually. Bifrost shares a similar philosophy with AMD’s TeraScale (which also uses clauses), but the implementation details differ: TeraScale groups instructions by type (e.g., mathematical instructions into ALU clauses, memory accesses into texture/vertex fetch clauses).
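
A toy model can illustrate why checking dependencies only at clause boundaries is cheap. In the sketch below, the Clause fields (waits_on, sets_slot), the fixed latency, and the one-cycle issue cost are all assumptions made for illustration; only the count of six scoreboard slots comes from the text, and the real Bifrost encoding is not modeled.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple

NUM_SLOTS = 6  # scoreboard slots, per the text


@dataclass(frozen=True)
class Clause:
    name: str
    waits_on: FrozenSet[int] = frozenset()  # slots that must be clear before issue
    sets_slot: Optional[int] = None         # slot marked busy by a long-latency op


def schedule(clauses: List[Clause], latency: int = 3) -> List[Tuple[int, str]]:
    """Issue clauses in order for one thread bundle; return (cycle, name) pairs."""
    clears_at = {}  # slot -> cycle at which the pending operation completes
    cycle = 0
    issued = []
    for clause in clauses:
        # The only dependency check: one scoreboard lookup per clause.
        ready = max((clears_at.get(slot, 0) for slot in clause.waits_on), default=0)
        cycle = max(cycle, ready)
        issued.append((cycle, clause.name))
        if clause.sets_slot is not None:
            if not 0 <= clause.sets_slot < NUM_SLOTS:
                raise ValueError("scoreboard slot out of range")
            clears_at[clause.sets_slot] = cycle + latency
        cycle += 1  # assumed: one cycle to issue a clause
    return issued


program = [
    Clause("load", sets_slot=0),
    Clause("independent math"),
    Clause("use loaded value", waits_on=frozenset({0})),
]
print(schedule(program))
```

In this model the third clause waits until slot 0 clears, while the independent clause issues immediately. On real hardware the EE would switch to another thread bundle during such a wait.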

From a programmer’s perspective, the Mali-G52 theoretically supports up to 768 active work items (8 channels per thread bundle × 16 thread bundles per EE × 6 EEs). The actual number of active threads (the occupancy) depends on available parallelism and register usage: the Bifrost ISA provides a maximum of 64 registers, but using more than 32 halves the theoretical occupancy (implying a register file capacity of 16KB per EE), with no intermediate allocation steps. In contrast, Qualcomm’s Adreno 6xx reaches maximum occupancy only when each thread uses no more than 12 registers.
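
The occupancy rule described above is simple enough to write down directly. The sketch below uses the figures from the text (8 channels, 16 thread bundles per EE, 6 EEs, 32-bit registers); the function and constant names are illustrative.

```python
CHANNELS_PER_BUNDLE = 8   # threads per thread bundle on an 8-wide EE
MAX_BUNDLES_PER_EE = 16
EE_COUNT = 6              # Mali-G52 in the S922X: 2 SCs x 3 EEs
REGISTER_BYTES = 4        # 32-bit scalar registers


def bundles_per_ee(registers_per_thread: int) -> int:
    """Thread bundles an EE can hold; more than 32 registers halves the count."""
    if not 1 <= registers_per_thread <= 64:
        raise ValueError("Bifrost exposes at most 64 registers per thread")
    if registers_per_thread <= 32:
        return MAX_BUNDLES_PER_EE
    return MAX_BUNDLES_PER_EE // 2


def active_work_items(registers_per_thread: int) -> int:
    return CHANNELS_PER_BUNDLE * bundles_per_ee(registers_per_thread) * EE_COUNT


def implied_register_file_bytes() -> int:
    """Register file size needed for full occupancy at 32 registers per thread."""
    return MAX_BUNDLES_PER_EE * CHANNELS_PER_BUNDLE * 32 * REGISTER_BYTES


print(active_work_items(32))          # 768
print(active_work_items(33))          # 384
print(implied_register_file_bytes())  # 16384
```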

5. Register Access and Execution Pipeline: Flexible Data Processing Capabilities

Instruction execution is divided into three stages: register access, FMA, and FADD. In the register access stage, Bifrost instructions directly control operand collectors to read input data and write the results of previous instructions. Each EE’s register file is set with four ports: two dedicated to reading, one for writing, and another that can read/write generically. Since the FMA/FADD pipeline nominally requires five input sources, register read bandwidth is very limited. If the previous instruction needs to write both FMA and FADD results simultaneously, the register bandwidth constraints become even more severe.

To relieve pressure on register file bandwidth, Bifrost can source inputs from a uniform/constant port, which provides 1×64-bit or 2×32-bit immediate values. It also introduces “temporary registers”, which are essentially software-controlled forwarding paths that hold the results of the previous instruction. Finally, since the FADD unit sits in a later pipeline stage than the FMA unit, FADD can directly use the FMA output as an input.

The temporary registers, software-managed operand collectors, and register bandwidth constraints in Bifrost will look familiar to developers who know AMD’s TeraScale 2 architecture: TeraScale 2 feeds five VLIW channels with 12 register file inputs (while needing up to 15 inputs), and its compiler likewise combines register reuse, temporary registers, and constant reads to keep the execution units fed. As in Bifrost, TeraScale’s PV/PS temporary registers are only valid between consecutive instructions in the same clause, which reduces both register bandwidth and register allocation demands. The difference is that result write-back on TeraScale 2 does not share register file bandwidth with reads, so using temporary registers is not as critical as it is on Bifrost.

Bifrost execution engine functional units retain support for narrow-width data types. FP16 performance is particularly beneficial for pixel shading.

Bifrost’s execution pipeline demonstrates remarkable flexibility when processing different data types, maintaining effective 256-bit vector execution (4-wide EE variants are 128 bits) when using 32/16/8-bit data types. During Bifrost’s development, machine learning research was booming, and Arm did not introduce dedicated matrix computation units like Nvidia Volta, but ensured that vector execution units could increase throughput as precision decreases.
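
The relationship between data width and throughput follows from dividing the vector width by the element width. The sketch below assumes throughput scales exactly with packing density, which is the idealized case; the function name is illustrative.

```python
def elements_per_cycle(vector_bits: int, element_bits: int) -> int:
    """Number of elements processed per cycle when packed into a fixed-width vector."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the vector width")
    return vector_bits // element_bits


for name, vector_bits in (("8-wide EE", 256), ("4-wide EE", 128)):
    rates = {bits: elements_per_cycle(vector_bits, bits) for bits in (32, 16, 8)}
    print(name, rates)
```

For the 8-wide EE this gives 8, 16, and 32 elements per cycle for 32-bit, 16-bit, and 8-bit types respectively.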

Bifrost incurs no divergence penalty when eight API thread branches are consistent, while Adreno 6xx requires 64 API threads to be directionally consistent

Qualcomm’s Adreno 615 takes a different approach to execution: its 64-wide thread bundles are paired with execution units of equal width. This makes Adreno 615 more susceptible to branch divergence, but lets Qualcomm control more parallel execution units with a single instruction. Adreno 615 has 128 FP32 channels across the GPU (all capable of multiply-accumulate), but runs at only 430MHz. Mali-G52 executes only 48 FP32 FMA operations per clock cycle, but it can reach 96 FP32 operations per cycle through FMA+FADD dual dispatch. Combined with its higher 800MHz clock, that makes Mali-G52’s FP addition throughput comparable to Adreno 615’s. For multiply-accumulate operations, however, Adreno 615 does better, peaking at just over 100 GFLOPS.
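
The paper figures behind this comparison can be worked out from the per-clock rates and clock speeds given above. This is a sketch of theoretical peaks only, with illustrative names; measured results depend on the compiler and on how well each GPU keeps its pipelines fed.

```python
def giga_ops_per_second(ops_per_clock: int, clock_mhz: int) -> float:
    """Theoretical peak rate in billions of operations per second."""
    return ops_per_clock * clock_mhz / 1000


MALI_CLOCK_MHZ = 800
ADRENO_CLOCK_MHZ = 430

# A multiply-add counts as two floating-point operations per channel.
mali_fma_gflops = giga_ops_per_second(48 * 2, MALI_CLOCK_MHZ)
adreno_mad_gflops = giga_ops_per_second(128 * 2, ADRENO_CLOCK_MHZ)

# Plain FP additions: Mali can use both the FMA and FADD pipelines (96 per clock).
mali_add_gops = giga_ops_per_second(96, MALI_CLOCK_MHZ)
adreno_add_gops = giga_ops_per_second(128, ADRENO_CLOCK_MHZ)

print(f"Multiply-add: Mali {mali_fma_gflops:.2f}, Adreno {adreno_mad_gflops:.2f} GFLOPS")
print(f"FP add:       Mali {mali_add_gops:.2f}, Adreno {adreno_add_gops:.2f} Gops/s")
```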

Note that this is “multiply-accumulate” rather than “fused multiply-accumulate”: the latter rounds only once, after both the multiplication and the addition (keeping the intermediate result at higher precision), which improves accuracy. Adreno evidently lacks fast-path FMA hardware, since requesting FMA precision through the OpenCL fma function costs over 600 cycles per thread bundle. Bifrost does not have this problem. Both mobile GPUs perform well with FP16, executing it at twice the FP32 rate.

Special functions (such as reciprocal square root) are executed through Bifrost’s FADD pipeline at half the rate of base operations (or 1/4 when considering FMA+FADD dual dispatch). Compared to Midgard, Arm optimized Bifrost’s handling of complex operations: only the most common built-in functions exposed by APIs like OpenCL are handled by a single instruction, while more complex special operations require multiple instructions. Adreno executes special functions at an even lower rate, only 1/8.

Integer operations in Bifrost are split much like floating-point operations: integer addition goes through the FADD pipeline, while multiplication uses the FMA pipeline. Adreno’s more uniform design gives it the advantage in integer addition, while Bifrost’s full-rate integer multiplication gives it the advantage there.

Low-precision INT8 operations perform very well on Bifrost but expose a “glass jaw” on Adreno, where performance drops sharply. Clearly, Qualcomm has not implemented fast-path INT8 hardware, even though INT8 operations can be executed on INT32 units with an 8-bit mask applied to the results. TeraScale 2 also lacks INT8 hardware, yet emulates INT8 at roughly half rate. Neither mobile GPU supports FP64.

Bifrost’s execution engine is designed for maximum flexibility, providing consistent performance across a broader range of operations than Qualcomm’s Adreno. Adreno, on the other hand, is clearly tightly optimized for graphics rendering: rasterization needs neither the high precision of fused multiply-accumulate nor low-precision INT8 operations. Qualcomm integrates various accelerators into its Snapdragon chips, which may explain why its GPU does not chase high performance in every scenario. Arm’s licensing business model means it cannot assume that a chip will include other accelerators, and Bifrost’s design reflects this reality.

6. Memory Pipeline: Independent Paths for Textures and Load/Store

The Bifrost memory subsystem includes independent texture and load/store paths, each equipped with dedicated caches. EEs access these memory pipelines through the message network within the shader core. Intel uses similar terminology: EUs send messages to access memory via the internal message exchange network of sub-slices. AMD’s CU/WGP and Nvidia’s SM link execution unit partitions to shared memory pipelines through some interconnect architecture; Arm and Intel’s core internal networks may be more flexible, as they support a variable number of execution unit partitions.

Arm documentation states that both the load/store cache and texture cache of the Mali-G52 are 16KB, but latency tests show that the texture cache is 8KB. These parameters may be customized based on manufacturer needs, and different Bifrost SKUs do indeed have differences: for example, the first-generation Bifrost variant Mali-G71 specifies a 16KB load/store cache + 8KB texture cache in its specifications. Bifrost only supports a maximum of 64KB one-dimensional textures.

OpenCL Memory Latency

The pointer-chasing latency of the texture cache is slightly higher than that of the load/store cache, which is expected. It is worth noting, however, that on some GPUs (such as AMD’s TeraScale) the TMUs can perform indexed addressing faster than the programmable shader execution units can do the same calculation.

GPU Global Memory Latency

Compared to Adreno and Intel Gen 9, Bifrost’s texture cache bandwidth is lower. The OpenCL read_imageui function returns a vector of four 32-bit integers, which can be treated as one sample. Mali-G52 delivers 26.05 bytes per SC cycle through read_imageui, consistent with Arm’s documentation stating that “large SC variants process 2 samples per clock.” Adreno 615 achieves 61.3 bytes per uSPTP cycle (4 samples), which keeps it ahead despite its lower clock. Qualcomm presumably chose to optimize Adreno’s texture pipeline for throughput rather than cache capacity, as its 1KB texture cache is minuscule by any measure.
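
The measured byte rates can be related to the nominal sample rates with a short calculation. The sketch below assumes each read_imageui sample is 16 bytes (four 32-bit integers, as stated above) and simply compares the measured figures with that nominal ceiling; the names are illustrative.

```python
BYTES_PER_SAMPLE = 4 * 4  # read_imageui returns four 32-bit integers


def sampling_efficiency(measured_bytes_per_cycle: float, samples_per_cycle: int) -> float:
    """Measured texture bandwidth as a fraction of the nominal sample rate."""
    nominal_bytes_per_cycle = samples_per_cycle * BYTES_PER_SAMPLE
    return measured_bytes_per_cycle / nominal_bytes_per_cycle


mali = sampling_efficiency(26.05, 2)   # per SC, 2 samples per clock
adreno = sampling_efficiency(61.3, 4)  # per uSPTP, 4 samples per clock

print(f"Mali-G52:   {mali:.1%} of 32 bytes per cycle")
print(f"Adreno 615: {adreno:.1%} of 64 bytes per cycle")
```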

Texture Cache Latency

The low cache bandwidth of mobile GPUs is astonishing—the L1D read bandwidth of the four A73 cores in the Amlogic S922X can reach 120GB/s. The global memory bandwidth for computational applications is also limited, with Adreno 615 and Mali-G52 performing similarly: the Bifrost SC passes 16 bytes per cycle in global memory bandwidth tests, while Adreno 615 uSPTP loads 32 bytes from L2 per cycle.

Global Memory Cache Bandwidth

Bifrost’s L1 may pass 32 bytes per cycle (matching the texture cache), as it can achieve this value when using float4 to load local memory. Although a float4 version of the global memory bandwidth test has not been written, the local memory test results should be indicative:

Local Memory Bandwidth

Bifrost essentially punts on local memory. GPU programming APIs provide a workgroup-local memory space (called local memory in OpenCL and shared memory in Vulkan), which is typically backed by dedicated on-chip storage (such as AMD’s Local Data Share, or a portion of cache reserved for local memory on Nvidia and Intel).

The Mali GPU has not implemented dedicated on-chip shared memory for compute shaders; shared memory is no different from ordinary memory types, both supported by system RAM with load-store caches.

—— Arm Mali GPU Best Practices Development Guide

Bifrost does not provide any special treatment for local memory. OpenCL kernels can allocate a maximum of 32KB of local memory, but access does not guarantee staying on-chip. Worse, each shader core allows only one workgroup to allocate local memory, even if that workgroup does not need the full 32KB.

Global/Local Memory Latency, Shared Virtual Memory

GPUs with on-chip storage supporting local memory (including Adreno) can achieve better latency than Bifrost. Qualcomm discloses that Adreno X1 allocates local memory from GMEM, and its previous generation architecture may do the same. However, Qualcomm may not have a bandwidth advantage, as GMEM access also seems limited to 32 bytes per cycle.

7. L2 Cache and System Hierarchy: Challenges of Chip Integration

Bifrost’s L2 functionality is similar to the L2 caches of modern AMD/Nvidia GPUs: a write-back cache composed of multiple slices to support scalability. Arm expects Bifrost to provide 64-128KB L2 per SC (Midgard had 32-64KB). Amlogic chose the lower limit, so Mali-G52 is equipped with 128KB L2. A hypothetical 32 SC Bifrost GPU could feature 2-4MB L2.

On the Amlogic S922X, texture-side L2 latency is slightly better than on Adreno 615. However, Adreno 615 has better L2 latency for global memory accesses, as they do not need to check L1 first. Although Mali-G52’s 128KB L2 has twice the capacity of Adreno 615’s, its L2 bandwidth is still slightly lower. The Snapdragon 670 does have a 1MB system-level cache, however, which may offset the disadvantage of the GPU’s small L2.

Global Memory Bandwidth

Bifrost performs well in latency when using atomic compare-and-swap operations to pass data across threads. While it is faster than Adreno 615 when using global memory, Qualcomm provides lower latency for atomic operations in local memory.

Cross-Thread Atomic Swap Latency

GPUs often handle atomic operations with dedicated ALUs close to L2 or to the backing storage for local memory. Bifrost’s INT32 atomic addition throughput is underwhelming next to Intel’s, and far behind that of contemporary AMD/Nvidia discrete GPUs.

Atomic Addition Throughput

L2 miss requests are transmitted to the DRAM controller via the on-chip network. Amlogic chose a 32-bit DDR4-2640 interface for the S922X, with a theoretical bandwidth of 10.56GB/s. The chip’s system-level architecture ultimately depends on the manufacturer rather than Arm, which will affect Bifrost’s system-level feature support.
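
The theoretical bandwidth figure comes straight from the bus width and transfer rate. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def dram_bandwidth_gb_s(bus_width_bits: int, transfers_mt_s: float) -> float:
    """Theoretical DRAM bandwidth in GB/s from bus width and mega-transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_mt_s / 1000


s922x = dram_bandwidth_gb_s(32, 2640)  # 32-bit DDR4-2640
print(f"{s922x:.2f} GB/s")  # 10.56 GB/s
```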

To support its compute ambitions for Bifrost, Arm designed the L2 to accept incoming snoop requests. Combined with a compatible on-chip interconnect and CPU complex, Bifrost can support OpenCL shared virtual memory with fine-grained buffer sharing, letting the CPU and GPU share data without explicit copies or map/unmap operations. Amlogic’s implementation evidently does not support this, as the Mali-G52 only supports coarse-grained buffer sharing. Worse, it appears to copy the entire buffer under the hood on map/unmap operations.

Local/GPU Transfer Latency, Shared Virtual Memory

Qualcomm, on the other hand, has full control over chip design: Adreno 615 supports zero-copy features, and its on-chip network possesses all the characteristics to implement this functionality.

Although modern GPUs support zero-copy data sharing with CPUs, copy performance still matters: copying is the basic way to send data to the GPU and retrieve results, and it lets a buffer be released and reused once its data has been copied to the GPU. The copy bandwidth between the host and GPU on the Amlogic S922X is just above 2GB/s (each copy requires one read and one write, so this is equivalent to about 4GB/s of memory bandwidth). That matches the global memory bandwidth tests mentioned earlier and indicates that the DMA unit has no advantage over the shader array in accessing DRAM bandwidth.
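
The accounting behind the 4GB/s figure, and how it compares with the theoretical DRAM bandwidth quoted earlier, can be sketched as follows. The 2.0GB/s copy rate is an approximation of the "just above 2GB/s" result, and the names are illustrative.

```python
THEORETICAL_DRAM_GB_S = 10.56  # 32-bit DDR4-2640, as quoted earlier in the article


def copy_dram_traffic_gb_s(copy_rate_gb_s: float) -> float:
    """DRAM traffic generated by a copy: every byte is read once and written once."""
    return 2 * copy_rate_gb_s


def dram_utilization(copy_rate_gb_s: float) -> float:
    """Fraction of theoretical DRAM bandwidth consumed by the copy."""
    return copy_dram_traffic_gb_s(copy_rate_gb_s) / THEORETICAL_DRAM_GB_S


approx_copy_rate = 2.0  # approximation of the measured "just above 2GB/s"
print(f"{copy_dram_traffic_gb_s(approx_copy_rate):.1f} GB/s of DRAM traffic")
print(f"{dram_utilization(approx_copy_rate):.0%} of theoretical bandwidth")
```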

Local/GPU Copy Bandwidth

Adreno 615 has better copy performance, likely due to its faster LPDDR4X DRAM interface. However, copying data back from the GPU is extremely slow. Given that games primarily transfer data to the GPU, this again suggests that Adreno is tightly optimized for graphics rendering.

8. Rasterization and Tiled Rendering: Bandwidth Optimization Tools for Mobile GPUs

Adreno and Mali also face common issues: mobile devices cannot afford high-bandwidth DRAM interfaces comparable to desktop CPUs, let alone GPUs equipped with wide GDDR. However, graphics rasterization is a bandwidth-sensitive task, and ROP is a major source of bandwidth pressure. When pixel/fragment shaders output colors to ROP, ROP must ensure that results are written in the correct order, which may involve depth testing or alpha blending operations specified by the application.

(In AMD’s TeraScale architecture, the FIFO between vertex shaders and rasterizers is referred to as parameter caches and position buffers. Immediate mode rasterizers can certainly pass VS outputs through L2 or even DRAM. Additionally, TeraScale’s ROP is not an L2 client, as its L2 is a read-only texture cache. The way GPUs handle the rasterization pipeline varies widely across architectures.)

Both Adreno 6xx and Bifrost employ tiled rendering techniques to reduce ROP-end DRAM traffic: dividing the screen into rectangular tiles and rendering them one by one. The tile size is carefully designed to fit on-chip buffers (used to temporarily store intermediate results for alpha blending and depth testing), and only writes back to DRAM after the tile rendering is complete. Tiled rendering requires constructing a visible triangle list for each tile after the vertex shader completes coordinate calculations, and then the GPU reads back these lists when rasterizing tile by tile. Processing triangle lists generates DRAM traffic, which, if not handled properly, may negate the benefits of tiled rendering.
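
The binning step described above can be sketched in a few lines. This is a deliberately simplified model, not how either GPU's tiler is implemented: it bins triangles by their bounding boxes (which is conservative and may list a triangle in tiles it does not actually touch), uses integer pixel coordinates, and all names are illustrative.

```python
from collections import defaultdict


def bin_triangles(triangles, width, height, tile=16):
    """Build per-tile triangle lists: {(tile_x, tile_y): [triangle indices]}."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Tile range covered by the bounding box, clamped to the screen.
        tx0 = max(min(xs) // tile, 0)
        tx1 = min(max(xs) // tile, tiles_x - 1)
        ty0 = max(min(ys) // tile, 0)
        ty1 = min(max(ys) // tile, tiles_y - 1)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


# Two triangles on a 64x32 screen, given as screen-space vertex positions.
triangles = [
    ((2, 2), (10, 3), (5, 12)),       # fits inside one tile
    ((10, 10), (40, 12), (20, 30)),   # spans several tiles
]
bins = bin_triangles(triangles, width=64, height=32)

# Second pass: rasterize tile by tile, reading back each tile's list.
for tile_id in sorted(bins):
    print(tile_id, bins[tile_id])
```

The second triangle appears in six lists, which illustrates the point about triangle-list traffic: every tile a triangle spans adds another reference that must be written and later read back.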

Bifrost continues Midgard’s hierarchical tiling strategy: the standard tile is 16×16 pixels, but Arm can use larger power-of-two tile sizes to better fit triangles, reducing the number of triangles that span multiple tiles (and thus duplicate references across triangle lists). Compared to Midgard, Arm restructured the tiler’s memory layout to support finer-grained allocation without a minimum buffer size. Finally, Bifrost can cull micro-triangles that have no effect on pixel output during the tiler stage, avoiding unnecessary pixel/fragment shader work. Together, these optimizations reduce bandwidth usage and memory footprint between the vertex and pixel shader stages. Arm also optimized bandwidth during the tile write-back stage: “transaction elimination” compares the CRC checksum of each tile in the current frame with that of the corresponding tile in the previous frame and skips the write-back when they match, trading a small amount of logic for reduced memory bus usage.
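
The idea behind transaction elimination can be shown with a small software model. The sketch below uses zlib.crc32 as a stand-in checksum, since the text does not specify which CRC the hardware uses, and the TileWriter class and its fields are hypothetical. Like any checksum comparison, it would wrongly skip a write if two different tiles happened to share a CRC.

```python
import zlib


class TileWriter:
    """Skips writing a tile back when its checksum matches the previous frame's."""

    def __init__(self):
        self.previous_crc = {}  # tile id -> CRC of the tile last written
        self.framebuffer = {}   # tile id -> pixel bytes (stands in for DRAM)
        self.writes = 0
        self.skipped = 0

    def write_back(self, tile_id, pixels: bytes) -> bool:
        crc = zlib.crc32(pixels)
        if self.previous_crc.get(tile_id) == crc:
            self.skipped += 1   # unchanged tile: no DRAM write needed
            return False
        self.framebuffer[tile_id] = pixels
        self.previous_crc[tile_id] = crc
        self.writes += 1
        return True


TILE_BYTES = 16 * 16 * 4  # one 16x16 tile of 32-bit pixels
frame1 = {(0, 0): bytes(TILE_BYTES), (1, 0): bytes(TILE_BYTES)}
frame2 = {(0, 0): bytes(TILE_BYTES), (1, 0): b"\x01" + bytes(TILE_BYTES - 1)}

writer = TileWriter()
for frame in (frame1, frame2):
    for tile_id, pixels in frame.items():
        writer.write_back(tile_id, pixels)

print(writer.writes, writer.skipped)  # 3 1
```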

Tile-based GPU architectures are designed to minimize external memory access during rendering. Mali divides the screen into 16×16 tiles, performs fragment shading on each tile, and writes the final results to memory, splitting each render pass into two independent processing stages:

  • Execute all geometry processing and generate tile lists
  • Execute all fragment processing tile by tile

Since Bifrost uses 256-bit tile storage per pixel, the tile memory capacity is at least 8KB. Arm further suggests that tile memory is attached to each SC, so Mali-G52 may share 16KB of tile memory across two SCs. Adreno 615 also employs tiled rendering, using 512KB of tile memory (referred to as GMEM) to store intermediate tile states.
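
The 8KB lower bound is the product of the tile dimensions and the per-pixel storage. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def tile_memory_bytes(tile_width: int, tile_height: int, bits_per_pixel: int) -> int:
    """On-chip storage needed to hold one tile's worth of per-pixel state."""
    return tile_width * tile_height * bits_per_pixel // 8


per_sc = tile_memory_bytes(16, 16, 256)
print(per_sc)      # 8192 bytes, i.e. 8KB per shader core
print(2 * per_sc)  # 16384 bytes across the Mali-G52's two shader cores
```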

9. Computational Performance: FluidX3D Real-World Application Testing

FluidX3D is a GPGPU computing application that simulates fluid behavior. Its FP16S/FP16C mode stores data in 16-bit FP format to reduce DRAM bandwidth requirements, but to maintain precision, computations are still performed on FP32 values (requiring additional instructions for 16/32-bit format conversion).

Both Adreno 615 and Mali-G52 are more limited by computational capability than bandwidth, so the FP16 format does not help improve performance. FluidX3D defaults to using FMA operations, which is a fatal blow to Adreno 615, which lacks fast-path FMA hardware. Replacing FMA with multiply-accumulate operations improves Qualcomm’s performance somewhat. However, Adreno 615’s final performance remains unsatisfactory—despite having higher theoretical FP32 throughput and memory bandwidth, its actual performance lags behind that of Mali-G52.

10. Power Management: Fine-Grained Power Control Strategies

The Mali-G52 is divided into four power domains. The always-on “GL” domain is probably responsible for listening for wake commands. The “CG” domain powers on when the GPU needs to handle 3D or compute work, and shader cores are then enabled as needed. Each SC sits in its own power domain, allowing the driver to power up only part of the Mali shader array for lightweight tasks.

Energy savings can also be achieved by adjusting clock frequencies: Amlogic S922X seems to generate the Mali-G52 clock based on a 2GHz “FCLK” through different frequency division ratios.

Clock Frequency    Division Setting
800 MHz            2 GHz × 2/5
666 MHz            2 GHz × 1/3
500 MHz            2 GHz × 1/4
400 MHz            2 GHz × 1/5
285 MHz            2 GHz × 1/7
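
The listed frequencies can be reproduced from the 2GHz source clock and the division ratios. The sketch below truncates to whole megahertz, which is an assumption chosen to match the values in the table; the names are illustrative.

```python
from fractions import Fraction

FCLK_MHZ = 2000  # the 2GHz "FCLK" source clock

RATIOS = [Fraction(2, 5), Fraction(1, 3), Fraction(1, 4), Fraction(1, 5), Fraction(1, 7)]


def gpu_clock_mhz(ratio: Fraction) -> int:
    """GPU clock for a given ratio, truncated to whole megahertz."""
    return (FCLK_MHZ * ratio.numerator) // ratio.denominator


for ratio in RATIOS:
    print(f"2 GHz x {ratio} = {gpu_clock_mhz(ratio)} MHz")
```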

11. Conclusion: Market Value and Technical Features of the Bifrost Architecture

Arm’s business model relies on its IP modules being attractive to a wide range of manufacturers. Bifrost’s highly parameterized design and very small building blocks fit this model well, making it well suited to hitting specific power, area, and performance targets in the low-power integrated GPU market.

Hypothetical Diagram of a 32 Shader Core Bifrost GPU

While Qualcomm’s Adreno targets the same devices, it uses larger building blocks. This strategy is better suited to scaling the architecture up and aligns with Qualcomm’s ambitions to enter the laptop market. However, large building blocks make it difficult to construct very small GPUs: Qualcomm seems to have exhausted every tuning option to shrink Adreno 6xx, and still had to drop the clock to 430MHz to meet the Snapdragon 670’s design goals.

Diagram of Large Implementation of Adreno 6xx Architecture, Adreno 690 (ROP count still uncertain, indicated by red blocks)

In addition to scaling flexibility, Bifrost broadens Arm’s market by pursuing consistent performance across workloads, which is both an advantage over Qualcomm’s Adreno and an improvement over the earlier Midgard. Bifrost’s execution units avoid “glass jaw” performance characteristics, and its GPU-to-CPU copy bandwidth is also better than Adreno’s. Adreno is not sold as a standalone IP block and focuses more on graphics rasterization. Qualcomm may expect common mobile compute tasks to be offloaded to other blocks (such as the Hexagon DSP).

Overall, Bifrost is an excellent template for optimizing a low-power GPU design to cover a wide range of application scenarios. It echoes the evolution from TeraScale to GCN, yet sits somewhere between the two. Seeing TeraScale features such as clause-based execution, software-controlled operand collectors, and temporary registers reappear in a 2016 GPU architecture is fascinating. Evidently, the techniques that let TeraScale reach teraflops of FP32 throughput on a 40nm node remain effective for meeting stringent power targets on newer process nodes. Since Bifrost, Arm has continued to modernize its GPU architecture with a dual focus on graphics rasterization and general-purpose compute. Arm is a unique and interesting GPU designer, and its future direction is worth watching.

This article is a translation and reconstruction of Arm’s Bifrost Architecture and the Mali-G52[1]

Reference Links

[1] Arm’s Bifrost Architecture and the Mali-G52: https://old.chipsandcheese.com/2025/05/09/arms-bifrost-architecture-and-the-mali-g52/
