Armv9 Technology Lecture: Accelerating Video Decoding and Image Processing with Armv9 CPU and SVE2

This article is reprinted from the Extreme Technology Community.
Extreme Technology Column: Arm Technology Blog
Author: Poulomi Dasgupta, Senior Manager of Consumer Computing Market, Arm Terminal Division

Source: Armv9 Technology Lecture | Utilizing Armv9 CPU and SVE2 to Accelerate Video Decoding and Image Processing

With each new product generation, Arm CPUs deliver generation-over-generation performance gains and add architectural enhancements to meet the evolving demands of computing workloads. This article uses three use cases to show the real-world impact of Armv9 CPU architectural features: HDR video decoding (10% faster), image processing (20% faster), and LibYUV, a library used by major mobile applications (26% faster).

The good news is that some of the Arm SVE2 optimizations discussed in this article are now available for developers, promising to enhance user experience in popular media applications and further improve how people communicate, work, and entertain themselves.

Challenges Faced by Application Developers and OEMs

First, from the perspective of mobile application developers, there are currently over 2 million Android applications in the market competing for user attention. To remain competitive, these applications must quickly bring innovations to various mobile devices. Relying on fixed-function hardware poses challenges regarding time-to-market and portability.

A great user experience depends on metrics such as application launch time, UI smoothness, tokens per second, and frames per second (FPS) stability, all of which must meet user expectations. OEMs therefore need to balance performance improvements against broader user demands such as longer battery life, lower data usage, and device cost. Falling short in either area can leave users dissatisfied and undermine the value of upgrading to a new device.

Optimizing software for Armv9 CPUs can help address the challenges facing both OEMs and developers.

Real-World Use Cases of SVE2 in Armv9 CPU

Let’s look at three case studies that show how software optimization can accelerate real workloads. First, here are the new vector instructions in Armv9 CPUs, all part of SVE2, that accelerate key workloads on mobile devices:

  • 16-bit dot product and 8-bit matrix multiplication, accelerating HDR video playback and video conferencing.

  • Histogram instructions for image processing.

  • Gather loads and scatter stores for processing interleaved camera sensor data.

  • Complex-number instructions for accelerating fast Fourier transforms in video codecs.

With these vector instructions, optimized software needs fewer CPU cycles, which brings two major benefits. First, fewer CPU cycles mean lower energy consumption and therefore longer battery life; second, application performance improves.
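To make the histogram bullet above concrete, here is a minimal scalar sketch in Python of the operation such instructions target. It is not SVE2 code and is not taken from any Arm library; the function name luma_histogram is made up for illustration. The point is that each pixel updates a bin chosen by its own value, so several pixels in the same vector may hit the same bin, which is what makes a plain SIMD loop awkward.

```python
def luma_histogram(pixels, bins=256):
    """Count how often each pixel value occurs (scalar reference only)."""
    hist = [0] * bins
    for p in pixels:
        if not 0 <= p < bins:
            raise ValueError(f"pixel value {p} outside 0..{bins - 1}")
        # Data-dependent update: the bin index comes from the pixel itself,
        # so two pixels with the same value touch the same counter.
        hist[p] += 1
    return hist


if __name__ == "__main__":
    row = [0, 0, 17, 255, 17, 0]  # hypothetical 8-bit luma samples
    h = luma_histogram(row)
    print(h[0], h[17], h[255])  # prints: 3 2 1
```

A vectorized version has to resolve those repeated bin indices within one vector before updating memory, which is the kind of work dedicated histogram instructions are meant to take off the programmer's hands.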

Case 1: SVE2 Increases Video Decoding Speed by 10%

Watching multimedia content is one of the most common workloads on mobile devices and a major source of mobile network traffic. Therefore, manufacturers continuously strive for more efficient codecs that support excellent image quality while saving network bandwidth.

HDR technology presents more realistic details due to higher color accuracy, even in very dark or very bright scenes. It uses 10 bits instead of 8 bits to represent each color channel. AV1, VP9, and other modern codecs support HDR video.
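The arithmetic behind the move from 8 to 10 bits can be sketched in a few lines of Python. This is an illustration under stated assumptions, not decoder code: the helper names and the filter taps [1, 3, 3, 1] are invented for the example. It shows that a 10-bit sample no longer fits in an 8-bit lane, and that a dot product over such samples (the shape of a typical interpolation filter) needs wider lanes still, which is one plausible reason 16-bit dot products matter for HDR content.

```python
def levels(bits):
    """Number of distinct values a channel with the given bit depth can hold."""
    return 1 << bits


def fits_unsigned(value, bits):
    """True if value is representable as an unsigned integer of that width."""
    return 0 <= value < (1 << bits)


def fir_tap_sum(samples, taps):
    """Dot product of samples with filter taps (scalar reference only)."""
    if len(samples) != len(taps):
        raise ValueError("samples and taps must have the same length")
    return sum(s * t for s, t in zip(samples, taps))


if __name__ == "__main__":
    print(levels(8), levels(10))      # prints: 256 1024
    brightest = levels(10) - 1        # 1023, the largest 10-bit sample
    print(fits_unsigned(brightest, 8), fits_unsigned(brightest, 16))  # prints: False True
    # Hypothetical 4-tap filter applied to four maximum-value samples.
    print(fir_tap_sum([brightest] * 4, [1, 3, 3, 1]))  # prints: 8184
```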

AV1 is a newer format that offers better compression, while VP9 has broader compatibility across various browsers and devices. Some popular applications use AV1 and VP9 formats to play videos.

SVE2 optimizations improve HDR video decoding speed by approximately 10%, VP9 decoding speed by 8%, and AV1 decoding speed by 10%. This cuts CPU cycles by about 10% and lowers power consumption accordingly, so users get longer battery life when playing on-demand video on mobile devices. Whether the content is a short clip or a long video, playback becomes smoother.

Optimized code for libdav1d (AV1 decoder) and libvpx (VP9 decoder) has been uploaded and is now available for developers.

Case 2: SVE2 Increases LibYUV Speed by 26%

Most of us use LibYUV every day without knowing it.

LibYUV is an open-source library used for color space conversions between RGB and YUV, scaling camera sensor data, and applying camera filters and rotations. It processes data from the camera sensor before it is consumed by the video encoder. In many cases, the output of the video decoder is also processed by LibYUV before being sent to the display.
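As a rough illustration of what a YUV-to-RGB color space conversion involves, here is a scalar Python sketch for a single pixel. It assumes 8-bit limited-range BT.601 video and uses the commonly quoted floating-point coefficients; it is not LibYUV's API or implementation (the function names are made up, and production libraries typically use fixed-point integer arithmetic over whole rows of pixels).

```python
def clamp8(x):
    """Round to the nearest integer and clamp to the 8-bit range 0..255."""
    return max(0, min(255, int(round(x))))


def yuv_to_rgb_bt601(y, u, v):
    """Convert one limited-range BT.601 YUV pixel to 8-bit RGB (reference only)."""
    c = y - 16    # luma is offset by 16 in limited range
    d = u - 128   # chroma is centered on 128
    e = v - 128
    r = 1.164 * c + 1.596 * e
    g = 1.164 * c - 0.392 * d - 0.813 * e
    b = 1.164 * c + 2.017 * d
    return clamp8(r), clamp8(g), clamp8(b)


if __name__ == "__main__":
    print(yuv_to_rgb_bt601(16, 128, 128))   # prints: (0, 0, 0)
    print(yuv_to_rgb_bt601(235, 128, 128))  # prints: (255, 255, 255)
```

Every pixel goes through the same multiply-add pattern independently of its neighbors, which is why kernels of this kind lend themselves to vector instructions.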

SVE2 optimization increases the speed of LibYUV by 26% (geometric mean across multiple cores on Armv9 CPU). Approximately 100 kernels in LibYUV have been optimized with SVE2, and optimization work for other kernels is ongoing. Some work has been uploaded and can be viewed at https://chromium.googlesource.com/libyuv/libyuv/.

LibYUV is distributed as part of Chromium, the open-source browser project that underpins Chrome and custom browsers from major mobile manufacturers (including Xiaomi Browser and Samsung Browser). It is also integrated into AOSP and Android Jetpack. Given this critical role on mobile devices, a faster LibYUV is expected to improve the overall mobile experience: better video conferencing, smoother transitions between portrait and landscape modes, and better video playback, all while significantly extending battery life.

Case 3: SVE2 Increases Computational Photography Speed by 20%

Halide is a language designed specifically for image processing. It is used in applications such as Adobe Photoshop, and some OEMs also use it in their camera pipelines.

SVE2 instructions (such as gather loads and scatter stores) and TBL (table lookup, for vectorizing small lookup tables) accelerate some key computer vision pipelines in Halide. Computationally intensive algorithms such as iToFDepth (for depth perception), bilateral grid (for edge-aware tone mapping), and local Laplacian (for filters) run nearly 20% faster with SVE2.
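The two access patterns mentioned above, table lookup and gather, can be described with a short scalar sketch in Python. This is neither Halide nor SVE2 code; the function names and the 16-entry table are hypothetical and only show what each pattern computes: one output element per index, read from a position chosen by the data.

```python
def apply_lut(pixels, lut):
    """Table lookup: replace each value by lut[value] (scalar reference only)."""
    return [lut[p] for p in pixels]


def gather(src, indices):
    """Gather: read src at arbitrary, possibly repeated, positions."""
    return [src[i] for i in indices]


if __name__ == "__main__":
    # Hypothetical small table that expands 4-bit values to the 8-bit range.
    expand_4_to_8 = [i * 17 for i in range(16)]
    print(apply_lut([0, 8, 15], expand_4_to_8))       # prints: [0, 136, 255]
    print(gather([10, 20, 30, 40], [3, 0, 0, 2]))     # prints: [40, 10, 10, 30]
```

In a vectorized pipeline, a small table like this can stay in registers, while a gather reads non-contiguous memory in a single operation instead of one scalar load per element.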

Using SVE2 to optimize software allows for real-time application of certain photographic effects, opening new possibilities for entry-level mobile devices, enabling users to achieve higher-quality photos without dedicated hardware.

Arm has optimized the Halide backend for SVE2 code generation. The good news is that some patches have already gone live, while others are in development.

Image: Comparison of CPU cycles between Halide-SVE2 and Halide-Neon

Image: Example image with depth effect

Image: Example image with edge-aware tone mapping

How to Better Utilize SVE2?

SVE2 introduces several new instructions that are particularly suitable for accelerating critical real workloads and applications. We will discuss in more detail how to utilize Armv9 CPU to achieve performance improvements in subsequent technical articles, so stay tuned to the “Arm Community” WeChat public account!

Arm is committed to finding a good balance for the ecosystem, better supporting developers and enhancing performance. Some open-source libraries and kernels optimized for SVE2 are now live, and more resources will be available in the future.

The latest advancements in Armv9 CPU will enable developers to innovate faster, bringing better user experiences to end consumers across various mobile devices. What are you waiting for? Start your development project with SVE2 to achieve innovation!

Copyright belongs to the original author. If there is any infringement, please contact for deletion.
