SIMD Acceleration of H.265 Decoding with WebAssembly FFmpeg

1. What is WebAssembly

WebAssembly is a bytecode format, a concept that is not new, similar to JVM and .NET bytecode. What makes WebAssembly special?

Browsers support running WebAssembly programs
C, C++, and Rust programs can be directly compiled into WebAssembly bytecode

It can be said that WebAssembly bridges the gap between the vast front-end web ecosystem and the highest-performance programming languages (assembly aside). WebAssembly has two advantages over JS: performance, and the ability to reuse existing C, C++, and Rust code.

Beyond browsers, WebAssembly has also expanded into other areas as a sandbox mechanism. The WebAssembly System Interface (WASI) is a set of interface standards, and runtimes such as wasmtime and wasmer implement it, playing a role similar to the JVM. Below is the complete process from C code to running WebAssembly with wasmtime:

1. Hello FFmpeg Program

#include <stdio.h>
int main(void)
{
    printf("Hello FFmpeg\n");
    return 0;
}

2. The wasi SDK compiles the C program into WebAssembly

$ wasi-sdk-24.0-x86_64-linux/bin/clang hello.c -o hello.wasm

3. wasmtime executes WebAssembly

$ wasmtime hello.wasm
Hello FFmpeg

2. WebAssembly SIMD Acceleration Scheme

Purely computational C/C++ programs can generally be compiled directly into WebAssembly. CPU-intensive tasks often use assembly for acceleration, and to improve computational performance WebAssembly defines its own SIMD instructions; see https://www.w3.org/TR/wasm-core-2/#vector-instructions. The simd128 instruction set of WebAssembly processes 128 bits (16 bytes) of data at once. There is also a relaxed SIMD instruction set, currently in the proposal stage.
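As a rough illustration of what "16 bytes at once" means in C, here is a minimal sketch. It assumes Clang's `wasm_simd128.h` header is available when targeting WebAssembly with `-msimd128`; the function name `add_bytes` is made up for this example. On any other target the scalar loop is used, so the file still compiles and behaves the same.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __wasm_simd128__
#include <wasm_simd128.h>
#endif

/* Hypothetical helper: element-wise wrapping add of two byte arrays. */
void add_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#ifdef __wasm_simd128__
    /* One v128 value holds 16 bytes, so each iteration handles 16 elements. */
    for (; i + 16 <= n; i += 16) {
        v128_t va = wasm_v128_load(a + i);
        v128_t vb = wasm_v128_load(b + i);
        wasm_v128_store(dst + i, wasm_i8x16_add(va, vb));
    }
#endif
    /* Scalar tail, and the whole job when simd128 is not enabled. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

The scalar tail handles lengths that are not a multiple of 16, which is the usual shape of such loops.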

Emscripten is the pioneer of WebAssembly, and it lists five ways to use WebAssembly SIMD: https://emscripten.org/docs/porting/simd.html. They boil down to three:

1. Let the compiler perform automatic vectorization
2. Rewrite using WebAssembly SIMD intrinsics or GCC/Clang SIMD vector extensions
3. Let the compiler automatically translate existing X86 or ARM intrinsics into WebAssembly

In terms of convenience, option 1 > option 3 > option 2. In terms of performance, option 2 > option 3 > option 1.
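To make the difference between the first two options concrete, here is a small sketch using the GCC/Clang vector extensions. The function names and the `v4i32` typedef are invented for illustration. The first function is an ordinary loop that the compiler may or may not vectorize; the second spells out the 128-bit vectors explicitly, and Clang can lower them to simd128 when targeting WebAssembly with `-msimd128`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* GCC/Clang vector extension: four 32-bit lanes in 16 bytes. */
typedef int32_t v4i32 __attribute__((vector_size(16)));

/* Option 1: plain C, left to the compiler's auto-vectorizer. */
void scale_add_scalar(int32_t *dst, const int32_t *src, int32_t k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i] * k;
}

/* Option 2: the same computation written with explicit vectors. */
void scale_add_vec(int32_t *dst, const int32_t *src, int32_t k, size_t n)
{
    const v4i32 vk = {k, k, k, k};
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        v4i32 d, s;
        memcpy(&d, dst + i, sizeof d);   /* unaligned-safe load */
        memcpy(&s, src + i, sizeof s);
        d = d + s * vk;                  /* four multiply-adds per step */
        memcpy(dst + i, &d, sizeof d);
    }
    for (; i < n; i++)                   /* scalar tail */
        dst[i] += src[i] * k;
}
```

Both functions compute the same result; the second only removes the dependence on what the auto-vectorizer decides to do.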

Some may not know what SIMD intrinsics are. Intrinsics are intrinsic functions, and SIMD intrinsics are a set of high-level language wrappers for SIMD instruction sets. For example, the API for ARM NEON intrinsics can be found at https://github.com/gcc-mirror/gcc/blob/master/gcc/config/arm/arm_neon.h. You do not need to learn assembly language; calling the C API in arm_neon.h lets the compiler generate NEON instructions for SIMD acceleration.
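A minimal sketch of what calling NEON intrinsics from C looks like is shown here. The function name `add_sat_8x16` and the `HAVE_NEON_INTRINSICS` macro are made up for this example, and the check on `__ARM_NEON` is an assumption about how the toolchain advertises NEON support. When NEON is not available, a scalar fallback produces the same result.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON_INTRINSICS 1
#else
#define HAVE_NEON_INTRINSICS 0
#endif

/* Hypothetical helper: saturating add of eight 16-bit values. */
void add_sat_8x16(int16_t *dst, const int16_t *a, const int16_t *b)
{
#if HAVE_NEON_INTRINSICS
    /* Each intrinsic maps closely to one NEON instruction; the
     * compiler still chooses registers and schedules the code. */
    int16x8_t va = vld1q_s16(a);
    int16x8_t vb = vld1q_s16(b);
    vst1q_s16(dst, vqaddq_s16(va, vb));
#else
    for (int i = 0; i < 8; i++) {
        int32_t s = (int32_t)a[i] + b[i];
        dst[i] = (int16_t)(s > INT16_MAX ? INT16_MAX
                         : s < INT16_MIN ? INT16_MIN : s);
    }
#endif
}
```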

The alternative to SIMD intrinsics is handwritten assembly, which programs machine instructions directly, i.e., traditional assembly language programming. Handwritten assembly requires manual control of register usage, manual instruction reordering, and so on. Compared to SIMD intrinsics, handwritten assembly can achieve higher performance.

Note that the methods listed by Emscripten for using WebAssembly SIMD do not include handwritten assembly. WebAssembly is a bytecode format, not real machine instructions. It has no registers, or rather, infinite registers. The equivalent of handwritten assembly is the text format of WebAssembly. Handwritten assembly can be optimized for specific CPU architectures, while the CPU running WebAssembly is unknown, and various optimization methods in handwritten assembly cannot be applied in the WebAssembly context. Therefore, I believe the correct use of WebAssembly SIMD is through SIMD intrinsics. Developers, compilers, and WebAssembly runtimes (V8, wasmtime, etc.) are in a cooperative relationship, not a competitive one. Some optimization work is delegated to the compiler, and some to the runtime’s JIT, etc.

3. Application Scenarios of WebAssembly Version FFmpeg

The web ecosystem is powerful, but its audio and video processing capabilities are poor. There are some high-level APIs, such as HTMLMediaElement, MSE, and WebRTC, but they lack support for low-level control. WebCodecs is now available, yet it is still some distance from full coverage and complete functionality.

Since the advent of WebAssembly, FFmpeg has found a place in web-based audio and video processing. The WebAssembly version of FFmpeg can be used for:

Container format encapsulation and decapsulation
Audio encoding/decoding, filtering
Video encoding/decoding, filtering

The first two are not in doubt. Container format processing is a strong suit of FFmpeg, and the computational power required for encapsulation and decapsulation is small (compared to decoding and image processing). The data volume for audio encoding/decoding is not large, and the WebAssembly version of FFmpeg can handle it without pressure, but there is still significant room for optimization to further enhance performance.

The biggest question lies in video encoding/decoding. The WebAssembly version of FFmpeg can perform video encoding/decoding functionally, but the performance is far from sufficient. Why does FFmpeg, known for its speed, perform so poorly on WebAssembly? Where is the performance loss?

The overhead of executing WebAssembly bytecode compared with running C compiled to native machine code. This part relies on the compiler and the WebAssembly runtime for optimization.
Video encoding/decoding heavily relies on multithreaded parallel processing, and the WebAssembly environment may not support multithreaded processing, losing multithreaded parallel acceleration. There are already some solutions for WebAssembly multithreading support that can recover this performance loss.
All existing assembly code in FFmpeg is unusable. As mentioned earlier, LLVM can translate ARM and X86 SIMD intrinsics into WebAssembly SIMD. However, FFmpeg uses handwritten assembly (except for the Loongson architecture) for better performance, and the handwritten assembly for the various CPU architectures cannot be directly translated into WebAssembly SIMD. To fill this gap, the assembly acceleration has to be rewritten using WebAssembly SIMD.

The greatest potential demand for video encoding/decoding on the web is likely H.265 decoding. The new version of Chrome supports H.265 hardware decoding, but there are version coverage issues and hardware decoding may not be available. For client apps, both software decoding and hardware decoding of H.265 are highly feasible, and either can be used; combining the two can cover over 95% of devices (unless they are over a decade old). On the web, however, such solutions are lacking, and sites rely heavily on falling back to H.264.

4. SIMD Acceleration of H.265 Decoding with WebAssembly FFmpeg

Based on the previous discussion, the key to improving FFmpeg video decoding performance lies in rewriting FFmpeg’s assembly acceleration using WebAssembly SIMD. The technical direction is clear, but no one in the FFmpeg community has undertaken this work. I speculate the reasons are:

FFmpeg video decoding is not the only solution on the web, nor can it be said to be the best solution. If the browser itself does a good job of H.265 decoding, FFmpeg becomes a backup role.
The workload of rewriting assembly acceleration is substantial, even if it is just rewriting the assembly acceleration for H.265.
There is little overlap between FFmpeg developers and web developers.
No company sponsorship.

I have been waiting for three years without any movement, including discussions about WebAssembly. I do not do web development, but I often see web developers using FFmpeg. Rather than continue waiting, it is better to poke the community and see the response.

I have written a WebAssembly SIMD version of the H.265 IDCT (Inverse Discrete Cosine Transform; strictly speaking, the H.265 IDCT is an approximation of the inverse discrete cosine transform). The patch can be found at https://ffmpeg.org/pipermail/ffmpeg-devel/2024-November/336009.html. There are four sizes for the H.265 IDCT: 4×4, 8×8, 16×16, and 32×32. The acceleration effects of WebAssembly SIMD are as follows:

hevc_idct_4x4_8_c:                                      20.4 ( 1.00x)
hevc_idct_4x4_8_simd128:                                14.1 ( 1.44x)
hevc_idct_4x4_10_c:                                     17.9 ( 1.00x)
hevc_idct_4x4_10_simd128:                               14.1 ( 1.27x)
hevc_idct_8x8_8_c:                                     232.3 ( 1.00x)
hevc_idct_8x8_8_simd128:                                56.1 ( 4.14x)
hevc_idct_8x8_10_c:                                    222.1 ( 1.00x)
hevc_idct_8x8_10_simd128:                               60.9 ( 3.65x)
hevc_idct_16x16_8_c:                                  1619.1 ( 1.00x)
hevc_idct_16x16_8_simd128:                             384.6 ( 4.21x)
hevc_idct_16x16_10_c:                                 1543.1 ( 1.00x)
hevc_idct_16x16_10_simd128:                            391.4 ( 3.94x)
hevc_idct_32x32_8_c:                                 18518.3 ( 1.00x)
hevc_idct_32x32_8_simd128:                            2143.1 ( 8.64x)
hevc_idct_32x32_10_c:                                17633.3 ( 1.00x)
hevc_idct_32x32_10_simd128:                           2139.1 ( 8.24x)

The versions with the simd128 suffix are my SIMD-accelerated implementations, while the versions with the c suffix are the pure C logic implementations. Note that the c versions include compiler automatic vectorization processing. Here, we compare the effects of handwritten vectorization and compiler automatic vectorization. It can be seen that 8×8 and 16×16 achieve approximately 4 times acceleration, while 32×32 achieves 8 times acceleration. The 4×4 has a low acceleration ratio due to the small amount of data processed at once.
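To give a feel for what a handwritten vectorization of this kind looks like, here is a sketch of the smallest case, the 4×4 inverse transform, using the standard H.265 4-point constants (64, 83, 36). It is not the code from the patch: it uses GCC/Clang vector extensions instead of WebAssembly intrinsics, all function names are invented, and bit-exactness with FFmpeg is not claimed. The scalar version runs one 4-point butterfly at a time; the vector version runs the same butterfly on four columns (then four rows) at once.

```c
#include <stdint.h>

/* GCC/Clang vector extension: four 32-bit lanes in one 128-bit value. */
typedef int32_t v4i32 __attribute__((vector_size(16)));

static int16_t clip_int16(int32_t v)
{
    return (int16_t)(v < INT16_MIN ? INT16_MIN : v > INT16_MAX ? INT16_MAX : v);
}

/* Scalar 4-point inverse transform with rounding shift; works in place.
 * Assumes >> on negative values is an arithmetic shift. */
static void tr4_scalar(int16_t *p, int step, int shift)
{
    const int add = 1 << (shift - 1);
    const int e0 = 64 * p[0 * step] + 64 * p[2 * step];
    const int e1 = 64 * p[0 * step] - 64 * p[2 * step];
    const int o0 = 83 * p[1 * step] + 36 * p[3 * step];
    const int o1 = 36 * p[1 * step] - 83 * p[3 * step];

    p[0 * step] = clip_int16((e0 + o0 + add) >> shift);
    p[1 * step] = clip_int16((e1 + o1 + add) >> shift);
    p[2 * step] = clip_int16((e1 - o1 + add) >> shift);
    p[3 * step] = clip_int16((e0 - o0 + add) >> shift);
}

void idct4x4_scalar(int16_t *coeffs, int bit_depth)
{
    for (int i = 0; i < 4; i++)
        tr4_scalar(coeffs + i, 4, 7);                  /* columns */
    for (int i = 0; i < 4; i++)
        tr4_scalar(coeffs + 4 * i, 1, 20 - bit_depth); /* rows */
}

/* Same butterfly on four lanes: r[k] holds input k of four transforms. */
static void tr4_vec(v4i32 r[4], int shift)
{
    const int32_t a = 1 << (shift - 1);
    const v4i32 add = {a, a, a, a};
    const v4i32 sh  = {shift, shift, shift, shift};
    const v4i32 c64 = {64, 64, 64, 64};
    const v4i32 c83 = {83, 83, 83, 83};
    const v4i32 c36 = {36, 36, 36, 36};
    const v4i32 e0 = c64 * (r[0] + r[2]);
    const v4i32 e1 = c64 * (r[0] - r[2]);
    const v4i32 o0 = c83 * r[1] + c36 * r[3];
    const v4i32 o1 = c36 * r[1] - c83 * r[3];

    r[0] = (e0 + o0 + add) >> sh;
    r[1] = (e1 + o1 + add) >> sh;
    r[2] = (e1 - o1 + add) >> sh;
    r[3] = (e0 - o0 + add) >> sh;
    for (int k = 0; k < 4; k++)          /* clip every lane to int16 */
        for (int lane = 0; lane < 4; lane++)
            r[k][lane] = clip_int16(r[k][lane]);
}

/* Element-wise transpose; real SIMD code would use shuffles instead. */
static void transpose4(v4i32 r[4])
{
    for (int i = 0; i < 4; i++)
        for (int j = i + 1; j < 4; j++) {
            int32_t t = r[i][j];
            r[i][j] = r[j][i];
            r[j][i] = t;
        }
}

void idct4x4_vec(int16_t *coeffs, int bit_depth)
{
    v4i32 r[4];

    for (int i = 0; i < 4; i++)          /* r[i] = row i, one lane per column */
        for (int j = 0; j < 4; j++)
            r[i][j] = coeffs[4 * i + j];
    tr4_vec(r, 7);                       /* all four columns in one go */
    transpose4(r);
    tr4_vec(r, 20 - bit_depth);          /* all four rows in one go */
    transpose4(r);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            coeffs[4 * i + j] = (int16_t)r[i][j];
}
```

With only four values per row, a 4×4 block gives the vector unit little to do beyond the transposes, which is consistent with the low speedup measured for that size; the larger transforms amortize that overhead over more data.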

Let us look at the impact of optimizing just one function on decoding speed.

[Table: Decoding Speed/FPS for three configurations: Disable SIMD, Compiler Auto Vectorization, Handwritten Optimized IDCT]

The test is single-threaded decoding of a 1080P video on a Linux machine with an Intel 12700 CPU. There are three configurations:

Completely disable WebAssembly SIMD, the generated bytecode does not contain WebAssembly SIMD
Only enable compiler automatic vectorization
Enable compiler automatic vectorization, plus the handwritten implementation of the WebAssembly SIMD version IDCT

It can be seen that compiler automatic vectorization improved the decoding speed by 59%. The effect is significant, but on the other hand, the compiler can only achieve this level. The handwritten optimized IDCT further improved the decoding speed by 11%, noting that this only optimized one function.
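Assuming the 11% gain is measured relative to the auto-vectorized build (the text suggests this but does not state it outright), the two improvements multiply, giving the overall speedup of the third configuration over the build with SIMD disabled:

```latex
\[
\frac{\mathrm{FPS}_{\text{auto-vectorization + IDCT}}}{\mathrm{FPS}_{\text{SIMD disabled}}}
\approx 1.59 \times 1.11 \approx 1.76
\]
```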

For comparison, here is the speed of a natively compiled FFmpeg. Without handwritten assembly:

$ ./ffmpeg -hide_banner \
    -cpuflags 0 \
    -threads 1 \
    -i basketball-v265.mp4 \
    -an -f null - \
    -benchmark
frame=  500 fps=104 q=-0.0 Lsize=N/A time=00:00:20.00 bitrate=N/A speed=4.14x
bench: utime=4.865s stime=0.035s rtime=4.830s

Speed with handwritten assembly enabled:

$ ./ffmpeg -hide_banner \
    -threads 1 \
    -i basketball-v265.mp4 \
    -an -f null - \
    -benchmark
frame=  500 fps=213 q=-0.0 Lsize=N/A time=00:00:20.00 bitrate=N/A speed=8.52x
bench: utime=2.380s stime=0.027s rtime=2.349s

On a MacBook Pro with an M1 chip the data is similar: 118 FPS without handwritten assembly optimization, and 212 FPS with it. Note that in both the Linux Intel environment and the macOS M1 environment I compiled with clang, which enables auto-vectorization automatically. When compiling with gcc, FFmpeg disables vectorization because of bugs in gcc's vectorization implementation, so the single-threaded decoding speed without handwritten assembly is only 58 FPS, half the speed of the clang build.

Different encoding configurations also yield different decoding speeds. The video I used for testing has a considerable decoding complexity. I estimate that the fully optimized WebAssembly SIMD version of FFmpeg H.265 1080P decoding speed can reach 140 FPS. For simpler encoding configurations, the decoding speed can be even higher.

5. Outlook

From the test results, the effect of compiler automatic vectorization is significant, and the effect of handwritten acceleration is even better. Achieving WebAssembly single-threaded decoding of 1080P at 140 FPS on the Intel 12700 and the Apple M1 should not be a problem.

The remaining question is whether this effort is worth the investment. Is FFmpeg video decoding on the web merely a backup role? From another perspective, is optimizing the performance of WebAssembly FFmpeg audio processing more valuable? I can decide what to do, but I am uncertain about the value after completion. I welcome discussions on this topic from web audio and video developers.
