RISC-V Performance Counters and Timer Extensions – Zicntr and Zihpm

Introduction:

RISC-V (pronounced “risk-five”) is a new instruction set architecture (ISA) originally designed to support research and education in computer architecture. However, in recent years, it has evolved into a standardized, free, and open architecture for industrial implementations.

One significant advantage of the RISC-V instruction set is its scalability, allowing users to select appropriate extension instruction sets based on specific application needs.

There are numerous RISC-V extensions, which will be introduced one by one. This article focuses on the Zicntr and Zihpm extensions. (Note: The main content is based on the official spec version 20240411)

Background: The RISC-V ISA provides a set of up to thirty-two 64-bit performance counters and timers, which are accessible through unprivileged, XLEN-bit, read-only CSRs 0xC00 to 0xC1F (when XLEN=32, the upper 32 bits are accessed through CSRs 0xC80 to 0xC9F). These counters are divided between the “Zicntr” and “Zihpm” extensions.
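
The address layout described above is regular enough to capture in a few lines. The following is a minimal sketch in C, not something taken from the specification: the helper names counter_csr and counter_csr_high are hypothetical, while the two base addresses come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Base addresses as given in the text. */
#define CSR_COUNTER_BASE   0xC00u  /* low XLEN bits of counters 0..31   */
#define CSR_COUNTER_BASE_H 0xC80u  /* bits 63..32, only when XLEN=32    */

/* Hypothetical helper: CSR address of counter n (0..31). */
uint16_t counter_csr(unsigned n)
{
    assert(n < 32);
    return (uint16_t)(CSR_COUNTER_BASE + n);
}

/* Hypothetical helper: CSR address of the upper half of counter n on RV32. */
uint16_t counter_csr_high(unsigned n)
{
    assert(n < 32);
    return (uint16_t)(CSR_COUNTER_BASE_H + n);
}
```

With this numbering, index 0 is cycle (0xC00), index 1 is time (0xC01), index 2 is instret (0xC02), and indices 3 to 31 are the hpmcounter CSRs.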

“Zicntr” Extension: Basic Counters and Timers

The Zicntr standard extension comprises the first three of these counters (CYCLE, TIME, and INSTRET), which count executed clock cycles, elapsed wall-clock time, and retired instructions, respectively. Zicntr depends on the Zicsr extension, because these registers are read with CSR access instructions.

The specification recommends that implementations provide these basic counters, as they are essential for basic performance analysis, for adaptive and dynamic optimization, and for letting applications work with real-time streams. The additional counters in the Zihpm extension help diagnose performance problems and should be accessible from user-level application code with low overhead.

Additionally, certain execution environments may prohibit access to counters, for example, to prevent timing side-channel attacks.

The RISC-V Instruction Set Manual Volume I

The “Zicntr” extension specifies six pseudo-instructions: RDCYCLE, RDTIME, and RDINSTRET, plus their upper-half counterparts RDCYCLEH, RDTIMEH, and RDINSTRETH. For base ISAs with XLEN ≥ 64, CSR instructions can access the full 64-bit CSRs directly. In particular, the RDCYCLE, RDTIME, and RDINSTRET pseudo-instructions read the full 64 bits of the cycle, time, and instret counters.

The counter pseudo-instructions map to the canonical read-only form csrrs rd, counter, x0, but the other read-only CSR instruction forms (based on CSRRC/CSRRSI/CSRRCI) are also legal ways to read these CSRs.
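
To make the mapping concrete, here is a small sketch that builds the 32-bit machine word for csrrs rd, csr, x0. The function name encode_csr_read is hypothetical; the field layout is the standard I-type layout used by CSR instructions in the SYSTEM opcode.

```c
#include <assert.h>
#include <stdint.h>

/* Encode "csrrs rd, csr, x0", the canonical read-only CSR access.
 * Layout: csr[31:20] | rs1[19:15] | funct3[14:12] | rd[11:7] | opcode[6:0] */
uint32_t encode_csr_read(unsigned rd, unsigned csr)
{
    assert(rd < 32 && csr < 0x1000);
    const uint32_t opcode_system = 0x73; /* SYSTEM major opcode             */
    const uint32_t funct3_csrrs  = 0x2;  /* CSRRS                           */
    const uint32_t rs1_x0        = 0;    /* rs1 = x0, so no CSR bits are set */
    return ((uint32_t)csr << 20) | (rs1_x0 << 15) | (funct3_csrrs << 12)
         | ((uint32_t)rd << 7) | opcode_system;
}
```

For example, encode_csr_read(2, 0xC00) yields 0xC0002173, the word an assembler emits for rdcycle x2.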

For base ISAs with XLEN=32, the Zicntr extension allows the three 64-bit read-only counters to be accessed in 32-bit pieces. The RDCYCLE, RDTIME, and RDINSTRET pseudo-instructions provide the lower 32 bits, and the RDCYCLEH, RDTIMEH, and RDINSTRETH pseudo-instructions provide the upper 32 bits of the respective counters.

The ISA requires the counters to be 64 bits wide even when XLEN=32; otherwise, it would be very difficult for software to determine whether a value has overflowed. In a low-end implementation, the upper 32 bits of each counter can be kept in a software counter that is incremented by a trap handler triggered when the lower 32 bits overflow. The example code later in this article shows how the full 64-bit value can be read safely using the individual 32-bit pseudo-instructions.

Next, we will introduce these pseudo-instructions one by one.

RDCYCLE[H]

The RDCYCLE pseudo-instruction reads the low XLEN bits of the cycle CSR, which holds a count of the clock cycles executed by the processor core since an arbitrary starting time. RDCYCLEH exists only when XLEN=32 and reads bits 63 to 32 of the same cycle counter. The underlying 64-bit counter should never overflow in practice. The rate at which the cycle counter advances depends on the implementation and the operating environment. The execution environment should provide a means of determining the current rate (cycles/second) at which the cycle counter is incrementing.

RDCYCLE is intended to return the number of cycles executed by the processor core, not by the hardware thread. Precisely defining a “core” is difficult because some implementation choices blur the boundary (e.g., AMD Bulldozer). Precisely defining a “clock cycle” is equally difficult given the range of implementations (including software emulation). The intent, however, is that RDCYCLE be used for performance monitoring along with the other performance counters. In particular, when there is one hardware thread per core, the cycle count divided by the number of instructions retired measures the CPI of that hardware thread.
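
As an illustration of the CPI measurement mentioned above, the sketch below divides the cycle delta by the retired-instruction delta between two samples. The struct and function names are hypothetical, and the samples are assumed to have been taken by the caller (for instance with RDCYCLE and RDINSTRET) on a core running a single hardware thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair of counter values captured at one point in time. */
struct counter_sample {
    uint64_t cycle;
    uint64_t instret;
};

/* CPI over the interval between two samples. Unsigned subtraction keeps
 * the deltas correct even if a counter wrapped once in between.
 * Returns 0.0 when no instruction retired in the interval. */
double cycles_per_instruction(struct counter_sample begin,
                              struct counter_sample end)
{
    uint64_t cycles = end.cycle - begin.cycle;
    uint64_t instrs = end.instret - begin.instret;
    if (instrs == 0)
        return 0.0;
    return (double)cycles / (double)instrs;
}
```

With more than one hardware thread per core the result is no longer a per-thread CPI, since the cycle counter is per core while instret is per hardware thread.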

When there are multiple hardware threads per core and dynamic multithreading, it is generally not possible to separate out the cycles of each hardware thread (especially with SMT). A separate performance counter could be defined to capture the number of cycles a particular hardware thread was running, but its definition would have to be very fuzzy to cover all possible threading implementations. Given the complexity of defining a per-thread cycle count, and the need in any case for a total per-core cycle count when tuning multithreaded code, the specification standardizes only the per-core cycle counter, which also works well for the common case of a single hardware thread per core.

Standardizing what happens during “sleep” is not practical, because the meaning of “sleep” is not standardized across execution environments. However, if the entire core is paused (fully clock-gated or powered down in deep sleep), it is not executing clock cycles, and per the specification the cycle count should not increase.

Although there is no precise definition applicable to all platforms, this remains a useful feature for most platforms. An imprecise, common, and generally correct standard is better than no standard at all. The design intent of RDCYCLE is for performance monitoring and tuning, and the official specification is centered around this.

RDTIME[H]

The RDTIME pseudo-instruction reads the low XLEN bits of the “time” CSR, which counts the wall-clock real time that has passed since an arbitrary starting time. RDTIMEH exists only when XLEN=32 and reads bits 63 to 32 of the same real-time counter. The underlying 64-bit counter increments by one with each tick of the real-time clock and, for realistic real-time clock frequencies, should never overflow in practice. The execution environment should provide a means of determining the period of a counter tick (seconds/tick). This period should be constant to within a small error bound. The environment should also provide a means of determining the accuracy of the clock (i.e., the maximum relative error between the nominal and actual real-time clock periods).
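
The two claims above (a known tick period, and no overflow in practice) can be checked with simple arithmetic. The sketch below is only illustrative: tick_hz is a hypothetical parameter standing for whatever tick rate the execution environment reports, and both function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick delta to nanoseconds for a clock ticking at tick_hz.
 * Splitting into whole seconds plus a remainder keeps the intermediate
 * products inside 64 bits for tick rates up to 10 GHz and for spans
 * shorter than about 584 years. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t tick_hz)
{
    assert(tick_hz > 0 && tick_hz <= 10000000000ull);
    uint64_t secs = ticks / tick_hz;
    uint64_t rem  = ticks % tick_hz;
    return secs * 1000000000ull + (rem * 1000000000ull) / tick_hz;
}

/* Whole 365-day years before a 64-bit counter at tick_hz wraps around. */
uint64_t years_until_wrap(uint64_t tick_hz)
{
    assert(tick_hz > 0);
    const uint64_t secs_per_year = 365ull * 24 * 60 * 60;
    return (UINT64_MAX / tick_hz) / secs_per_year;
}
```

Even at a tick rate of 1 GHz, years_until_wrap returns 584, which is why the 64-bit counter can be treated as never overflowing.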

On some simple platforms, the cycle count may be an effective implementation of RDTIME, in which case RDTIME and RDCYCLE may return the same results.

Given the wide variety of possible implementation platforms, it is difficult to strictly mandate a clock period. The maximum error bound should be set according to the platform’s requirements.

The real-time clocks of all hardware threads must be synchronized to within one tick of the real-time clock.

As with other architectural mandates, it is sufficient for the hardware threads to appear to be synchronized to within one tick of the real-time clock; that is, software must be unable to observe a greater difference between the real-time clock values seen on two hardware threads.

RDINSTRET[H]

The RDINSTRET pseudo-instruction reads the low XLEN bits of the instret CSR, which counts the number of instructions retired by this hardware thread since an arbitrary starting point. RDINSTRETH exists only when XLEN=32 and reads bits 63 to 32 of the same instruction counter. The underlying 64-bit counter should never overflow in practice.

Instructions that cause synchronous exceptions, including ECALL and EBREAK, are not considered to retire and therefore do not increment the instret CSR.

The following code sequence reads a valid 64-bit cycle counter value into x3:x2, even if the lower half of the counter overflows between the reads of its upper and lower halves.

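    again:
        rdcycleh     x3              # read upper 32 bits
        rdcycle      x2              # read lower 32 bits
        rdcycleh     x4              # read upper 32 bits again
        bne          x3, x4, again   # retry if the upper half changed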

Example code for reading a 64-bit cycle counter when XLEN=32.
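
The same high, low, high pattern can be tried out on any machine with a small C model. This is a sketch under stated assumptions: struct fake_counter, read_low, and read_high are hypothetical stand-ins for the hardware counter and for rdcycle/rdcycleh, and the counter is made to advance on every access so that a rollover can land between two reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware counter; it advances by `step`
 * on every access so that a rollover can occur between two reads. */
struct fake_counter {
    uint64_t value;
    uint64_t step;
};

static uint32_t read_low(struct fake_counter *c)   /* models rdcycle  */
{
    uint32_t v = (uint32_t)c->value;
    c->value += c->step;
    return v;
}

static uint32_t read_high(struct fake_counter *c)  /* models rdcycleh */
{
    uint32_t v = (uint32_t)(c->value >> 32);
    c->value += c->step;
    return v;
}

/* Same structure as the assembly loop: high, low, high, retry on mismatch. */
uint64_t read_counter64(struct fake_counter *c)
{
    uint32_t hi, lo, hi_again;
    do {
        hi       = read_high(c);
        lo       = read_low(c);
        hi_again = read_high(c);
    } while (hi != hi_again);
    return ((uint64_t)hi << 32) | lo;
}
```

Starting the model at 0x00000001FFFFFFFE with a step of 1, the first pass sees the upper half change from 1 to 2 and retries; the second pass returns 0x0000000200000002. Without the retry, the first pass would have combined the stale upper half 1 with the lower half 0xFFFFFFFF read just before the rollover.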

“Zihpm” Extension: Hardware Performance Counters

The Zihpm extension comprises up to 29 additional unprivileged 64-bit hardware performance counters, hpmcounter3 to hpmcounter31. When XLEN=32, the upper 32 bits of these performance counters are accessible through the additional CSRs hpmcounter3h to hpmcounter31h. The Zihpm extension also depends on the Zicsr extension.

In some applications, it is important to be able to read multiple counters at the same instant. In a multitasking environment, a user thread can suffer a context switch while it is reading the counters. One solution is for the user thread to read the real-time counter before and after reading the other counters to determine whether a context switch occurred in the middle of the sequence, in which case the reads can be retried. The specification’s authors considered adding output latches that would let a user thread snapshot the counter values atomically, but this would increase the size of the user context, especially for implementations with a richer set of counters.
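
The retry scheme described above might look like the following sketch. Everything here is an assumption made for illustration: struct fake_hart models the counters and injects one simulated context switch, and the max_ticks threshold is one plausible way to decide that a switch happened, not something the specification prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one hardware thread's counters. Every read advances
 * them slightly; at access number preempt_at a context switch is simulated
 * by letting a large amount of time pass. */
struct fake_hart {
    uint64_t time, cycle, instret;
    unsigned accesses;
    unsigned preempt_at;   /* 0 means "never preempted" */
};

static void advance(struct fake_hart *h)
{
    h->accesses++;
    h->time += 1;
    h->cycle += 10;
    h->instret += 5;
    if (h->accesses == h->preempt_at) {   /* switched out for a while */
        h->time += 1000;
        h->cycle += 10000;
        h->instret += 5000;
    }
}

static uint64_t rd_time(struct fake_hart *h)    { uint64_t v = h->time;    advance(h); return v; }
static uint64_t rd_cycle(struct fake_hart *h)   { uint64_t v = h->cycle;   advance(h); return v; }
static uint64_t rd_instret(struct fake_hart *h) { uint64_t v = h->instret; advance(h); return v; }

struct snapshot { uint64_t cycle, instret; };

/* Reads cycle and instret bracketed by two time reads. If more than
 * max_ticks elapsed in between, a context switch is assumed and the reads
 * are retried. Returns the number of attempts used, or 0 on giving up. */
unsigned snapshot_counters(struct fake_hart *h, uint64_t max_ticks,
                           unsigned max_attempts, struct snapshot *out)
{
    for (unsigned attempt = 1; attempt <= max_attempts; attempt++) {
        uint64_t t0  = rd_time(h);
        out->cycle   = rd_cycle(h);
        out->instret = rd_instret(h);
        uint64_t t1  = rd_time(h);
        if (t1 - t0 <= max_ticks)
            return attempt;
    }
    return 0;
}
```

In the model, a switch injected at the second access makes the first attempt span 1003 ticks, so it is discarded and the second attempt is returned. The cost of this approach is a few extra reads per snapshot, in exchange for not enlarging the user context with latched copies.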

The implemented number and width of these additional counters, and the set of events they count, are platform-specific. Accessing an unimplemented or ill-configured counter may cause an illegal-instruction exception or may return a constant value.

The execution environment should provide a method to determine the number and width of implemented counters, as well as the interface for configuring the events each counter is to count.

For execution environments implemented on RISC-V privileged platforms, the privileged architecture manual describes the privileged CSRs that control access to these counters by lower privilege modes and that select the events to be counted.

Alternative execution environments (e.g., user-level software performance models) may provide alternative mechanisms to configure the events counted by performance counters.

Eventually, it would be useful to standardize event settings for counting ISA-level metrics, such as the number of floating-point instructions executed, and possibly a few common microarchitectural metrics, such as “L1 instruction cache misses.”
