Understanding the Digital Backend Implementation Report of ARM Cortex-A75 CPU

The following image is the performance report of the Cortex-A75 CPU released by ARM. Today, the editor from Wuai IC Community will walk you through this report from the perspective of digital backend design implementation.

The first line gives the RTL integration configuration. The design uses a 64KB Level 1 data cache (DCache), a 64KB Level 1 instruction cache (ICache), and a 256KB Level 2 cache, and integrates the Crypto encryption module, NEON, and the floating-point unit (FP). Generally, the Level 1 cache is placed in the CPU core, while the Level 2 cache is placed in the Non-CPU module. The Non-CPU module also contains a very large block known as the SCU (Snoop Control Unit).

These configurations directly affect the final chip implementation area size. While meeting design requirements, the sizes of Level 1 and Level 2 caches should be minimized as much as possible, which can significantly reduce the chip area. The data in this image is a simple digital backend implementation data report provided by ARM to promote their processors, so in practical applications below 16nm technology, the sizes of L1 and L2 may be much larger than the values listed in the image.

The second line shows the frequency after digital backend implementation: 3.1GHz @TT/1.00v/85c/typical. In other words, this CPU can run at up to 3.1GHz at the TT process corner, 85°C, and 1.0V (one specific PVT condition). Why choose the TT85c corner? Because the reported frequency is higher than at a slow corner, which is good for marketing and promotion! In actual projects, although we also check setup at TT85c, that corner is usually used only for power analysis and is not a corner requiring timing signoff.

We know that devices at different points on a wafer cannot all have the same carrier drift speed, and their characteristics also vary with voltage and temperature. Classifying these variations gives us PVT (Process, Voltage, Temperature). The following image is an excerpt from the static timing analysis (STA) red book; part (a) of the original contains errors, which I have marked in red.

The third line gives the power consumption of the A75 core. The first part is static leakage power, 70mW. The second part is dynamic power, 270mW/GHz. Personally, I think this way of reporting power is not rigorous enough: a power figure should specify the working scenario and the pattern (workload) used.
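
As a rough illustration of how the two figures combine, here is a minimal sketch that estimates total core power at the reported 3.1GHz. It assumes that dynamic power scales linearly with frequency at a fixed voltage and that both figures apply to the same corner and workload; the report does not state either, so treat the result as a back-of-the-envelope number only.

```python
# Figures quoted in the report (per core).
LEAKAGE_MW = 70.0            # static leakage power, mW
DYNAMIC_MW_PER_GHZ = 270.0   # dynamic power coefficient, mW/GHz
FREQ_GHZ = 3.1               # reported frequency at TT/1.00v/85c


def total_power_mw(freq_ghz, leakage_mw=LEAKAGE_MW,
                   dyn_mw_per_ghz=DYNAMIC_MW_PER_GHZ):
    """Leakage plus frequency-proportional dynamic power, in mW.

    Assumption (not from the report): dynamic power is linear in
    frequency and leakage does not change with frequency.
    """
    return leakage_mw + dyn_mw_per_ghz * freq_ghz


dynamic_mw = DYNAMIC_MW_PER_GHZ * FREQ_GHZ
total_mw = total_power_mw(FREQ_GHZ)
print(f"dynamic: {dynamic_mw:.0f} mW, total: {total_mw:.0f} mW")
```

Under these assumptions the dynamic part is about 837mW and the total about 907mW, so leakage is well under a tenth of the total at this frequency.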

Usually, when analyzing CPU dynamic power consumption, we adopt two common patterns, one is Dhrystone, and the other is Max Power. Dhrystone is a relatively old benchmarking program, mainly doing string copying, which can be easily optimized by software compilers and hardware. However, it has the advantage that the program is small, and the data volume is also small, allowing it to run only in Level 1 cache, so Level 2 cache and other circuits only have leakage.

Max Power is a maximum power mode, where most logic is kept in a high toggle state during simulation to obtain the highest operating power. This mode is obviously overly pessimistic because, in practical application scenarios, it is difficult to encounter such situations. Generally, this mode can be used to evaluate the power consumption of the chip under adverse conditions.

The fourth line gives the area of the A75 core: 1.729mm². Note that when discussing chip area in processes of 28nm and above, it is essential to specify whether the figure is pre-shrink or post-shrink. Area directly affects the chip’s cost and therefore its market competitiveness.
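
To show why the pre-shrink/post-shrink distinction matters, here is a small sketch. The 0.9 linear shrink factor is a hypothetical value chosen for illustration; the report does not say whether 1.729mm² is a pre-shrink or post-shrink number, nor which shrink factor (if any) applies.

```python
REPORTED_AREA_MM2 = 1.729   # core area quoted in the report
LINEAR_SHRINK = 0.9         # hypothetical linear shrink factor, NOT from the report


def post_shrink_area(area_mm2, linear_shrink):
    """A linear shrink applies to both dimensions, so area scales with its square."""
    return area_mm2 * linear_shrink ** 2


shrunk_mm2 = post_shrink_area(REPORTED_AREA_MM2, LINEAR_SHRINK)
print(f"pre-shrink: {REPORTED_AREA_MM2:.3f} mm^2, post-shrink: {shrunk_mm2:.3f} mm^2")
```

With a 0.9 linear shrink the area drops by 19%, which is large enough to change the outcome of any area comparison if the two sides quote different conventions.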

The fifth line indicates that the process used for this A75 design is TSMC 16nm. TSMC 16nm is further divided into several process variants, such as FFLL+, FFC, etc. Choosing a suitable process variant for your company’s products is crucial, as it involves trade-offs in chip performance, power consumption, area, and other aspects.

The sixth and seventh lines indicate that this design uses the TSMC 16nm process with the 11m_2xa1xd3xe2y2r_utrdl metal stack (11 metal layers in total) and a 9-track library for digital backend implementation. The foundry describes every metal stack in detail in its design manuals, and when adopting a new process node, digital backend engineers must carefully review the relevant documents, such as design manuals, design guidelines, or implementation guidelines.
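
The metal stack name can be sanity-checked against the stated layer count. The sketch below assumes that the middle field lists count/layer-type pairs for every layer above M1, so M1 is added separately; this reading is an assumption that happens to match the "11m" prefix, and the foundry design manual remains the authority. The function name `count_metal_layers` is made up for this example.

```python
import re


def count_metal_layers(stack):
    """Split a stack name like '11m_2xa1xd3xe2y2r_utrdl' and count its layers.

    Assumption: the middle field holds <count><layer-type> pairs for the
    layers above M1, so the total is 1 (for M1) plus the sum of the counts.
    Returns (declared_total, counted_total, pairs).
    """
    declared_field, layer_field, _suffix = stack.lower().split("_", 2)
    pairs = re.findall(r"(\d+)([a-z]+)", layer_field)
    upper_layers = sum(int(count) for count, _kind in pairs)
    return int(declared_field.rstrip("m")), 1 + upper_layers, pairs


declared, counted, pairs = count_metal_layers("11m_2xa1xd3xe2y2r_utrdl")
print(declared, counted, pairs)
```

Here the pairs are 2 xa, 1 xd, 3 xe, 2 y, and 2 r, which sum to 10 layers above M1 and agree with the declared total of 11.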

The eighth line gives the routing layer range set during digital backend implementation (P&R): M2 to M9. So what are the top two metal layers used for? Obviously, they are reserved for the power grid. And why can’t M1 be used for routing? Do you know the reason?

The ninth line lists the libraries of different threshold voltages (VT) used for design implementation.

The tenth line indicates that the design uses c16 ULVT (Ultra-Low Threshold Voltage) cells, where c16 denotes a channel length of 16. Note that while ULVT cells improve timing and speed, they also leak significantly more, so they should be used cautiously. This A75 core implementation also uses ARM’s POP (Processor Optimization Pack) library. The so-called POP library refers to a set of high-speed libraries customized by ARM to meet the high-performance needs of some customers.

The eleventh and twelfth lines indicate that cells of three different VTs (ULVT, LVT, and SVT) and three different channel lengths were used to trade off timing against power consumption.

The last line indicates that this design uses a single voltage and separate power supply.

Additionally, based on my years of experience dealing with ARM, the official data provided is not necessarily reliable; all data can only serve as a reference rather than a comparison standard. Here are a few points to note.

  • Design margins

For example, during PT (PrimeTime) signoff, what are the uncertainty and derate settings for setup and hold? These values directly affect the chip’s key metrics, such as area, performance, and power consumption.

  • Whether the design data has been fixed for hold

Often, the data ARM releases has only reached the P&R stage, with no hold fixing performed. Can such data be used for comparison?

  • Whether the design is routable

If the data released by ARM comes from a design that still has many physical DRC violations, especially shorts, the numbers may look very nice, but achieving them in an actual, clean physical implementation may be very difficult.

  • Physical conditions of the design implementation

ARM generally chooses the most comfortable and easiest shapes (boundaries) for design evaluation when hardening modules. Especially for modules like CPUs, differences in shape can lead to significant performance differences. However, for ARM, they can only provide a standardized reference with a reasonably regular shape. I think this is reasonable.
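
To make the design-margin point in the list above concrete, here is a deliberately simplified sketch of how signoff margins eat into the achievable frequency. The 30ps clock uncertainty and 5% late derate are hypothetical values, not ARM's settings, and the model ignores setup time, clock skew, and everything else a real STA run accounts for.

```python
def max_freq_ghz(datapath_ps, uncertainty_ps=0.0, late_derate=1.0):
    """Highest clock frequency at which a single path still meets setup.

    Simplified model: setup is met when
        datapath_ps * late_derate + uncertainty_ps <= period_ps
    """
    min_period_ps = datapath_ps * late_derate + uncertainty_ps
    return 1000.0 / min_period_ps


# A datapath that exactly fills one 3.1GHz cycle when no margin is applied.
datapath_ps = 1000.0 / 3.1

no_margin_ghz = max_freq_ghz(datapath_ps)
# Hypothetical margins: 30ps uncertainty and a 5% late derate.
with_margin_ghz = max_freq_ghz(datapath_ps, uncertainty_ps=30.0, late_derate=1.05)

print(f"no margin:   {no_margin_ghz:.2f} GHz")
print(f"with margin: {with_margin_ghz:.2f} GHz")
```

Even these modest hypothetical margins cost several hundred MHz on the same netlist, which is why a headline frequency is hard to compare without knowing the margins behind it.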

For digital backend engineers, every time they implement a module, they need to keep thorough records of various data, organize and summarize them, and save them in table form. I believe that the implementation data reports you create will definitely be more detailed and rigorous than the one presented today, so that your boss can understand the implementation situation at a glance.

Finally, the editor from Wuai IC Community wants to convey a point: in the face of data, digital backend engineers need to maintain a skeptical attitude and continuously practice.

Introduction of the editor’s knowledge community:

Here, the following things have been planned and are being worked on:

  • Writing for ICC/ICC2 lab

  • Backend implementation process based on ARM CPU (already published)

  • Design implementation of high-performance modules using CCD (Concurrent Clock Data) in ICC (already published)

  • Hierarchical flow implementation tutorial based on ARM quad-core CPU (in preparation)

  • Clock tree structure analysis

  • Low power design implementation

  • Regular assignments in the community (the community now supports assignment functions)

Here, you can ask questions about the content of the public account articles or the difficulties encountered in actual projects. The editor will respond within 24 hours (you can also express your views on a certain knowledge point in digital backend design implementation, the difficulties encountered in projects, confusion, or career development planning, etc.).

In short, it is a scaled-down version of a forum that enhances everyone’s interaction. More importantly, WeChat has an entrance for the knowledge community mini-program. The community QR code is as follows; you can scan or long-press to identify the QR code to enter. Currently, there are 54 members in the community; thanks to these 54 friends for their support! All fans are welcome to join! The ultimate goal is a million annual salary for all members of this knowledge community. (The threshold for the community will become higher and higher, so friends in need should get on board early.)

Alright, that’s all for today’s writing. Original content is not easy; if you like it, you can help share and appreciate. Your shares and appreciations are my motivation to keep updating articles. The editor thanks you in advance! Meanwhile, Wuai IC Community (52-ic.com) has officially launched. Wuai IC Community (52-ic.com) is a professional community for exchanging and sharing digital IC design and implementation technology and experience. If you encounter technical problems in learning and work, feel free to leave a message on the public account or add the following contact methods for questions and communication.
