Introduction to the Basic Simulation Environment of SoC
I have already written a post about this on the forum: http://bbs.eetop.cn/thread-442797-1-1.html "How to Build the Basic Testbench for SoC Projects (My Process)". Here I will only mention the important points and the parts that have changed.
Assume the SoC contains a CPU subsystem, a memory controller, a bus topology, PADs, clock and reset logic, and a number of functional modules.
1. The simulation environment includes embedded software (firmware)
This consists of two parts: the initial bootloader (usually stored in ROM or external flash), and the application-level program that is loaded into external volatile storage (e.g., DDR) after boot.
2. Use Instruction Set Simulator (ISS) to replace CPU-IP.
There are both commercial and free/open-source ISS options for ARM processor IP. Since the ISS is not the real CPU anyway, unless you are verifying boot-ROM code I personally suggest finding an open-source one. The ISS can be compiled into a .so library, so you do not have to recompile the entire ISS C code for every simulation (set LD_LIBRARY_PATH so the simulator can find the library at run time; at compile time, specify the library path and library name).
The ISS needs a configuration file that describes the CPU's address map: the program and stack areas managed by the ISS itself (call this memory space), versus the DUT's memory and register space managed by the DUT (call this IO space). The ISS must see the full address map and, based on the address of each access, decide whether it targets memory space or IO space and handle it accordingly.
3. Shared space. When the CPU (ISS) interacts with the testbench, a dedicated address range can be set aside (e.g., 0x3000_0000 ~ 0x3100_0000) that is backed by an array in the testbench. For example, to randomize register configuration: since constrained-random generation is inconvenient in the ISS C program, the testbench component can write constrained-random values into this shared space, and the ISS C code then reads them from the shared space and programs them into the registers. A sketch of the shared-space mechanism is shown below.
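A minimal sketch, assuming a simple word-addressed layout and illustrative names (reg_cfg, ss_read, publish_random_cfg are not from the original post): the testbench owns a "shared space" array based at 0x3000_0000, writes constrained-random values into it, and the bus-slave/ISS hook reads them back for the C code to program into registers.

```systemverilog
class reg_cfg;
  rand bit [31:0] value;
  // example constraint: keep the value in a legal range for the register
  constraint c_legal { value[31:16] == 16'h0; value[3:0] inside {[1:8]}; }
endclass

module shared_space;
  localparam bit [31:0] BASE = 32'h3000_0000;
  bit [31:0] mem [bit [31:0]];              // sparse array indexed by word offset

  // Bus-slave / ISS hook: the CPU side reads the shared space through this
  function bit [31:0] ss_read(bit [31:0] addr);
    return mem.exists((addr - BASE) >> 2) ? mem[(addr - BASE) >> 2] : '0;
  endfunction

  // Testbench side: generate a constrained-random value and publish it
  task publish_random_cfg(bit [31:0] addr);
    reg_cfg cfg = new();
    void'(cfg.randomize());
    mem[(addr - BASE) >> 2] = cfg.value;
  endtask
endmodule
```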
4. Maintenance of define files.
In an SoC project, a single define may need to be used simultaneously in the assembly program, the embedded C program, the C model's C code, and the testbench's SV code. Maintaining several near-identical define files is error-prone, so only one should be maintained by hand and the others generated from it by script.
5. Replacement of the memory controller module (DDRC).
The DDRC is the main bus slave in the system. It usually requires an initialization sequence (some DDRCs do not) and typically needs a DDR-PHY model, both of which can significantly slow down simulation. Moreover, when verifying functional modules, using the real DDRC makes it hard to create scenarios with varying bandwidth. Therefore, consider replacing the DDRC with a bus-slave BFM during verification; a sketch of such a BFM with configurable response latency is shown below.
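A minimal sketch, with illustrative port and parameter names (the handshake style and delay range are assumptions): a bus-slave BFM that stands in for the DDRC and PHY. A sparse associative array models the memory, and a random response delay lets the testbench mimic different effective bandwidths.

```systemverilog
module ddr_slave_bfm #(parameter int unsigned MIN_DELAY = 1,
                       parameter int unsigned MAX_DELAY = 20)
                      (input  logic        clk, rst_n,
                       input  logic        req_vld, req_wr,
                       input  logic [31:0] req_addr, req_wdata,
                       output logic        rsp_vld,
                       output logic [31:0] rsp_rdata);

  bit [31:0] mem [bit [31:0]];              // sparse memory model

  initial begin
    rsp_vld = 1'b0;
    forever begin
      @(posedge clk iff (req_vld && rst_n));
      // random response delay models varying memory bandwidth
      repeat ($urandom_range(MIN_DELAY, MAX_DELAY)) @(posedge clk);
      if (req_wr) mem[req_addr] = req_wdata;
      else        rsp_rdata     = mem.exists(req_addr) ? mem[req_addr] : '0;
      rsp_vld = 1'b1;
      @(posedge clk);
      rsp_vld = 1'b0;
    end
  end
endmodule
```

Narrowing or widening the MIN_DELAY/MAX_DELAY range (or driving it from a configuration file, as suggested later for architecture evaluation) gives the varying-bandwidth scenarios that are hard to create with the real DDRC.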
6. Implementation of printing. The embedded C test cases in the SoC environment have no standard stdio, so printf must be implemented by hand. printf is a function with a variable number of arguments; the implementation relies on the facts that arguments are pushed onto the stack from right to left (so the first argument is closest to the top of the stack) and that the format string is terminated by '\0'. On the C side it is enough to write the address of the first argument (which is never 0) to a designated location in the shared space; on the SV testbench side, once the format string is walked and each '%' specifier is matched with its argument address, the formatting can be done (the SV-side implementation looks very much like printf in C). Note that 1) the C program must make sure the argument address actually reaches the shared space promptly, not stuck in a cache or delayed in transit; 2) when using the ISS, the information to be printed must reside in memory space, and the SV side must be able to see that memory space. When using a multi-core CPU RTL, pay attention to controlling the printing of the different cores (you can allocate a separate shared space per core and branch on the core-id inside the print function). A sketch of the SV side is shown below.
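A minimal sketch of the SV side, assuming illustrative names and that mem_read32/mem_read8 are backdoor-read helpers into the memory the SV side can see: the C side has written the address of printf's first argument (the format-string pointer) into a fixed shared-space slot; the SV side walks the format string, fetches one further argument per '%' specifier from consecutive stack addresses, and prints the result with $display.

```systemverilog
function automatic void tb_printf(bit [31:0] arg0_addr);
  string     msg      = "";
  bit [31:0] arg_addr = arg0_addr;               // address of the format-string argument
  bit [31:0] fmt_addr = mem_read32(arg_addr);    // the format string itself
  byte       ch;

  ch = mem_read8(fmt_addr);
  while (ch != 8'h00) begin                      // '\0' terminates the format string
    if (ch == "%") begin
      fmt_addr++;
      ch        = mem_read8(fmt_addr);           // the conversion character
      arg_addr += 4;                             // next argument on the C stack
      case (ch)
        "d":     msg = {msg, $sformatf("%0d", mem_read32(arg_addr))};
        "x":     msg = {msg, $sformatf("%0h", mem_read32(arg_addr))};
        default: msg = {msg, $sformatf("%c", ch)};  // unhandled specifier: echo it
      endcase
    end
    else begin
      msg = {msg, $sformatf("%c", ch)};
    end
    fmt_addr++;
    ch = mem_read8(fmt_addr);
  end
  $display("[FW-PRINT] %s", msg);
endfunction
```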
Implementations of malloc and other functions can also utilize shared space.
7. Collaborative verification of multiple modules. At the system level it is often necessary to run system-level cases that exercise the whole system working together (such a case is typically also used for power analysis). Building this kind of test case may take considerable effort. If a hardware emulator is available, so much the better; if it can only be done in a pure simulation environment, try to keep the scenario as simple as possible.
8. Single-case flow: compile the DUT and testbench -> compile the firmware -> start simulation (load the firmware and, at the appropriate time, generate the random control data and write it into the shared space). If the firmware has to be loaded into the DDR model and the DDRC initialization performs data-training writes, make sure the preloaded data is not overwritten; a sketch of one way to handle this is shown below.
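A minimal sketch, where the hierarchy names, the init_done signal, and the memory array inside the DDR model are assumptions: backdoor-load the firmware image only after DDRC initialization and data training have finished, so the training writes cannot overwrite it.

```systemverilog
initial begin
  wait (tb_top.u_ddrc.init_done === 1'b1);           // wait for data training to complete
  $readmemh("firmware.hex", tb_top.u_ddr_model.mem); // backdoor load into the DDR model array
  $display("[TB] firmware loaded into DDR model at %0t", $time);
end
```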
A module-level verification environment has the following advantages over a system (or subsystem) level simulation environment:
1) Fast simulation speed
2) Good controllability of randomness
3) Easier to perform Error-Injection
4) Easier to switch modes and toggles
Simulation tools (versions from about 2010 onward) support merging module-level and system-level coverage, which can accelerate coverage convergence.
Note that although the module-level environment can in theory cover every situation that arises in the system-level environment, with limited manpower and time it may not actually do so. For example, the external-sync data-enable signal of a video interface can change in so many ways that random generation may still miss patterns that occur in the real system. In short: the module-level environment may fail to cover situations that actually occur in practice.
Basic Structure of Module-Level Environment
In the basic module-level environment, besides the verification components (monitor, driver, scoreboard, etc.), there is also a bus connecting master and slave, and a CPU model controlling the DUT. Note: the DUT in the module-level environment is a complete module; its internal submodules are not considered separately.
Assume the DUT is a bus master that initiates memory accesses. The CPU model is responsible for configuring registers and for some memory accesses. To simplify the module-level environment, the CPU model accesses the bus slave's space directly rather than going through the bus. A sketch of such a CPU model is shown below.
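A minimal sketch with illustrative interface, hierarchy, and memory names (sel/wr/addr and tb_top.u_bus_slave.mem are assumptions): a simple module-level CPU model whose register accesses go through a bus-master-style interface, while data memory is reached directly (backdoor) in the bus-slave model instead of through the bus.

```systemverilog
module cpu_model (input  logic        clk,
                  output logic        sel, wr,
                  output logic [31:0] addr, wdata,
                  input  logic [31:0] rdata);

  task reg_write(input bit [31:0] a, input bit [31:0] d);
    @(posedge clk); sel <= 1'b1; wr <= 1'b1; addr <= a; wdata <= d;
    @(posedge clk); sel <= 1'b0; wr <= 1'b0;
  endtask

  task reg_read(input bit [31:0] a, output bit [31:0] d);
    @(posedge clk); sel <= 1'b1; wr <= 1'b0; addr <= a;
    @(posedge clk); d = rdata; sel <= 1'b0;
  endtask

  // Backdoor access straight into the bus-slave memory model, bypassing the bus
  function void mem_write(bit [31:0] a, bit [31:0] d);
    tb_top.u_bus_slave.mem[a >> 2] = d;
  endfunction
endmodule
```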
Reuse of Module-Level Environment Testcases at System Level
When personnel and time are limited, make sure the module-level code can be reused at the system level. Reusing testbench components (driver, monitor, model, scoreboard, etc.) is generally straightforward, but reusing test cases at the SoC system level (C programs running on the CPU) is harder. How the test cases have to be written is determined by how the CPU model is implemented in the TB architecture. Below are the two solutions I mainly use, with test cases written in C.
Solution 1:
Implement the CPU model with an ISS. This is essentially the same as the basic SoC simulation environment described above.
Solution 2:
Implement the CPU model with DPI. The CPU model is an SV model that provides tasks for register access and memory address-space access; the tricky part is simulating interrupts. You can use SV's fine-grained process control (process::self()) to implement a main task and an IRQ task, mimicking the behavior of a CPU that suspends the main program when it enters an interrupt and resumes the main program when the interrupt completes (main_task.suspend(); ...; main_task.resume();). DPI is used to pass the low-level hardware-access SV tasks to the C program. A sketch is shown below.
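A minimal sketch with illustrative task and signal names (c_main, c_isr, sv_reg_write are assumptions): the C test case is entered through an imported DPI task and calls exported SV tasks for register access; an interrupt suspends the process running the C main program and resumes it when the ISR returns.

```systemverilog
import "DPI-C" context task c_main();       // C test case entry point
import "DPI-C" context task c_isr();        // C interrupt service routine

module cpu_model_dpi (input logic clk, input logic irq);
  process main_proc;                         // handle to the process running c_main

  export "DPI-C" task sv_reg_write;          // low-level hardware access exported to C
  task sv_reg_write(input int unsigned addr, input int unsigned data);
    // drive the register bus here (omitted in this sketch)
    @(posedge clk);
  endtask

  initial begin : main_thread
    main_proc = process::self();
    c_main();                                // run the C test case
  end

  initial begin : irq_thread
    forever begin
      @(posedge irq);
      main_proc.suspend();                   // CPU enters the interrupt: main program stops
      c_isr();                               // run the C interrupt handler
      main_proc.resume();                    // return from interrupt: main program continues
    end
  end
endmodule
```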
In the DPI solution, pay attention to the low-level hardware-access code, and to any C code that accesses hardware directly (for example, operations through pointers that point at DUT memory); reference code provided by IP vendors is especially likely to contain such code. For example, the software/hardware interaction of a USB or Ethernet protocol stack is maintained through data structures in memory: the software builds a data structure and then operates on it through a pointer to its address. In a DPI environment such code cannot be used as-is (pointer dereferences into DUT memory are not possible and must be replaced by calls to the low-level hardware-access tasks). The module's firmware may already exist, and the simulation should reuse it as far as possible; a simple way out is to abandon the module-level environment and verify in the system-level environment instead, or to use an ISS. Even in the ISS environment the firmware may need changes; for instance, the data structure mentioned above must be placed at a DUT memory address (e.g., instead of assigning a large struct directly, first malloc a buffer and then fill it; as long as malloc returns an address in DUT memory this works).
Other Solutions:
Test cases can also be written directly in SV without C; or the CPU model can be replaced with a real CPU IP, including boot ROM and boot RAM.
Architecture Evaluation
I personally recommend hardware emulators for architecture evaluation. It is best to preserve the frequency ratios (the bus access frequency seen by the memory controller, the frequencies of the components in the bus topology, and the operating frequencies of the memory controller and the memory itself). I feel that emulators such as Palladium and Veloce are well suited to this.
If simulation methods are used for evaluation, pay attention to:
1) The realism of the traffic patterns. Existing module RTL reflects real memory-access behavior and latency tolerance, but the environment is relatively complex to build, simulation is slow, and an initialization sequence is usually required before it works. I personally suggest building BFMs to model the memory-access behavior: a BFM can consume configuration files to model more realistic scenarios.
2) Automated integration of the basic infrastructure. In even a moderately complex SoC architecture, integrating the bus topology can become very complicated (the components are not uniform, so emacs' automatic integration feature is not convenient, and manually matching the widths of special signals is error-prone). Automated integration tools have existed for years, but commercial tools, although feature-rich, may not directly meet the needs of a particular project, so it is worth considering developing your own. With current low-power flows there is little extra code to write in the RTL of the integrated architecture, which simplifies automatic integration.
The architecture-evaluation environment also needs performance monitors on the memory controller and the bus-topology modules, tracking throughput, memory-controller efficiency, the latency of each functional module's memory accesses, and so on. First check that the amount of memory traffic matches expectations; on that basis, check how efficient the memory controller is and which module in the system shows unreasonably high latency (which may mean that module needs a deeper FIFO). By adjusting the priority policies of the memory controller and bus-topology modules, the clock frequencies, or the number of bus-topology components and master/slave ports, different configurations can be simulated and compared. A sketch of a simple performance monitor is shown below.
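A minimal sketch with illustrative signal names (the valid/ready handshake and BYTES_PER_BEAT are assumptions): a bus performance monitor that accumulates transferred beats and request-to-response latency on one port and reports totals and average latency at the end of simulation.

```systemverilog
module perf_mon #(parameter int BYTES_PER_BEAT = 8)
                 (input logic clk, rst_n,
                  input logic req_vld, req_rdy,      // request handshake
                  input logic rsp_vld, rsp_rdy);     // response handshake

  longint unsigned beats, reqs, rsps, lat_sum, cycle;
  longint unsigned start_q [$];                      // issue cycle of each outstanding request

  always @(posedge clk or negedge rst_n) begin
    if (!rst_n) begin
      beats = 0; reqs = 0; rsps = 0; lat_sum = 0; cycle = 0;
      start_q.delete();
    end
    else begin
      cycle++;
      if (req_vld && req_rdy) begin
        reqs++;
        start_q.push_back(cycle);
      end
      if (rsp_vld && rsp_rdy) begin
        rsps++;
        beats++;
        lat_sum += cycle - start_q.pop_front();
      end
    end
  end

  // Report accumulated traffic and average latency at the end of simulation
  final
    $display("[PERF %m] reqs=%0d bytes=%0d avg_latency=%f cycles",
             reqs, beats * BYTES_PER_BEAT,
             rsps ? real'(lat_sum) / real'(rsps) : 0.0);
endmodule
```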
Use of VIP
Generally, VIP is used to verify modules with complex, standard interface protocols for data communication.
The advantages of VIP: no need to develop a BFM yourself, complete TB components, and some documentation comes with fairly complete test plans and examples; randomness and error injection are well covered. The downsides: the code is not visible, so it is easy to get stuck when a problem arises; VIP, as a tool, may have OS and simulator compatibility issues; local support from the EDA vendor may be insufficient; and VIP simulation speed is usually relatively slow.
If capability and resources permit, a BFM can be developed in-house. A relatively quick alternative is to reuse reliable RTL IP as the BFM. For example, to verify a USB host we can use a USB device or OTG RTL IP as the BFM. The advantages are that the BFM code quality is relatively high, debug visibility is good, and it migrates well to a hardware emulator later (some IP can connect digital interfaces directly without a PHY). The downsides are that the amount of supporting code is still considerable, some PHY models may need extra development (for example, on a unidirectional link where the master side converts parallel data to LVDS and the slave side converts LVDS back to parallel, that PHY model may have to be written separately), the RTL usually requires an initialization sequence, and randomness and error injection are not very convenient.
I believe that if the DUT is developed in-house, commercial VIP is more appropriate because the DUT code quality is less proven; if the DUT is purchased IP that has already been proven in silicon, I think a self-developed BFM is sufficient.
Coverage-Driven and Assertion-Based
I personally believe that code coverage is the most important and must be counted and checked carefully.
Functional coverage (including assertion coverage) should reflect the test plan. In my view it only provides statistics on how much of the test plan has been exercised, and the test plan itself may be incomplete, so 100% functional coverage does not necessarily mean verification is sufficient.
Assertions are well suited to verifying the timing of protocols that are not overly complex, although in recent years EDA companies do not seem to have invested much in them. I believe assertions are especially suitable for the internal submodules of a large module, since they can precisely describe and check a submodule's interface, as in the sketch below.
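A minimal sketch with illustrative signal names (a valid/ready style handshake on a submodule boundary is assumed): an interface-level assertion checker that can be bound to the submodule.

```systemverilog
module sub_if_checks (input logic clk, rst_n,
                      input logic valid, ready,
                      input logic [31:0] data);
  // valid must stay high and data must stay stable until the transfer is accepted
  property p_valid_stable;
    @(posedge clk) disable iff (!rst_n)
      (valid && !ready) |=> (valid && $stable(data));
  endproperty
  a_valid_stable: assert property (p_valid_stable);
  c_valid_stable: cover  property (p_valid_stable);
endmodule
```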
Acceleration at the Simulation Level
The fastest acceleration technology is definitely hardware simulation accelerators.
In the "Introduction to the Basic Simulation Environment of SoC" above, replacing the CPU and the DDRC both already serve as accelerations. In addition, here are some other acceleration methods I use:
1) Stub out (dummy) unneeded modules in the system-level simulation environment. This requires two scripts: one to automatically generate the dummy files and one to automatically swap the existing RTL for the dummy files.
2) Shorten the simulated SoC power-up time. Power-up is usually sequenced by a state machine based on several counters; the method is to change the counters' initial values or their terminal values. This is easy to do at RTL, but at gate level the relevant signals are hard to find, so it is best to agree with the flow colleagues in advance to preserve the signal names.
3) Some IPs in the DUT that consume a lot of simulation resources can be replaced with simplified models, e.g., PLL, DCM, and certain PHY models.
4) Some RTL coding styles consume a lot of simulation resources, for example re-initializing a large two-dimensional array on every rising clock edge while reset is asserted. It is best to rewrite such code with `ifdef/`else/`endif so that, in simulation, the initialization becomes an initial block (see the sketch after this list).
5) Reduce overly frequent interactions with DPI.
6) Use of separate-compile and partition-compile techniques should be approached with caution, as they may have negative effects.
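A minimal sketch of item 4), with illustrative array and signal names: guard the per-cycle reset initialization of a large array with an `ifdef so that simulation uses a one-time initial block instead.

```systemverilog
module big_buf (input logic clk, rst_n);
  logic [31:0] buffer [0:1023][0:255];

`ifdef SIM_SPEEDUP
  // Simulation only: initialize once at time 0 instead of on every clock during reset
  initial
    foreach (buffer[i, j]) buffer[i][j] = '0;
`else
  // Original style: re-initializes the whole array on every clock edge while in reset
  always_ff @(posedge clk)
    if (!rst_n)
      foreach (buffer[i, j]) buffer[i][j] <= '0;
`endif
endmodule
```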
Gate-Level Simulation
In functional simulation, there are generally several types of gate-level simulations:
1) Post-synthesis netlist simulation
2) Post-DFT netlist simulation
3) Post-PR netlist simulation with SDF back-annotation
4) FPGA post-synthesis netlist simulation
5) Gtech netlist simulation
Post-synthesis and post-DFT netlists are quite similar; usually running the post-DFT netlist is enough. Apart from the post-P&R netlist, which needs SDF back-annotation, the others are run as zero-delay gate-level simulations. A few additional steps are generally needed:
1) Add a clk-to-Q delay to the sequential cells (DFFs, etc.) in the standard-cell library models.
2) Ensure that SRAM model outputs do not drive X when the SRAM is not being accessed.
3) In the gate-level netlist, DFFs without a reset pin should generally be identified and handled with $deposit (one way is to ask the flow colleagues for the SDF corresponding to the netlist and parse it to obtain the DFF instances). A sketch is shown after this list.
4) For SDF back-annotated gate-level simulations with timing checks enabled, be careful to remove the timing checks on clock-domain-crossing logic such as 2-DFF synchronizers.
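A minimal sketch of item 3), with hypothetical instance paths, assuming a simulator that supports the $deposit system task (otherwise the tool's own deposit command or an initial force/release can be used): an include file generated by parsing the SDF/netlist for non-resettable DFF instances, deposited once at time 0 so the gate-level netlist does not start in X.

```systemverilog
initial begin
  $deposit(tb_top.u_dut.u_ctrl.state_reg_0_.Q, 1'b0);
  $deposit(tb_top.u_dut.u_ctrl.state_reg_1_.Q, 1'b0);
  // ... one line per non-resettable flop, emitted by the netlist/SDF parsing script
end
```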
FPGA post-synthesis netlist simulation is usually unnecessary, but if the FPGA timing reports, FPGA functional simulation, CDC, and lint checks are all clean and FPGA synthesis is still suspected, it is worth running. I have found several issues this way: 1) Xilinx V7 by default implements complex case statements in RTL with block RAM, which introduces a timing difference; 2) the FPGA implements some RTL operations directly with internal DSP blocks, and the synthesized result was functionally wrong. Signal names in FPGA netlists are often so mangled that the testbench may fail to compile if it probes or forces internal signals.
Gtech netlist simulation is rarely needed, but if a Gtech netlist is run on an FPGA in place of the RTL, make sure the behavioral description of the FPGA version of the Gtech cells matches the ASIC version, otherwise it is easy to fall into a trap.
Post-P&R gate-level simulation is mainly for producing accurate IR-drop and power data and for checking the completeness and correctness of the timing constraints, so it is recommended to run only typical cases.
Automation of Simulation Verification
Here are some automation techniques beyond the testbench's automatic checking mechanisms (many are run automatically via crontab).
1) Automatically checkout a copy of the code daily for mini-regression, and automatically send the results via email to the project team.
2) Automatically update to a new tag for mini-regression every few hours once a new tag is detected, and automatically send an email.
3) Automatically list all checked-in codes daily and send an email.
4) Automatically summarize unresolved issues daily and send emails (bug-zilla, issue-tracker, Jira, etc. bug-tracking systems may have this feature built-in, while others may need to be implemented manually).
5) Automatic code backup (some code may still be under development and not yet ready to be checked into the codebase; if MIS does not provide automatic backup for such code, a backup program may be needed).
6) In addition to automatic environment comparison, there should also be a program that parses the compilation and simulation logs produced during regression.
If some function lacks an automatic checking mechanism, try to reduce the manual comparison workload. For example, if the C model for an image-algorithm module is missing but the RTL is golden, checking the actual image patterns one by one is time-consuming; instead, the images produced during regression can be packaged into a webpage and reviewed in a browser.
Scripting languages are a crucial part of verification work. We generally use shell, perl, tcl, python, etc. It is a good idea to force yourself to solve the problems you encounter using the language you are currently learning.
Verification Project Management
I have covered this in detail in the thread http://bbs.eetop.cn/thread-581216-1-1.html "Introduction to the Work of Verification Project Leader for Multimedia SoC Projects (Discussion)"; a verification engineer's work is basically a subset of the verification project leader's work.
That thread is organized by project phase: preparation before the start -> project kickoff (before the first version of integrated code is delivered) -> version 0.5 -> version 0.75 -> version 0.9 -> version 1.0 -> tape-out and post-tape-out development. Here I will list a few points not covered in the thread.
The leader should keep a grip on the verification flow. Before verifying on FPGA, besides simulation and checking FPGA timing, static checks such as CDC and lint should also be performed.
The leader should promptly train new colleagues on the basic skills of the work environment, so that people do not lose efficiency to environment issues (for example, different tools have different machine requirements; some need a large cache, others large memory; cache, memory, CPU core count, CPU frequency, etc. may all need to be considered. If MIS has not partitioned the LSF compute resources by machine capability, using LSF's default allocation policy may submit jobs to unsuitable machines).
Be careful that verification engineers do not rely too heavily on the module-level environment, for example assuming that an issue reported on FPGA is not a functional problem simply because it cannot be reproduced at module level.
The leader must promptly summarize all bugs that slipped through simulation. During development these are usually bugs exposed by module-level simulation and bugs found on FPGA. Some escapes may be because the FPGA was built from a version that predates simulation verification; bugs genuinely missed by the verification engineers deserve special attention.
Power data generally comes from gate-level simulation VCD or SAIF files. As designs grow, VCD files may become excessively large; in that case, discuss with the colleagues running power analysis whether there is a suitable workaround. Some hardware emulators can internally analyze which time windows have heavy switching activity and use that information to dump the VCD. One point to note: when very many signals are dumped, the dumped VCD and a compressed format such as fsdb end up roughly the same size (a large share of the storage goes to the signal-name table); in that case dumping VCD directly is sufficient and avoids introducing a PLI for waveform dumping and the subsequent format conversion.
In theory, verification is never finished; some things must be pursued and others let go. The leader cannot take on everything; with limited resources, some tasks must be delegated.
Timely Summary, Recording, and Analysis
Projects with a high degree of reuse are still prone to errors, so caution is essential; strengthen reviews of the modified areas.
Source: EETOP