★
This article is selected from the EETOP forum and is an older post. Many of the development tools and platforms mentioned are now outdated. However, the fundamental ideas of the article remain relevant and are suitable for beginners.
★
I have long wanted to write about my work experience over the past few months, both as a personal summary and as something to share, in the hope of helping beginners avoid a few pitfalls. Because of the heavy workload and frequent overtime at the company, I have put it off until now.

Joining this company feels like fate, and the fate began with NIOS. Back in March, when Altera came to our school to set up an SOPC laboratory, I had no idea what NIOS was; I just wanted to ask the Altera FAE a few questions about timing constraints after his NIOS presentation and take a copy of the slides home. Little did I know that through that NIOS training material I would meet Cawan from the forum, who taught me a great deal about NIOS. Later, Ding Ge posted a notice about the NIOS competition in the SOC section; my teammates and I signed up and attended the NIOS training at Sichuan University, where I met Junlong's FAE, who is now my boss. I want to thank Cawan, Ding Ge, my teammate Liu Ke who competed alongside me, and my boss for giving me this experience.

In my few months at the company I have not worked on many projects, but I have gained some insights. The most significant, I believe, is a change in design philosophy, and that is what I want to summarize in this article.

Timing is designed, not simulated

My boss used to work at Huawei and at Junlong, so he naturally brought over some of Huawei's and Altera's thinking about logic design, and our project workflow is largely based on Huawei's standards. During these months of work, what struck me most was a Huawei saying: "Timing is designed, not simulated, and certainly not improvised." In our company every project undergoes strict review, and only after passing review may we proceed to the next step. For logic design we do not start coding right away; we first write an overall design plan and a detailed logic design plan, and only when both have passed review and been judged feasible do we begin coding. This part of the work generally takes far longer than the coding itself.

The overall plan mainly covers module partitioning, the interface signals and timing of the first- and second-level modules (we require the timing waveforms of the interface signals to be described), and how the design will later be tested. At this first level we must guarantee that timing converges at the first-level modules (and ultimately at the second-level modules). What does this mean? During detailed design we will inevitably adjust the timing of some signals, but such adjustments may only affect the module at hand; they must not ripple through the entire design. I remember that at school, because I did not understand designing timing, one signal failing to meet its timing often forced me to adjust the timing of signals in other modules, which was very frustrating. At the detailed logic design level, the interface timing of every module has already been designed, and the internal implementation of each module is essentially settled. Once this is achieved, coding goes much faster, and, most importantly, the design stays in a controllable state: an error in one part does not force the whole design to start over.

The difficulty of logic design lies in system architecture design and simulation verification
When I first joined the company, my boss told me that the difficulty of logic design does not lie in writing RTL-level code, but in system architecture design and simulation verification. At present there is a great deal of emphasis in China on synthesizable design, while material on system architecture design and simulation verification is scarce, which may reflect the relatively low level of design here. Back in school I always thought that as long as the RTL-level code was done well, simulation verification was a formality, so I looked down on the behavioral-description side of HDL syntax and was reluctant to learn how to write testbenches, since drawing waveform diagrams seemed easier; of system architecture design I understood nothing at all. After joining the company and encountering real designs, I realized it is completely different. Abroad, in fact, the time and manpower spent on simulation verification is about twice that spent on RTL-level code; simulation verification is now the critical path of million-gate chip design.

The difficulty of simulation verification lies mainly in how to model accurately and completely so as to verify the correctness of the design (chiefly, to raise code coverage), and verification speed matters throughout. Put simply, verification is about generating stimulus with sufficient coverage and about detecting errors. I believe the most fundamental requirement of simulation verification is automation, and that is exactly why we write testbenches. In one of my current designs, each simulation run takes about an hour (and this counts as a small design). Drawing waveform diagrams cannot be automated, so if we rely on waveforms for simulation we face several problems: first, drawing the waveforms is tedious (especially for complex algorithms or designs whose inputs follow statistical distributions); second, analyzing the waveforms is overwhelming; and third, the rate of automatic error detection is essentially zero.

So how do we achieve automation? My own skills are still limited, so I can only briefly discuss the BFM (Bus Functional Model). For example, to build a MAC core (with a PCI bus as the backplane), we need a MAC_BFM, a PCI_BFM, and a PCI_BM (PCI behavioral model). The main job of MAC_BFM is to generate Ethernet frames (the stimulus source) with random lengths, random headers, and random content, while copying each frame to PCI_BM. PCI_BFM simulates the behavior of the PCI bus: for instance, when the DUT receives a correct frame and raises a request on the PCI bus, PCI_BFM responds and collects the data. PCI_BM's main job is to compare what MAC_BFM sent with what PCI_BFM received; since it holds both the send-side and the receive-side information, a reasonable design lets it automatically and completely test whether the DUT works correctly, achieving automatic error detection.

Huawei is probably among the best in China at simulation verification; they have built a good verification platform, and most of the communication-related BFMs are already in place. A friend told me that now they only need to drop the DUT onto the test platform and configure a few parameters to verify the DUT's functional correctness automatically.
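To make the idea of automatic checking concrete, here is a minimal self-checking testbench sketch in Verilog. It is not the BFM structure described above, only a toy illustration of the same principle: the testbench generates random stimulus, keeps its own reference copy of what should come out, and compares the DUT output automatically. The DUT (simple_dut, assumed to delay its input by one clock) and all names are hypothetical.

`timescale 1ns/1ns

// Hypothetical one-cycle-delay DUT, standing in for a real design.
module simple_dut (
    input            clk,
    input            rst_n,
    input      [7:0] din,
    output reg [7:0] dout
);
    always @ (posedge clk or negedge rst_n)
        if (!rst_n) dout <= 8'h00;
        else        dout <= din;
endmodule

module tb_auto_check;
    parameter CLK_PERIOD  = 20;
    parameter NUM_VECTORS = 1000;

    reg        clk = 1'b0;
    reg        rst_n;
    reg  [7:0] din;
    wire [7:0] dout;
    reg  [7:0] expected;   // reference model: what the DUT should output
    integer    i, err_cnt;

    simple_dut u_dut (.clk(clk), .rst_n(rst_n), .din(din), .dout(dout));

    always # (CLK_PERIOD/2) clk = ~clk;

    initial begin
        err_cnt  = 0;
        rst_n    = 1'b0;
        din      = 8'h00;
        expected = 8'h00;
        # (5 * CLK_PERIOD) rst_n = 1'b1;

        for (i = 0; i < NUM_VECTORS; i = i + 1) begin
            @ (posedge clk);
            din      <= $random;   // random stimulus source
            expected <= din;       // reference model: a one-cycle delay
            // automatic check: compare the DUT output with the reference
            if (rst_n && dout !== expected) begin
                err_cnt = err_cnt + 1;
                $display("ERROR at %0t: dout=%h expected=%h", $time, dout, expected);
            end
        end
        $display("Done: %0d vectors, %0d errors", NUM_VECTORS, err_cnt);
        $stop;
    end
endmodule

A real BFM is the same idea scaled up: the stimulus becomes whole Ethernet frames, the reference copy goes to PCI_BM, and the comparison runs over complete transactions instead of single words.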
After functional simulation is complete: since we are doing FPGA design, once the RTL-level code has been shown consistent with the functional simulation results, then as long as the static timing report after synthesis and place-and-route carries no warnings about violated timing constraints, we can move on to debugging on the board. In fact, Huawei and ZTE do not run timing simulations for FPGA designs, because timing simulation is very time-consuming and may not catch more than a careful review of the static timing analysis report. For ASIC design, of course, the verification workload is larger, especially for multi-clock-domain designs, and post-layout simulation is generally still performed; but before post-simulation they typically use formal verification tools and static timing analysis reports to check for violations of the design requirements, which greatly reduces the post-simulation workload.

Regarding HDL languages, there is much debate in China about whether VHDL or Verilog is better. Personally, I find the debate rather pointless: most large companies use Verilog for RTL-level coding, so I recommend learning Verilog. On the simulation side, since VHDL is weaker than Verilog at behavioral modeling, very few simulation models are written in VHDL. That does not make Verilog superior, though; Verilog also has its limits in complex behavioral modeling, such as its limited support for arrays. Some complex algorithm designs require higher-level languages to describe behavioral models abstractly. Abroad, simulation models are often written in SystemC or the e language, and using Verilog for this is already considered rather dated. Huawei's verification platform is said to be written in SystemC.

On system architecture design, my own designs have not been large enough for me to claim much experience. I only feel that some knowledge of computer architecture is essential. The primary basis for partitioning is functionality, followed by the choice of suitable bus structures, storage structures, and processor architectures; through architectural partitioning we aim to make each functional module clear and easy to implement. I plan to share my thoughts on this part after some time, so for now I will not risk misleading anyone.

Standards are very important

Friends who have worked in companies know that companies place great emphasis on standards, especially for large designs (software or hardware); it is almost impossible to complete one without following standards. Logic design is no exception. If you ignore the standards, then when you find an error while debugging a month later you will probably have forgotten what many of your signals do, let alone be able to debug them; if someone leaves midway through a project, the successor will likely have to restart the design from scratch; and if new features must be added to an existing version, that too may mean starting over, making design reuse impossible. In logic design I consider the following standards especially important:

1. Design must be documented. The design ideas and the detailed implementation must be written into documents, which must pass strict review before work proceeds to the next step.
This may seem time-consuming at first, but from the perspective of the whole project it is definitely more time-efficient than jumping straight into coding, and it keeps the project in a controllable, feasible state.

2. Code standards.
a. Designs must be parameterized. For example, if the initial design clock period is 30ns and the reset period is 5 clock cycles, we can write:
parameter CLK_PERIOD   = 30;
parameter RST_MUL_TIME = 5;
parameter RST_TIME     = RST_MUL_TIME * CLK_PERIOD;
...
rst_n = 1'b0;
# RST_TIME rst_n = 1'b1;
...
# (CLK_PERIOD/2) clk <= ~clk;

If in another design the clock is 40ns and the reset duration stays the same, we only need to change CLK_PERIOD (for example by overriding the parameter when instantiating), which makes the code much easier to reuse.
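As a fuller illustration, here is a minimal sketch of how these parameters might sit in a complete testbench skeleton; the always/initial framing is my addition, not part of the original fragment:

`timescale 1ns/1ns

module tb;
    parameter CLK_PERIOD   = 30;                        // clock period in ns
    parameter RST_MUL_TIME = 5;                         // reset lasts 5 clock cycles
    parameter RST_TIME     = RST_MUL_TIME * CLK_PERIOD;

    reg clk = 1'b0;
    reg rst_n;

    // free-running clock derived from the parameter
    always # (CLK_PERIOD/2) clk = ~clk;

    // reset generation derived from the same parameter
    initial begin
        rst_n = 1'b0;
        # RST_TIME rst_n = 1'b1;
    end
endmodule

Porting this to a 40ns-clock design then means changing only CLK_PERIOD.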
b. Signal naming must be standardized.

1) Signal names should be all lowercase; parameters should be all uppercase.

2) Active-low signals take the suffix _n, such as rst_n.

3) Port signals should be listed one per line, grouped by direction and by which module the signal connects to; this makes it much easier to find errors during later simulation and verification. For example:

module a (
    // input
    clk, rst_n,              // global signals
    wren, rden, avalon_din,  // related to the avalon bus
    di,                      // related to the serial port input
    // output
    data_ready,
    avalon_dout,             // related to the avalon bus
    ...
);

4) A module should ideally use only one clock (module here means a Verilog module or a VHDL entity). In multi-clock-domain designs it is best to have a dedicated module for clock-domain isolation; this lets the synthesizer produce better results.

5) Do the logic in the lower-level modules and mainly instantiate in the higher-level modules; the top-level module should contain nothing but instantiations, no glue logic at all, not even a single signal inversion. The reasoning is the same as in 4).

6) In FPGA designs, do not build latches out of pure combinational logic; latches built from D flip-flops are allowed, for example in configuration registers.

7) Signals entering the FPGA should generally be synchronized first, which improves the system operating frequency (at the board level). All module outputs should be registered, which raises the operating frequency and also helps greatly with timing convergence.

8) Unless the design is a low-power one, do not use gated clocks; they make the design less stable. Where a gated clock is unavoidable, register the gating signal on the falling edge of the clock and then AND the register's output with the clock:

(Diagram: clk_gate_en drives the D input of a flip-flop clocked on the falling edge of clk; the flip-flop's Q output is ANDed with clk to produce gate_clk_out.)

9) Do not use signals divided down by counters as clocks for other modules; use a clock-enable scheme instead. Otherwise clocks scattered all over the design harm its reliability and greatly increase the difficulty of static timing analysis. For example, if the FPGA input clock is 25M and we need to communicate with a PC over RS232 at the rate rs232_1xclk, do not do this:
always @ (posedge rs232_1xclk or negedge rst_n)
begin
    ...
end

Instead, do this:

always @ (posedge clk_25m or negedge rst_n)
begin
    if (!rst_n)
        ...
    else if (rs232_1xclk == 1'b1)   // rs232_1xclk is now a one-cycle enable pulse
        ...
end
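To make item 9 concrete, here is a minimal sketch of how an enable pulse like rs232_1xclk might be generated from the 25M clock with a counter, instead of dividing the clock itself. The divide ratio DIV is a made-up placeholder (roughly 25MHz/9600 for a UART bit rate), not a value from the original text:

// Hypothetical clock-enable generator: produces a one-cycle-wide pulse
// (rs232_1xclk) every DIV cycles of clk_25m, instead of a divided clock.
module baud_en_gen #(
    parameter DIV = 2604   // placeholder divide ratio, approx. 25e6/9600
) (
    input      clk_25m,
    input      rst_n,
    output reg rs232_1xclk   // single-cycle enable pulse, NOT a clock
);
    reg [11:0] cnt;

    always @ (posedge clk_25m or negedge rst_n)
        if (!rst_n) begin
            cnt         <= 12'd0;
            rs232_1xclk <= 1'b0;
        end
        else if (cnt == DIV - 1) begin
            cnt         <= 12'd0;
            rs232_1xclk <= 1'b1;   // high for exactly one clk_25m cycle
        end
        else begin
            cnt         <= cnt + 1'b1;
            rs232_1xclk <= 1'b0;
        end
endmodule

Everything downstream keeps running on clk_25m and merely qualifies its logic with the enable, so the whole design stays in a single clock domain.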
How to Improve Circuit Operating Frequency

As designers, we naturally want the operating frequency of our circuits (here, unless otherwise specified, operating frequency means the frequency inside the FPGA) to be as high as possible. We often hear about trading resources for speed, and pipelining really can raise the operating frequency; it is a very important technique. Here I want to analyze further how to raise a circuit's operating frequency.

First, let us analyze what limits a circuit's operating frequency. It depends mainly on the signal propagation delay between registers, plus clock skew. Inside an FPGA the clock travels on dedicated global lines, so clock skew is very small and can generally be ignored; for simplicity we will consider only the signal propagation delay. That delay consists of the register switching delay, the routing delay, and the delay through combinational logic (this classification may not be entirely rigorous, but it suffices for analyzing the problem). To raise the operating frequency we must make each of these three delays as small as possible.
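Put as a simple relation (my own summary, not from the original text), the minimum clock period between two registers must cover all three delay components plus the setup time of the capturing register:

    Tclk_min = Tco + Tlogic + Trouting + Tsu   (ignoring skew)
    Fmax     = 1 / Tclk_min

where Tco is the register clock-to-output (switching) delay, Tlogic the combinational-logic delay, Trouting the routing delay, and Tsu the setup time of the receiving register. Everything below is about shrinking the terms in this sum.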
First, let's look at the switching delay. It is determined by the physical characteristics of the device and cannot be changed, so we can only improve the routing and reduce the combinational logic in order to raise the operating frequency.

1. Reduce delay by improving the routing. Taking Altera devices as an example, the timing closure floorplan in Quartus shows many blocks, which can be grouped by rows and columns; each block is one LAB, and each LAB contains 8 or 10 LEs. The routing delays rank as follows: within the same LAB (fastest) < same column or same row < different row and different column.
By adding appropriate constraints in the synthesizer (do not be greedy; a margin of about 5% is generally right. For example, if the circuit runs at 100MHz, constrain it to about 105MHz; greed can produce worse results and greatly lengthen synthesis time), we can encourage place-and-route to keep related logic close together and so reduce the routing delay. (Note: meeting constraints is not only a matter of improving placement and routing to raise the operating frequency; there are other measures as well.)

2. Reduce delay by shrinking the combinational logic. As noted above, constraints can raise the operating frequency, but at the start of a design we must not pin our hopes on constraints alone. We should avoid large blocks of combinational logic through sound design; this raises the operating frequency and also improves portability, so the design still works when moved to another chip of the same speed grade.
We know that most FPGAs today are based on 4-input LUTs. If the logic feeding an output depends on more than four inputs, several LUTs must be cascaded, introducing an extra level of combinational-logic delay. To shrink the combinational logic we reduce the number of inputs per term, so fewer LUT levels are cascaded and the delay drops. The pipelining we hear so much about raises the operating frequency by cutting large combinational logic apart (inserting one or more levels of D flip-flops so that less combinational logic sits between registers). For example, a 32-bit counter has a long carry chain that inevitably lowers the operating frequency. We can split it into a 4-bit counter and a 28-bit counter, advancing the 28-bit counter each time the 4-bit counter reaches 15; this segments the counter and raises the operating frequency (a sketch follows below).

In state machines, large counters should generally be moved outside the state machine, because counters easily exceed four inputs, and using them directly as state-transition conditions increases LUT cascading and therefore combinational logic. For example, with a 6-bit counter we originally wanted the state to change when the count reached 111100. Instead, we move the counter outside the state machine and have it generate a registered enable signal when the count reaches 111011 (one count early, so the enable arrives exactly as the counter hits 111100); the state machine then transitions on that single enable bit, reducing its combinational logic.
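Here is a minimal sketch of that counter segmentation, assuming the straightforward structure described above; the widths and names are mine, not from the original text:

// Segmented 32-bit counter: the 4-bit counter runs every cycle, and the
// 28-bit counter advances only when the low counter wraps, so the long
// 32-bit carry chain is broken into two shorter ones.
module cnt32_seg (
    input             clk,
    input             rst_n,
    output reg [3:0]  cnt_lo,   // low 4 bits, counts every cycle
    output reg [27:0] cnt_hi    // high 28 bits, advances on wrap of cnt_lo
);
    wire lo_wrap = (cnt_lo == 4'hF);

    always @ (posedge clk or negedge rst_n)
        if (!rst_n) cnt_lo <= 4'd0;
        else        cnt_lo <= cnt_lo + 1'b1;   // wraps naturally from 15 to 0

    always @ (posedge clk or negedge rst_n)
        if (!rst_n)       cnt_hi <= 28'd0;
        else if (lo_wrap) cnt_hi <= cnt_hi + 1'b1;
endmodule

The full count is simply {cnt_hi, cnt_lo} and behaves exactly like a flat 32-bit counter. Because cnt_hi changes only once every 16 cycles, its carry chain can also be relaxed with a multicycle constraint if needed.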
Above we discussed cutting combinational logic apart with pipelining, but in some cases the logic is hard to cut. What then? State machines are one such case: we cannot pipeline the state-decoding combinational logic. If a design contains a state machine with dozens of states, its state-decoding logic will be enormous and will very likely become the critical path of the design. So what do we do? The old recipe still applies: reduce the combinational logic. We can analyze the outputs of the states, reclassify them, and redefine them as several smaller state machines; by decoding the input (with a case statement) and triggering the corresponding small state machine, we segment the large state machine. In the ATA6 specification (the hard disk standard) there are about 20 commands, each corresponding to many states. Implementing that as one large state machine (states within states) is unimaginable; instead we can decode the commands with a case statement and trigger the corresponding sub-state-machine, letting the module run at a much higher frequency. A compact sketch of this dispatch structure follows.
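Here is a minimal sketch of such a dispatcher, with hypothetical command codes (they echo real ATA opcodes, but the encoding and names are placeholders, not taken from the original text). Each start pulse kicks off a small state machine in its own module, so no single machine needs dozens of states:

// Hypothetical command dispatcher: decode the command once with a case
// statement, then hand control to a dedicated small state machine per
// command, instead of one huge states-within-states machine.
module cmd_dispatch (
    input            clk,
    input            rst_n,
    input            cmd_valid,
    input      [7:0] cmd,        // command opcode
    output reg       start_rd,   // kicks off the read sub-state-machine
    output reg       start_wr,   // kicks off the write sub-state-machine
    output reg       start_id    // kicks off the identify sub-state-machine
);
    localparam CMD_RD = 8'h20,   // placeholder opcodes
               CMD_WR = 8'h30,
               CMD_ID = 8'hEC;

    always @ (posedge clk or negedge rst_n)
        if (!rst_n) begin
            start_rd <= 1'b0;
            start_wr <= 1'b0;
            start_id <= 1'b0;
        end
        else begin
            // default: no sub-machine triggered this cycle
            start_rd <= 1'b0;
            start_wr <= 1'b0;
            start_id <= 1'b0;
            if (cmd_valid)
                case (cmd)       // one shallow level of decoding
                    CMD_RD:  start_rd <= 1'b1;
                    CMD_WR:  start_wr <= 1'b1;
                    CMD_ID:  start_id <= 1'b1;
                    default: ;   // unknown commands are ignored
                endcase
        end
endmodule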
In summary, the essence of raising the operating frequency is reducing the register-to-register delay, and the most effective means is to avoid large blocks of combinational logic: keep each piece of logic within four inputs where possible and minimize the number of cascaded LUT levels. Concretely, we can raise the operating frequency through constraints, pipelining, and the segmentation of counters and state machines.