It has been over ten years since I first encountered FPGAs during my university days. I still remember the excitement of completing my first experiments, a digital stopwatch, a buzzer, and a password lock, on the EDA experiment platform. At that time I had not yet been exposed to hardware description languages; the designs were built from 74-series logic devices in the MAX+plus II schematic environment. Later, during graduate school and at work, I gradually used Quartus II, Foundation, ISE, and Libero, and learned Verilog HDL. Along the way I slowly came to appreciate the wonders of Verilog: a small piece of code can accomplish what a complex schematic design does, and it is far more portable and easier to work with than schematic entry.
Anyone who has worked at a company knows that standards are taken very seriously, especially in large designs, whether software or hardware; without following them it is nearly impossible to deliver results. Logic design is no different. If you ignore the standards, you may hit errors a month into debugging, look back at your own code, and find you have forgotten what many of the signals do, let alone how to debug them. If someone leaves partway through a project, the successor may have to restart the design from scratch; if new features must be added on top of the original version, the design may likewise have to be redone. Reuse becomes very difficult.
In terms of logic, I believe the following standards are important:
1. Design must be documented. The design ideas and detailed implementations must be written into documents, and only after strict reviews can the next steps proceed. This may seem time-consuming at first, but from the overall project perspective, it is definitely more time-saving than jumping straight into coding, and this approach keeps the project controllable and feasible.
2. Code must follow the standards. For example, timing constants such as the clock period should be defined as parameters: if the clock in another design is 40 ns and the reset duration is unchanged, we only need to override the CLK_PERIOD parameter when instantiating, which makes the code far more reusable.
3. Signal naming must be standardized, and the following coding guidelines should be observed:
- Signal names should be all lowercase; parameter names should be all uppercase.
- Active-low signals should end with the _n suffix, e.g. rst_n.
- Port signals should be laid out uniformly, one signal per line, preferably grouped by direction (input/output) and by which module each signal comes from or goes to; this makes later simulation, verification, and debugging much easier.
- A module (meaning a Verilog module or a VHDL entity) should ideally use only one clock. In multi-clock-domain designs, it is best to have a dedicated module for clock-domain isolation; this lets the synthesizer produce better results.
- Logic should live in the lower-level modules, with instantiation at the higher levels; the top-level module should contain nothing but instantiations, with no glue logic at all, not even a single signal inversion. The reason is the same: the synthesizer produces better results.
- In FPGA design, pure combinational logic should not create latches; registered latching with D flip-flops is allowed, e.g. for configuration registers.
- Generally speaking, signals entering the FPGA must be synchronized (at the board level) to improve the system's operating frequency.
- All module outputs should be registered; this improves the operating frequency and is very helpful for achieving timing closure.
- Unless it is a low-power design, avoid gated clocks, which make a design less stable. Where a gated clock is truly necessary, register the gating signal on the falling edge of the clock before ANDing it with the clock, so that no glitches appear on the gated clock.
- Do not use a counter-divided signal as the clock for other modules; use a clock enable instead. Otherwise derived clocks flying around the design hurt reliability and greatly complicate static timing analysis. For example, if the FPGA's input clock is 25 MHz and the system must communicate with a PC over RS-232, do not divide the clock down to the baud rate; instead generate a clock-enable pulse, rs232_1xclk, at the baud rate, and send data only when the enable is active.
Timing is Designed
My boss previously worked at Huawei and Junlong, so he naturally brought with him some of the logic design practices of Huawei and Altera, and our project process is basically modeled on Huawei's. In these months of work, my deepest realization has been Huawei's saying: timing is designed, not simulated, and certainly not cobbled together. At our company every project goes through a strict review, and only after passing review can it move to the next stage. Logic design does not start with coding; we first write an overall design plan and a detailed logic design plan, and only after these have been reviewed and judged feasible do we begin coding. In general, this part of the work takes considerably longer than the coding itself.
The overall plan mainly covers module partitioning, the interface signals and timing of the first- and second-level modules (we require timing waveform descriptions for the interface signals), and how the design will be tested later. At this level of planning we must ensure that timing converges within the first-level modules (and ultimately within the second-level modules). What does this mean? During detailed design we will inevitably adjust the timing of some signals, but such adjustments may affect only the current module and must not ripple through the whole design. I remember that in school, because one signal's timing failed to meet requirements, I often had to adjust the timing of signals in other modules, which was very frustrating.
At the detailed logical design level, we have already designed the interface timing for each module, and the internal implementations of each module are basically determined. By achieving this, the coding process becomes much faster. Most importantly, this approach keeps the design in a controllable state, avoiding the need to start over due to an error in any one part of the design.
Improving the Operating Frequency of a Circuit
For designers, we naturally hope that the working frequency of our designed circuit (here, unless otherwise specified, the working frequency refers to the internal working frequency of the FPGA) is as high as possible. We often hear about trading resources for speed and using pipelining to improve working frequency; this is indeed a very important method. Today, I want to further analyze how to improve the working frequency of circuits.
Let’s first analyze what affects the working frequency of circuits.
The working frequency of a circuit is mainly determined by the signal propagation delay between registers plus clock skew. Inside an FPGA the clock travels on dedicated global clock lines, so clock skew is very small and can basically be ignored; for simplicity we consider only signal propagation delay. That delay consists of the register switching delay (clock-to-output delay), routing delay, and the delay through combinational logic (this division may not be rigorous, but it suffices for the analysis). To raise the working frequency we must shrink these three delays. The switching delay is fixed by the physical characteristics of the device and cannot be changed, so what remains is to improve routing and to reduce combinational logic.
1. Reduce delay by changing routing methods.
Taking Altera devices as an example, in the Quartus timing closure floorplan you can see many blocks arranged in rows and columns; each block is a LAB, and each LAB contains 8 or 10 LEs (logic elements). Routing delay, from fastest to slowest, is: within the same LAB, then within the same row or column, then across different rows and columns.
By giving the synthesizer appropriate constraints (without being greedy: about 5% of margin is generally suitable, e.g. constrain a circuit that needs to work at 100 MHz to 105 MHz; over-constraining can produce worse results and greatly increase synthesis time), we can pull related logic closer together and reduce routing delay. (Note: constraints do not work solely by improving placement and routing to raise the working frequency; the tools apply other improvements as well.)
2. Reduce delay by minimizing combinational logic.
As mentioned above, we can improve working frequency by adding constraints, but at the beginning of the design, we should never rely on constraints to achieve high working frequency. Instead, we should avoid large combinational logic through reasonable design to enhance the portability of the design, ensuring that it can still be used when ported to another chip of the same speed level.
We know that most FPGAs are based on 4-input LUTs. If an output depends on more than four inputs, several LUTs must be cascaded, and each extra level of cascading introduces another level of combinational-logic delay. To reduce combinational logic we should minimize the number of input conditions, so that fewer LUTs are cascaded and the combinational delay drops.
Pipelining, as commonly mentioned, improves the working frequency by cutting large combinational logic apart (inserting one or more levels of D flip-flops so that less combinational logic sits between registers). For example, a 32-bit counter has a long carry chain that will inevitably lower the working frequency. We can split it into a 4-bit counter and a 28-bit counter, advancing the 28-bit counter each time the 4-bit counter reaches 15; this segments the counter and improves the working frequency.
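The counter split described above might be sketched as follows (hypothetical module and signal names; the terminal count of the 4-bit low stage enables the 28-bit high stage, so the long carry chain is no longer exercised on every cycle):

```verilog
// Hypothetical sketch: split a 32-bit counter into 4-bit + 28-bit stages.
module split_counter (
    input  wire        clk,
    input  wire        rst_n,
    output wire [31:0] count
);
    reg  [3:0]  cnt_lo;
    reg  [27:0] cnt_hi;
    wire        lo_wrap = (cnt_lo == 4'hF);  // short 4-bit compare

    // Low stage: a fast 4-bit increment every cycle
    always @(posedge clk or negedge rst_n)
        if (!rst_n) cnt_lo <= 4'd0;
        else        cnt_lo <= cnt_lo + 4'd1;

    // High stage advances only when the low stage wraps, so its
    // long carry chain is out of the every-cycle critical path.
    always @(posedge clk or negedge rst_n)
        if (!rst_n)       cnt_hi <= 28'd0;
        else if (lo_wrap) cnt_hi <= cnt_hi + 28'd1;

    assign count = {cnt_hi, cnt_lo};
endmodule
```

On the edge where cnt_lo wraps from 15 back to 0, cnt_hi increments, so the concatenated value still counts 0, 1, 2, … as a plain 32-bit counter would.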
In state machines, large counters should generally be moved outside the state machine, because counters usually have more than four inputs; using them directly as state-transition conditions adds LUT cascades and hence combinational logic. For example, with a 6-bit counter we originally wanted the state to change when the count reached 111100. Instead we move the counter outside the state machine and have it generate a registered enable signal when the count reaches 111011, one count early, so the enable arrives at the state machine exactly when the counter hits 111100, reducing the combinational logic inside the state machine.
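A minimal sketch of this early-enable trick, under assumed names (the enable is registered when the counter reads 6'b111011, so the state machine sees it on the cycle the counter reads 6'b111100):

```verilog
// Hypothetical sketch: keep the 6-bit counter outside the FSM and
// feed it a one-cycle registered enable instead of a wide compare.
module fsm_with_ext_counter (
    input  wire clk,
    input  wire rst_n,
    output reg  state            // two states for brevity
);
    reg [5:0] cnt;
    reg       jump_en;           // registered enable into the FSM

    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            cnt     <= 6'd0;
            jump_en <= 1'b0;
        end else begin
            cnt     <= cnt + 6'd1;
            // Assert one count early: the registered enable
            // arrives exactly when cnt reaches 6'b111100.
            jump_en <= (cnt == 6'b111011);
        end

    // The state transition now depends on a single 1-bit enable,
    // not a 6-input compare folded into the FSM's next-state logic.
    always @(posedge clk or negedge rst_n)
        if (!rst_n)       state <= 1'b0;
        else if (jump_en) state <= ~state;
endmodule
```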
The above methods can be used to segment combinational logic through pipelining, but in some cases, it is difficult to segment combinational logic. How should we handle such situations?
State machines are one such case: we cannot insert pipeline registers into the combinational logic of state decoding. If a design has a state machine with dozens of states, its state decoding logic will be enormous and will undoubtedly become a critical path of the design. So what do we do? The remedy is the same old one: reduce the combinational logic. We can analyze the state outputs, reclassify them, and redefine them as several small state machines; by decoding the inputs (with a case statement) we trigger the corresponding small state machine, thereby segmenting the large state machine. In the ATA-6 specification (a hard disk standard) there are about 20 input commands, each corresponding to many states. A single large state machine (states within states) would be unthinkable; instead we can decode the commands with a case statement and trigger the corresponding small state machines, which lets the module run at a much higher frequency.
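One way to sketch this decomposition, with hypothetical command codes and deliberately elided sub-machines (the wide state machine becomes a narrow command decoder that starts small per-command state machines):

```verilog
// Hypothetical sketch: decode the command first, then run a small
// state machine per command instead of one giant state machine.
module cmd_dispatch (
    input  wire       clk,
    input  wire       rst_n,
    input  wire       cmd_valid,
    input  wire [7:0] cmd,        // hypothetical command code
    output reg        busy
);
    localparam CMD_READ  = 8'h20; // placeholder opcodes
    localparam CMD_WRITE = 8'h30;

    reg [1:0] rd_state, wr_state; // small per-command FSMs

    always @(posedge clk or negedge rst_n)
        if (!rst_n) begin
            rd_state <= 2'd0;
            wr_state <= 2'd0;
            busy     <= 1'b0;
        end else if (cmd_valid) begin
            case (cmd)            // narrow decode selects a sub-FSM
                CMD_READ:  begin rd_state <= 2'd1; busy <= 1'b1; end
                CMD_WRITE: begin wr_state <= 2'd1; busy <= 1'b1; end
                default:   busy <= 1'b0;
            endcase
        end else begin
            // Each small FSM advances with its own short decode
            // logic; real per-state work is elided in this sketch.
            if (rd_state != 2'd0) rd_state <= rd_state + 2'd1;
            if (wr_state != 2'd0) wr_state <= wr_state + 2'd1;
            if (rd_state == 2'd0 && wr_state == 2'd0) busy <= 1'b0;
        end
endmodule
```

Each sub-machine's next-state logic now looks at only a few bits, so no single decode path has to span dozens of states.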
In summary, the essence of raising the working frequency is reducing the delay between registers, and the most effective method is to avoid large combinational logic, which means keeping logic within the four-input limit as much as possible and minimizing LUT cascading. We can raise the working frequency through constraints, pipelining, and state-machine segmentation.
The Difficulty of Logic Design Lies in System Structure and Simulation Verification
When I first joined the company, my boss told me that the difficulty of logic design lies not in the RTL-level code but in system structure design and simulation verification. At present there is a strong emphasis in China on synthesizable design, while material on system structure design and simulation verification seems scarce, which may reflect a relatively low overall design level. When I was in school I always thought getting the RTL code right was enough and that simulation verification was a formality, so I disdained learning the behavioral-modeling side of HDL syntax and was reluctant to write testbenches, because I thought drawing waveforms was more convenient; I did not understand system structure design at all. After joining the company and seeing how things are actually done, I realized it is not like that at all.
In fact, abroad, the time and manpower spent on simulation verification is about twice that spent on RTL-level code. Currently, simulation verification is the critical path for million-gate chip designs.
The difficulty of verification mainly lies in how to model to completely and accurately verify the correctness of the design (mainly to improve code coverage). During this process, verification speed is also very important.
In simple terms, verification is about generating stimulus with sufficient coverage and detecting errors. I believe the most basic principle of simulation verification is automation, and this is exactly why we write testbenches. In one of my current designs, each simulation run takes about an hour (and it counts as a small design). Drawing waveforms cannot be automated, so relying on hand-drawn waveforms for simulation brings several problems: first, drawing the waveforms is tedious (especially for complex algorithms or designs whose inputs follow statistical distributions); second, analyzing the output waveforms is overwhelming; and third, the error detection rate is close to zero. So how do we automate? My own level is still limited, so I will only briefly discuss the BFM (bus functional model).
For example, when creating a MAC core (with a PCI bus as the backbone), we need a MAC_BFM, PCI_BFM, and PCI_BM (PCI behavior model). The main function of MAC_BFM is to generate Ethernet frames (stimulus source) of random length and frame header, with random content, while simultaneously copying one to PCI_BM; the function of PCI_BFM is to simulate the behavior of the PCI bus. For instance, when the device under test receives a correct frame, it will send a request to the PCI bus, and PCI_BFM will respond to it and receive the data; the main function of PCI_BM is to compare the data sent by MAC_BFM with the data received by PCI_BFM. Since it has both the sending information from MAC_BFM and the receiving information from PCI_BFM, as long as the design is reasonable, it can automatically and completely test whether the device under test is functioning correctly, thus achieving automatic detection. Huawei is estimated to be one of the best in simulation verification in China. They have established a good verification platform, with most communication-related BFMs completed. My friends say that now they only need to place the device under test on the testing platform, configure the parameters, and they can automatically check the correctness of the functions being tested.
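A drastically simplified self-checking skeleton in the same spirit (hypothetical names throughout; the real design would instantiate the device under test where the loopback is, with the stimulus side recording a reference copy of every word sent, just as MAC_BFM copies each frame to PCI_BM, and a checker comparing what is received against it):

```verilog
`timescale 1ns/1ns
// Hypothetical self-checking testbench sketch: random stimulus,
// a reference copy of each sent word, and automatic comparison.
module tb_selfcheck;
    reg        clk = 0;
    reg  [7:0] tx_data;
    reg        tx_valid = 0;
    wire [7:0] rx_data;
    wire       rx_valid;

    always #5 clk = ~clk;

    // The device under test would be instantiated here; for this
    // sketch we simply loop the stimulus straight back.
    assign rx_data  = tx_data;
    assign rx_valid = tx_valid;

    // Reference copy of every word sent (the role played by the
    // copy MAC_BFM hands to PCI_BM in the text).
    reg [7:0] expect_mem [0:255];
    integer   wr_ptr = 0, rd_ptr = 0, errors = 0;

    initial begin : stimulus
        integer i;
        for (i = 0; i < 100; i = i + 1) begin
            @(posedge clk);
            tx_data  = $random;            // random stimulus source
            tx_valid = 1;
            expect_mem[wr_ptr] = tx_data;  // record what was sent
            wr_ptr = wr_ptr + 1;
        end
        @(posedge clk);
        tx_valid = 0;
        @(posedge clk);
        if (errors == 0) $display("PASS: %0d words checked", rd_ptr);
        else             $display("FAIL: %0d errors", errors);
        $finish;
    end

    // Automatic checker: samples on the falling edge to avoid races
    // with the stimulus process; no waveform inspection required.
    always @(negedge clk)
        if (rx_valid) begin
            if (rx_data !== expect_mem[rd_ptr])
                errors = errors + 1;
            rd_ptr = rd_ptr + 1;
        end
endmodule
```

The point is the structure, not the loopback: stimulus generation, reference storage, and comparison all run without anyone looking at a waveform.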
Once functional simulation is complete, then since this is FPGA design, consistency between the RTL code, the synthesized result, and the functional simulation results is basically assured. As long as the static timing report after synthesis and place-and-route shows no timing constraint violations, we can move on to board-level debugging. In fact, Huawei and ZTE do not run timing simulations for their FPGA designs either: timing simulation is very time-consuming, and its results are not necessarily better than reviewing the static timing analysis report.
Of course, if it is ASIC design, the workload for simulation verification will be larger, especially when it involves multi-clock domain designs; generally, post-simulation is still performed. However, before post-simulation, formal verification tools are typically used to check for any violations of design requirements based on static timing analysis reports. This approach can significantly reduce the workload for post-simulation.
As for HDLs, there has been much debate in China over whether VHDL or Verilog is better; personally I find the debate not very meaningful. Most large companies abroad use Verilog for RTL-level coding, so I recommend learning Verilog if you can. For simulation, since VHDL is weaker than Verilog at behavioral modeling, very few people build simulation models in VHDL. That does not make Verilog superior, though: Verilog also has limitations in complex behavioral modeling, for instance its support for arrays and higher-level data structures is limited, and some complex algorithm designs need higher-level languages to describe behavioral models abstractly. Abroad, simulation models are often written in SystemC or the e language, and using Verilog for them is considered rather outdated; Huawei's verification platform is said to be written in SystemC.
Regarding system structure design, since my designs are not large enough, I cannot claim much experience. I only feel that one must possess some knowledge of computer system architecture.
The primary basis for division is functionality, followed by selecting appropriate bus structures, storage structures, and processor architectures. Through system structure division, clear functional modules should be established, making implementation easier. I plan to share some insights on this part after some time, so I won’t mislead everyone just yet.
Finally, a brief summary of my experience: practice more, think more, and ask more. Practice brings true knowledge; seeing someone else's solution a hundred times is not as good as working through it yourself once. The motivation to practice comes from both interest and pressure, and I personally believe the latter matters more. Having real demands naturally creates pressure, which is to say it is best to grow through actual project development rather than learning for learning's sake.
During practice, think more about the causes of problems, and after solving problems, ask why several times. This is also a process of accumulating experience. If you have a habit of keeping a project log, that’s even better; write down problems, causes, and solutions. Finally, don’t hesitate to ask; if you can’t solve a problem after thinking about it, you should ask. After all, individual strength is limited. Ask classmates, colleagues, search engines, or online friends; a single article or a friend’s advice may help you quickly solve the problem.