Design Techniques and Principles for FPGA/CPLD

Cover image: Intel FPGA products (source: internet)

This article discusses four common design concepts and techniques for FPGA/CPLD: ping-pong operation, serial-parallel conversion, pipeline operation, and data interface synchronization. All of them reflect inherent laws of FPGA/CPLD logic design, and applying them well can make design work twice as effective with half the effort.
Design concepts and techniques for FPGA/CPLD are a very broad topic. Due to space limitations, this article introduces only some of the most commonly used: ping-pong operation, serial-parallel conversion, pipeline operation, and synchronization methods for data interfaces. I hope engineers will take note of these principles and consciously apply them to guide future design work; doing so pays off handsomely.

Ping-Pong Operation

“Ping-pong operation” is a processing technique often used in data flow control. A typical ping-pong operation method is shown in Figure 1.
Figure 1: Ping-pong operation schematic
The processing flow of the ping-pong operation is as follows. The input data stream is distributed in real time to two data buffers by the "input data selection unit". The data buffer can be any storage module; commonly used units include dual-port RAM (DPRAM), single-port RAM (SPRAM), and FIFOs. In the first buffering cycle, the input stream is cached in "data buffer module 1". In the second cycle, the "input data selection unit" switches so that the input stream is cached in "data buffer module 2", while the data cached in "data buffer module 1" during the first cycle is sent to the "data flow computation processing module". In the third cycle, the "input data selection unit" switches back: the input stream is again cached in "data buffer module 1", while the second-cycle data in "data buffer module 2" is sent for processing. This process repeats in a loop.
The greatest feature of the ping-pong operation is that through the coordinated switching of the “input data selection unit” and the “output data selection unit”, the buffered data stream is sent to the “data flow computation processing module” for computation and processing without any pause. When viewing data from both ends of the ping-pong operation module as a whole, both the input data stream and the output data stream are continuous without any interruptions, making it very suitable for pipeline processing of data streams. Therefore, the ping-pong operation is often applied in pipeline algorithms to achieve seamless buffering and processing of data.
The second advantage of the ping-pong operation is that it can save buffer space. For example, in WCDMA baseband applications, one frame consists of 15 time slots, and sometimes it is necessary to delay the processing of a complete frame of data by one time slot. A straightforward approach is to cache the entire frame of data and then process it after delaying by one time slot. In this case, the length of the buffer would be the entire frame of data long. Assuming the data rate is 3.84 Mbps and the frame length is 10 ms, the buffer length required would be 38400 bits. If using the ping-pong operation, only two RAMs capable of buffering one time slot of data are needed (single-port RAM is sufficient). When writing data to one RAM, data is read from the other RAM and sent to the processing unit. Each RAM only needs a capacity of 2560 bits, so the total capacity of the two RAMs is only 5120 bits.
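As an illustration of the control involved, here is a minimal Verilog sketch of a 1-bit-wide ping-pong buffer with two single-port-style RAMs, assuming a single clock domain and hypothetical signal names (slot boundaries are marked by a `slot_end` pulse). While one RAM is written with the incoming slot, the other is read out, so the output is the input delayed by one slot:

```verilog
// Ping-pong buffer control sketch (hypothetical signal names).
// While one RAM is written with the current slot, the other is read out.
module pingpong_ctrl #(
    parameter AW = 12                  // 2^12 = 4096 > 2560 bits per slot
)(
    input  wire clk,
    input  wire rst_n,
    input  wire slot_end,              // one-cycle pulse at each slot boundary
    input  wire din,                   // serial input data
    output wire dout                   // buffered output, delayed by one slot
);
    reg           sel;                 // 0: write RAM0 / read RAM1, 1: the reverse
    reg  [AW-1:0] wr_addr, rd_addr;
    reg           ram0 [0:(1<<AW)-1];  // inferred RAMs, 1 bit wide
    reg           ram1 [0:(1<<AW)-1];
    reg           q0, q1;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            sel     <= 1'b0;
            wr_addr <= {AW{1'b0}};
            rd_addr <= {AW{1'b0}};
        end else if (slot_end) begin
            sel     <= ~sel;           // swap the roles of the two RAMs
            wr_addr <= {AW{1'b0}};
            rd_addr <= {AW{1'b0}};
        end else begin
            wr_addr <= wr_addr + 1'b1;
            rd_addr <= rd_addr + 1'b1;
        end
    end

    always @(posedge clk) begin
        if (!sel) ram0[wr_addr] <= din;   // cache the current slot
        else      ram1[wr_addr] <= din;
        q0 <= ram0[rd_addr];              // read back the previous slot
        q1 <= ram1[rd_addr];
    end

    assign dout = sel ? q0 : q1;          // output from the RAM not being written
endmodule
```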
Additionally, cleverly using ping-pong operation can also achieve the effect of processing high-speed data streams with low-speed modules. As shown in Figure 2, the data buffer module uses dual-port RAM and introduces a first-level data preprocessing module after DPRAM. This data preprocessing can perform various required data operations, such as despreading, deinterleaving, and derotation in WCDMA design. Assuming the input data stream rate at port A is 100 Mbps, the ping-pong operation buffering cycle is 10 ms. Below is an analysis of the data rates at each node port.
Figure 2: Using dual-port RAM and a first-level data preprocessing module to process a high-speed data stream with low-speed modules
The input data stream rate at port A is 100 Mbps. During the first buffering cycle of 10 ms, through the “input data selection unit”, data flows from B1 to DPRAM1. The data rate of B1 is also 100 Mbps, and DPRAM1 must write 1 Mb of data within 10 ms. Similarly, during the second 10 ms, the data stream switches to DPRAM2, with port B2’s data rate also being 100 Mbps, and DPRAM2 is written with 1 Mb of data in the second 10 ms. In the third 10 ms, the data stream switches back to DPRAM1, which is again written with 1 Mb of data.
Careful analysis shows that DPRAM1 has a total of 20 ms in which to read out its data and send it to "data preprocessing module 1". Some engineers are puzzled as to why the read window of DPRAM1 is 20 ms. It is derived as follows. First, during the 10 ms of the second buffering cycle, while data is being written to DPRAM2, DPRAM1 is free to be read. In addition, from 5 ms into the first buffering cycle (absolute time 5 ms), DPRAM1 can already start reading from address 0 while new data is still being written to the addresses beyond 500 K; by the 10 ms mark DPRAM1 has finished writing 1 Mb and has been readable for 5 ms, so this adds 5 ms of read time. Similarly, during the first 5 ms of the third buffering cycle (absolute time 20 ms to 25 ms), the new data being written has not yet overwritten the addresses still being read, adding another 5 ms. Therefore, before the data stored in DPRAM1 during the first cycle is completely overwritten, DPRAM1 has at most 5 + 10 + 5 = 20 ms in which to read out 1 Mb of data, so the data rate at port C1 is 1 Mb / 20 ms = 50 Mbps. The minimum data throughput required of "data preprocessing module 1" is therefore only 50 Mbps, and likewise for "data preprocessing module 2". In other words, the ping-pong operation relieves the timing pressure on the "data preprocessing modules": the required processing rate is only half the input data rate.
The essence of achieving low-speed module processing of high-speed data through ping-pong operation is that the data flow is converted from serial to parallel via the DPRAM buffer unit, and the data is processed in parallel using “data preprocessing module 1” and “data preprocessing module 2”, which reflects the principle of area-speed trade-off!

Serial-Parallel Conversion Design Techniques

Serial-parallel conversion is an important technique in FPGA design. It is a common means of data flow processing and a direct reflection of the area-speed trade-off concept.
There are various ways to implement serial-parallel conversion. Depending on the ordering and quantity of the data, registers, RAM, etc. can be used. In the earlier ping-pong example, the data stream was converted from serial to parallel through DPRAM, and because DPRAM was used, the buffer can be made quite large. For small amounts of data, registers are sufficient. Unless there are special requirements, synchronous timing design should be used for the conversion. For serial-to-parallel conversion with the most significant bit first, the following code can be used:
prl_temp <= {prl_temp, srl_in};
where prl_temp is the parallel output buffer register and srl_in is the serial data input.
For serial-parallel conversions with a specified ordering, case statements can be used; for more complex conversions, state machines are an option. The methods are relatively simple and will not be elaborated here.
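For completeness, a self-contained Verilog sketch of an MSB-first serial-to-parallel converter with a valid strobe might look like this (module and signal names are hypothetical):

```verilog
// MSB-first serial-to-parallel converter sketch (hypothetical names).
module s2p #(
    parameter WIDTH = 8
)(
    input  wire             clk,
    input  wire             rst_n,
    input  wire             srl_in,    // serial data input, MSB first
    output reg [WIDTH-1:0]  prl_out,   // parallel output word
    output reg              prl_vld    // one-cycle pulse when prl_out is valid
);
    reg [WIDTH-1:0] prl_temp;          // shift buffer
    reg [3:0]       cnt;               // bit counter, sized for WIDTH <= 16

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            prl_temp <= 0; cnt <= 0; prl_out <= 0; prl_vld <= 0;
        end else begin
            prl_temp <= {prl_temp[WIDTH-2:0], srl_in};     // shift in, MSB first
            if (cnt == WIDTH-1) begin
                cnt     <= 0;
                prl_out <= {prl_temp[WIDTH-2:0], srl_in};  // latch the full word
                prl_vld <= 1'b1;
            end else begin
                cnt     <= cnt + 1'b1;
                prl_vld <= 1'b0;
            end
        end
    end
endmodule
```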

Pipeline Operation Design Concepts

First, it should be stated that the pipeline discussed here refers to a design concept of processing flow and sequential operation, not the “Pipelining” used in FPGA or ASIC design for timing optimization.
Pipeline processing is a common design technique in high-speed designs. If the processing flow of a design can be divided into several steps, and the data processing is "unidirectional", meaning there is no feedback or iterative computation and the output of one step is the input of the next, then a pipeline design can be considered to raise the system's operating frequency.
Figure 3: Schematic diagram of the pipeline design structure
The schematic of the pipeline design structure is shown in Figure 3. Its basic structure is a suitable division of the operation into n steps connected in a unidirectional series. The greatest feature and requirement of pipeline operation is that the data flow through the steps is continuous in time. If each step is simplified to a single D flip-flop (i.e., latching with a register), then pipeline operation resembles a shift-register chain, with the data stream passing through the D flip-flops in turn to complete each step of the operation. The timing of the pipeline design is shown in Figure 4.
Figure 4: Timing diagram of the pipeline design
A key to pipeline design is the reasonable arrangement of the overall timing, which requires an appropriate division of the operation steps. If the operation time of the previous step equals that of the next, the design is simplest: the output of one step feeds directly into the input of the next. If the previous step takes longer than the next, its output data must be buffered before being fed to the next step. If the previous step takes less time than the next, the data flow must be split by duplicating logic in the next step, or the data must be stored and post-processed in the previous step; otherwise the following step will overflow.
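The simplest case above, where every step takes exactly one clock so outputs feed inputs directly with no buffering, can be sketched in Verilog as a three-stage pipeline (a hypothetical example computing (a+b)*c):

```verilog
// Three-stage pipeline sketch: each step is registered, so a new input
// can be accepted every clock while three items are in flight.
module pipe3 (
    input  wire        clk,
    input  wire [7:0]  a, b, c,
    output reg  [16:0] y
);
    reg [8:0]  sum_s1;      // stage 1 result: a + b
    reg [7:0]  c_s1;        // c delayed one clock to stay aligned with sum_s1
    reg [16:0] prod_s2;     // stage 2 result: (a + b) * c

    always @(posedge clk) begin
        // stage 1: add
        sum_s1  <= a + b;
        c_s1    <= c;
        // stage 2: multiply
        prod_s2 <= sum_s1 * c_s1;
        // stage 3: output register
        y       <= prod_s2;
    end
endmodule
```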
Pipeline processing is often used in WCDMA designs, such as in RAKE receivers, searchers, and preamble capture. The reason pipeline processing is frequently used is that it replicates processing modules, which is another specific manifestation of the area-speed trade-off concept.

Data Interface Synchronization Methods

Data interface synchronization is a common issue in FPGA/CPLD design and is also a key and difficult point. Many unstable designs stem from synchronization issues in data interfaces.
During circuit design, some engineers manually insert BUFTs or inverters to adjust the data delay so that the current module's clock meets the setup and hold requirements of the upstream module's data. Others generate many clock signals at 90-degree phase offsets, sampling data sometimes on the rising edge and sometimes on the falling edge in order to adjust the sampling position. Both practices are highly inadvisable, because as soon as the chip is upgraded or the design is migrated to a different device family, the sampling scheme must be redesigned. Moreover, both leave insufficient margin in the implementation: once external conditions change (for example, a rise in temperature), the sampling timing can fall apart entirely and the circuit fails.
Below are a few different data interface synchronization methods under various conditions:
1. How to complete data synchronization when delays (between chips, PCB wiring, and delays of some drive interface components, etc.) are unpredictable or may change?
For unpredictable or changing data delays, a synchronization mechanism needs to be established, which can use a synchronization enable or synchronization indication signal. Additionally, storing data in RAM or FIFO can also achieve data synchronization.
The RAM/FIFO approach works as follows: the data supplied by the upstream chip is written into the RAM or FIFO using a clock that travels along with the data, and is then read out using the current stage's sampling clock (usually the main data-processing clock). The key to this approach is writing the data into the RAM or FIFO reliably. If synchronous RAM or FIFO is used, a signal with a fixed delay relationship to the data must be available: either a data-valid indication, or the clock with which the upstream module outputs the data. For slow data, asynchronous RAM or FIFO can also be used, but this practice is not recommended.
In some systems, data arrives in a fixed format, with much of the important information located at the head of the data. This situation is very common in communication systems, where data is often organized in "frames". Because the whole system places high demands on the clock, a dedicated clock board is often designed to generate and drive a high-precision clock. Since the data has a starting position, how is synchronization achieved, and how is the data "head" found?
Data synchronization methods can completely use the above methods, employing synchronization indication signals or using RAM or FIFO for buffering.
There are two methods for finding the data head. The first is straightforward: simply transmit a signal indicating the start position of the data along with the data itself. The second, used especially in asynchronous systems, inserts a synchronization code (such as a training sequence) into the data; the receiving end detects the synchronization code with a state machine to locate the data "head". This method is called "blind detection".
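A minimal sketch of such blind detection, reduced to a shift-and-compare (a degenerate one-state machine), assuming a hypothetical 8-bit sync code:

```verilog
// Blind-detection sketch: shift the serial stream through a register and
// compare against a sync code (8'hA5 is an assumed pattern).
module sync_detect (
    input  wire clk,
    input  wire rst_n,
    input  wire din,          // serial data input
    output reg  frame_start   // pulses for one clock when the code matches
);
    localparam [7:0] SYNC_CODE = 8'hA5;  // hypothetical sync pattern
    reg [7:0] shift;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            shift       <= 8'h00;
            frame_start <= 1'b0;
        end else begin
            shift       <= {shift[6:0], din};
            // compare the window including the newest bit
            frame_start <= ({shift[6:0], din} == SYNC_CODE);
        end
    end
endmodule
```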
Another common situation is that the upstream data clock and the current stage's clock are asynchronous; that is, the upstream chip or module and the current chip or module lie in asynchronous clock domains.
A principle already touched on under data input synchronization: if the clock of the input data and the processing clock of the current chip are synchronous and of the same frequency, the chip's main clock can directly register the input data to complete synchronization; if the input data and the current chip's processing clock are asynchronous, especially when the frequencies differ, the input data must be registered twice with the processing clock (two cascaded flip-flops) to complete synchronization. Note that registering asynchronous-domain data twice effectively prevents metastability (unstable logic states) from propagating, so that downstream logic sees only valid levels. However, it does not guarantee that the registered data is correct; this approach generally lets through a certain number of incorrect values, so it is suitable only for functional units that are insensitive to occasional errors.
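This double registering is the classic two-flip-flop synchronizer; a sketch for a single-bit signal follows (the ASYNC_REG attribute is a Xilinx-style placement hint and is optional):

```verilog
// Two-flip-flop synchronizer sketch for a single bit crossing into clk's
// domain. Stops metastability from propagating; multi-bit buses still
// need a handshake or FIFO instead.
module sync_2ff (
    input  wire clk,
    input  wire rst_n,
    input  wire async_in,   // signal from the other clock domain
    output wire sync_out
);
    (* ASYNC_REG = "TRUE" *) reg meta, stable;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            meta   <= 1'b0;
            stable <= 1'b0;
        end else begin
            meta   <= async_in;  // first flop may go metastable
            stable <= meta;      // second flop resolves to a valid level
        end
    end

    assign sync_out = stable;
endmodule
```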
To avoid incorrect sampling levels arising from asynchronous clock domains, it is generally recommended to use RAM or FIFO buffering methods to complete the data conversion between asynchronous clock domains. The most commonly used buffer unit is DPRAM, which writes data using the upper-level clock at the input port and reads data using the current-level clock at the output port, thus conveniently completing data exchange between asynchronous clock domains.
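A dual-clock DPRAM of the kind described can be sketched as follows; most synthesis tools infer block RAM from this pattern (hypothetical names, with address management and any cross-domain handshaking omitted):

```verilog
// Simple dual-clock DPRAM sketch: port A is written with the upstream
// clock, port B is read with the local processing clock.
module dp_ram #(
    parameter DW = 8,       // data width
    parameter AW = 10       // address width
)(
    input  wire          wr_clk,
    input  wire          wr_en,
    input  wire [AW-1:0] wr_addr,
    input  wire [DW-1:0] wr_data,
    input  wire          rd_clk,
    input  wire [AW-1:0] rd_addr,
    output reg  [DW-1:0] rd_data
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge wr_clk)          // upstream clock domain
        if (wr_en) mem[wr_addr] <= wr_data;

    always @(posedge rd_clk)          // local clock domain
        rd_data <= mem[rd_addr];
endmodule
```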
2. Is it necessary to add constraints for data interface synchronization design?
It is recommended to add appropriate constraints, especially for high-speed designs, to ensure that corresponding constraints are applied to periods, setup, and hold times.
Adding these constraints serves two purposes:
a. To improve the working frequency of the design and meet the requirements for data interface synchronization. By adding constraints for periods, setup time, and hold time, logic synthesis, mapping, layout, and routing can be controlled to reduce logic and routing delays, thereby increasing the working frequency and meeting data interface synchronization requirements.
b. To obtain correct timing analysis reports. Almost all FPGA design platforms include static timing analysis tools. Using these tools can yield timing analysis reports after mapping or layout and routing, allowing for performance evaluation of the design. Static timing analysis tools use constraints as standards to determine whether the timing meets design requirements; thus, designers are required to input constraints correctly so that static timing analysis tools can produce accurate timing analysis reports.
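As an illustration only, period, input-delay, and output-delay constraints in SDC/XDC-style syntax might look like the following (port names, the 100 MHz period, and the delay values are all hypothetical):

```tcl
# Constraint sketch (SDC/XDC-style, hypothetical names and values).
create_clock -period 10.000 -name sys_clk [get_ports sys_clk]   ;# 100 MHz period constraint
set_input_delay  -clock sys_clk -max 4.0 [get_ports din]        ;# setup-side input delay
set_input_delay  -clock sys_clk -min 1.0 [get_ports din]        ;# hold-side input delay
set_output_delay -clock sys_clk -max 3.0 [get_ports dout]
set_output_delay -clock sys_clk -min 0.5 [get_ports dout]
```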