Welcome FPGA engineers to join the official WeChat group.
Implement a module on FPGA to find the maximum value and second maximum value among 32 inputs, given in one clock cycle. (This question is from a forum, an interview question; if you find it inappropriate, please leave a message to delete it.)
Clickthe blue wordsto follow us, FPGA Home – the best and largest pure FPGA engineering community in China
From my personal point of view, this is a very good interview question:
-
Firstly, this is probably a simplification of problems encountered in some machine learning algorithms, making it a meaningful question;
-
Secondly, this question requires not only FPGA coding skills but also many possibilities for algorithm optimization;
Of course, the bit width of the input may affect the final solution and implementation possibilities. However, within a certain range, such as 8 or 32, the solution should be consistent, only affecting the final frequency. The following text will analyze this question in detail. (The question does not specify how to handle duplicate elements; here it is assumed that the maximum and second maximum values can be the same, i.e., counting duplicate elements.)
1. Solution
From the algorithm itself, the process of finding the maximum and second maximum values is very simple; through two traversals: the first to find the maximum value, the second to find the second maximum value; the algorithm complexity is O(2n). It is obviously impossible for FPGA to complete such a complex operation in one cycle, and generally requires pipeline design. Under this method, the entire structure is as follows:
-
By comparison, find the maximum value, using a pipeline to compare pairs, 32-16-8-4-2-1 can find the maximum value with 5 clock delays;
-
Since the second maximum value needs to be determined, it is necessary to maintain the coordinates of the maximum value during the process of finding the maximum value;
-
Clear the value at the maximum coordinate (set to minimum)
-
By pipeline implementation of pairwise comparisons, 32-16-8-4-2-1, after 5 clock delays can find the second maximum value;
This solution has several drawbacks, including: the delay for finding the maximum and second maximum values requires 5 clock delays each, the total delay will exceed 10 cycles; resource usage is relatively high, maintaining the coordinates of the maximum value and the clearing operation consume a lot of resources, and to calculate the second maximum value, it is necessary to store inputs for several cycles, which consumes a lot of registers.
Another idea is to consider finding the maximum and second maximum values simultaneously. Since this logic is relatively complex, it can be pipelined, as shown in the figure below. (For 8 inputs, 32 inputs need to increase two levels)
In this, the sort module completes the sorting of 4 inputs, obtaining the maximum and second maximum values output. Sorting 4 numbers is relatively complex, and this process takes about 2-3 cycles to complete. For 32 inputs, the input data goes through 32-16-8-4-2 to get the result, with a delay of about 10 cycles.
2. Divide and Conquer
If you need to implement a specific algorithm on FPGA, then find a suitable method to implement it; but if you want to achieve a specific function, then you need to find an excellent and suitable method for FPGA implementation.
Finding the maximum and second maximum values is a very incomplete sorting process, with a simple search complexity of O(2n), which is not conducive to hardware implementation. For sorting, whether it is quicksort or merge sort, both use the divide and conquer approach. If we try to solve this problem using the divide and conquer method, consider when there are only 2 inputs, one comparison can yield the output, resulting in an ordered array of length 2. If there are two ordered arrays, then two comparisons can yield the maximum and second maximum values. Using the merge sort approach, the complexity of finding the maximum and second maximum values is O(1.5n) (i.e., n/2+n/2+n/4…, not sure if I calculated it wrong). Using the merge sort approach is more efficient from an algorithmic time complexity perspective.
So is this solution suitable for FPGA implementation? The answer is definitely yes. The locality of divide and conquer is suitable for FPGA’s pipeline implementation, as shown in the block diagram below. (For 8 inputs, 32 inputs need to increase two levels)
In this, the meg module has two levels of comparators, generally, it can be completed in 1 clock cycle, and the input data goes through 32-32-16-8-4-2 to get the result, with a delay of 5 clock cycles. The implementation code is as follows:
Code
-
module test#(
-
parameter DW = 8
-
)
-
(
-
input clk,
-
input [32*DW-1 :0] din,
-
output [DW-1:0] max1,
-
output [DW-1:0] max2
-
);
-
wire[DW-1:0] d[31:0];
-
generate
-
genvar i;
-
for(i=0;i<32;i=i+1)
-
begin:loop_assign
-
assign d[i] = din[DW*i+DW-1:DW*i];
-
end
-
endgenerate
-
// stage 1,comp
-
reg[DW-1:0] s1_max[15:0];
-
reg[DW-1:0] s1_min[15:0];
-
generate
-
for(i=0;i<16;i=i+1)
-
begin:loop_comp
-
always@(posedge clk)
-
if(d[2*i]>d[2*i+1])begin
-
s1_max[i] <= d[2*i];
-
s1_min[i] <= d[2*i+1];
-
end
-
else begin
-
s1_max[i] <= d[2*i+1];
-
s1_min[i] <= d[2*i];
-
end
-
end
-
endgenerate
-
// stage 2,
-
wire[DW-1:0] s2_max[7:0];
-
wire[DW-1:0] s2_min[7:0];
-
generate
-
for(i=0;i<8;i=i+1)
-
begin:loop_megs2
-
meg u_s2meg(
-
.clk(clk),
-
.g1_max(s1_max[2*i]),
-
.g1_min(s1_min[2*i]),
-
.g2_max(s1_max[2*i+1]),
-
.g2_min(s1_min[2*i+1]),
-
.max1(s2_max[i]),
-
.max2(s2_min[i])
-
);
-
end
-
endgenerate
-
// stage 3,
-
wire[DW-1:0] s3_max[3:0];
-
wire[DW-1:0] s3_min[3:0];
-
generate
-
for(i=0;i<4;i=i+1)
-
begin:loop_megs3
-
meg u_s3meg(
-
.clk(clk),
-
.g1_max(s2_max[2*i]),
-
.g1_min(s2_min[2*i]),
-
.g2_max(s2_max[2*i+1]),
-
.g2_min(s2_min[2*i+1]),
-
.max1(s3_max[i]),
-
.max2(s3_min[i])
-
);
-
end
-
endgenerate
-
// stage 4,
-
wire[DW-1:0] s4_max[1:0];
-
wire[DW-1:0] s4_min[1:0];
-
generate
-
for(i=0;i<2;i=i+1)
-
begin:loop_megs4
-
meg u_s4meg(
-
.clk(clk),
-
.g1_max(s3_max[2*i]),
-
.g1_min(s3_min[2*i]),
-
.g2_max(s3_max[2*i+1]),
-
.g2_min(s3_min[2*i+1]),
-
.max1(s4_max[i]),
-
.max2(s4_min[i])
-
);
-
end
-
endgenerate
-
// stage 5,
-
meg u_s5meg(
-
.clk(clk),
-
.g1_max(s4_max[0]),
-
.g1_min(s4_min[0]),
-
.g2_max(s4_max[1]),
-
.g2_min(s4_min[1]),
-
.max1(max1),
-
.max2(max2)
-
);
-
endmodule
-
module meg#(
-
parameter DW = 8
-
)
-
(
-
input clk,
-
input [DW-1 :0] g1_max,
-
input [DW-1 :0] g1_min,
-
input [DW-1 :0] g2_max,
-
input [DW-1 :0] g2_min,
-
output reg [DW-1:0] max1,
-
output reg [DW-1:0] max2
-
);
-
always@(posedge clk)
-
begin
-
if(g1_max>g2_max) begin
-
max1 <= g1_max;
-
if(g2_max>g1_min)
-
max2 <= g2_max;
-
else
-
max2 <= g1_min;
-
end
-
else begin
-
max1 <= g2_max;
-
if(g1_max>g2_min)
-
max2 <= g1_max;
-
else
-
max2 <= g2_min;
-
end
-
end
-
endmodule
3. Others
Simple testing of the above code on the previous generation device (20nm FPGA), the 8bit data input module can synthesize to a very high frequency, with a logic level of about 5 levels. For the entire project, the bottleneck will not occur in this part. The 32bit data input frequency will not be very high due to the large data width, but by making the meg module one level pipeline, it will hardly become the bottleneck of the entire system.
In the case of 32-bit 32 inputs, the data input width is 1024 (not IO input, but internal signal). Previously, such a large width of data may not be used in communication/digital signal processing, but for AI applications on FPGA, thousands of bits of input should be very common, which will indeed affect the final implementation effect on FPGA. To make machine learning algorithms run better on FPGA, both algorithms and FPGA need to work together.

Welcome communication engineers and FPGA engineers to follow the public account

The largest FPGA WeChat technology group in the country
Welcome everyone to join the national FPGA WeChat technology group, this group has tens of thousands of engineers, a group of engineers who love technology, here FPGA engineers help each other, share, with a strong technical atmosphere!Quickly call your friends to join!!
Press and hold to join the national FPGA technology group
FPGA Home Components City
Advantageous component services, if needed, please scan the code to contact the group owner: Jin Juan Email: [email protected] Welcome to recommend to procurement
ACTEL, AD part of the advantageous ordering (operating the full series):
XILINX, ALTERA advantageous spot or ordering (operating the full series):
(The above components are part of the models, for more models please consult the group owner Jin Juan)
Service concept: FPGA Home Components self-operated City aims to facilitate engineers to quickly and conveniently purchase component services. After several years of sincere service, our customer service is spread across large listed companies, military scientific research units, and small and medium-sized enterprises. The biggest advantage is emphasizing the service-oriented concept, and achieving fast delivery and favorable prices!
Direct brand: Xilinx ALTERA ADI TI NXP ST E2V, and more than a hundred component brands, especially good at components subject to US embargo against China,Welcome engineer friends to recommend us to procurement or consult us directly!We will continue to provide the best service in the industry!
FPGA technology group official thanks to brands: Xilinx, intel (Altera), microsemi (Actel), LattIC e, Vantis, Quicklogic, Lucent, etc.