Advanced FPGA Design Techniques

This document provides an overview of techniques for architecting digital designs to improve speed, including high throughput, low latency, and timing optimizations. It discusses how adding pipeline registers can increase throughput at the cost of latency, and how removing registers can reduce latency but increase critical path delays. Optimizing timing may involve adding register layers to break up long combinational paths. The document uses examples of exponentiation and FIR filters to illustrate these techniques.

WiCHIP Technologies

Engineering your ideas


Advanced FPGA Design
Architecturing Speed
Vu-Duc Ngo
[email protected]
Contents
• 1. Introduction to Architecting Speed
• 2. High Throughput
• 3. Low Latency
• 4. Timing
• 5. Summary
1. Introduction to Architecting Speed
Introduction to Architecting Speed
• When dealing with digital designs, optimizations are required
to meet design constraints.
• Speed is one of the three primary physical characteristics of a digital design
• Three primary definitions of speed:
• Throughput: amount of data processed per clock cycle (commonly measured in bits/s)
• Latency: time between data input and processed data output (measured in time or clock cycles)
• Timing: logic delays between sequential elements (which determine the clock period and frequency)
Introduction to Architecting Speed
• To improve speed, we need different architectures:
• High-throughput architectures: maximizing number of processed
bits/second
• Low-latency architectures: minimizing delay from input to output
• Timing optimizations: reducing combinatorial delay of the critical
path
• Techniques for timing optimization:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
2. High Throughput
High Throughput
• High-throughput design is concerned:
• More with the steady-state data rate
• Less with latency (i.e., the time a specific piece of data requires to propagate through the design)
• High-throughput design employs the so-called pipeline architecture
• New data can begin processing before the prior data has finished → improving system throughput
• Let us consider some examples to illustrate the beauty of
pipeline architecture
High Throughput
• Consider the following piece of code:
XPower = 1;
for (i = 0; i < 3; i++)
    XPower = X * XPower;

• This code is an iterative algorithm computing X³ = X · X · X
• The same variables and addresses are accessed until the completion of the computation
• We can also implement the above code in hardware using
Verilog as follows:
High Throughput
module power3(
    output [7:0] XPower,
    output finished,
    input [7:0] X,
    input clk, start);   // the duration of start is a single clock

    reg [7:0] ncount;
    reg [7:0] XPower;

    assign finished = (ncount == 0);

    always @(posedge clk)
        if (start) begin
            XPower <= X;
            ncount <= 2;
        end
        else if (!finished) begin
            ncount <= ncount - 1;
            XPower <= XPower * X;
        end
endmodule
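
• A minimal testbench sketch (not part of the original slides; the module name power3_tb and the stimulus values are illustrative) showing how the start/finished handshake of this iterative design might be exercised in simulation:

module power3_tb;
    reg clk = 0;
    reg start = 0;
    reg [7:0] X = 0;
    wire [7:0] XPower;
    wire finished;

    // device under test: the iterative power3 above
    power3 dut (.XPower(XPower), .finished(finished),
                .X(X), .clk(clk), .start(start));

    always #5 clk = ~clk;   // arbitrary clock period

    initial begin
        @(negedge clk); X = 8'd3; start = 1;   // present X and assert start for one clock
        @(negedge clk); start = 0;
        wait (finished);                        // ncount has counted down to 0
        $display("X = %0d, X^3 = %0d", X, XPower);   // expect 27
        $finish;
    end
endmodule

• For X = 3 the expected result is 27, available when finished goes high after three clock cycles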
High Throughput
Figure 1.1: Iterative Implementation. The same register and computational resources are reused until the completion of the computation

• No new computations can begin until the previous computation has completed
High Throughput
• Handshaking signals are required: start, finished
• Performance of the above Verilog implementation:
• Throughput = 8 bits / 3 clocks ≈ 2.7 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path

• Let us have a look at a pipelined version of the same algorithm


High Throughput
module power3(
    output reg [7:0] XPower,
    input clk,
    input [7:0] X
    );
    reg [7:0] XPower1, XPower2;
    reg [7:0] X1, X2;

    always @(posedge clk) begin
        // Pipeline stage 1
        X1 <= X;
        XPower1 <= X;
        // Pipeline stage 2
        X2 <= X1;
        XPower2 <= XPower1 * X1;
        // Pipeline stage 3
        XPower <= XPower2 * X2;
    end
endmodule
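
• A brief simulation sketch (illustrative, not from the original slides; the testbench name and values are assumptions) that feeds the pipelined power3 above a new X on every clock; after the initial three-cycle latency, one result emerges per clock, which is the throughput gain discussed next:

module power3_pipe_tb;
    reg clk = 0;
    reg [7:0] X = 8'd2;
    wire [7:0] XPower;

    // pipelined power3 from the previous slide
    power3 dut (.XPower(XPower), .clk(clk), .X(X));

    always #5 clk = ~clk;

    initial begin
        repeat (8) begin
            @(negedge clk);
            $display("X in = %0d, XPower out = %0d", X, XPower);
            X = X + 1;   // a new operand enters the pipeline every clock
        end
        $finish;
    end
endmodule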
High Throughput
• Pipelined implementation can be described by the following figure:
Figure 1.2: Pipelined Implementation. One multiplier delay in the critical path. 1st clock: compute X²; 2nd clock: compute X·X² (and the next X²); 3rd clock: output X³ (while the next X·X² and the next-next X² are computed)


High Throughput
• Performance of the pipelined implementation:
• Throughput = 8 bits / 1 clock = 8 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path

• Clearly, this implementation has improved throughput by a factor of 3

• Unrolling an iterative loop increases throughput


• In general, unrolling an iterative loop with n iterations into a pipelined implementation increases throughput by a factor of n (a parameterized sketch of this idea follows this list)

• Penalty: increase in area (more registers, more multipliers)
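
• As a rough illustration of this generalization, the sketch below (not from the original slides; the module name powerN, the parameters N and W, and the array-based coding style are illustrative assumptions) parameterizes the unrolled pipeline so that N multiply stages produce one X^(N+1) result per clock:

module powerN #(parameter N = 2,            // number of multiply stages (result is X^(N+1))
                parameter W = 8)
   (output [W-1:0] XPower,
    input  [W-1:0] X,
    input          clk);

    reg [W-1:0] xpipe [0:N];   // delayed copies of X
    reg [W-1:0] ppipe [0:N];   // running powers of X
    integer i;

    always @(posedge clk) begin
        xpipe[0] <= X;
        ppipe[0] <= X;
        for (i = 1; i <= N; i = i + 1) begin
            xpipe[i] <= xpipe[i-1];                 // xpipe[N] is unused but harmless
            ppipe[i] <= ppipe[i-1] * xpipe[i-1];
        end
    end

    assign XPower = ppipe[N];   // one new result per clock after N+1 cycles of latency
endmodule

• With N = 2 and W = 8 this reduces to the three-stage power-of-3 pipeline shown above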


3. Low Latency
Low latency
• Low-latency design: passes data from the input to the output as quickly
as possible
• To achieve this goal, intermediate processing delays must be minimized
• Often, a low-latency design requires:
• Parallelism
• Removal of pipelining
• Logical short cuts
• Observing the previous pipelined implementation, we can see the
possibility of reducing latency:
• At each pipeline stage, the product of each multiply must wait for the
next clock edge to propagate to the next stage
• Therefore, we expect that removing the pipeline registers will reduce latency
Low latency
module power3(
    output [7:0] XPower,
    input [7:0] X
    );
    reg [7:0] XPower1, XPower2;
    reg [7:0] X1, X2;

    assign XPower = XPower2 * X2;

    always @* begin
        X1 = X;
        XPower1 = X;
    end
    always @* begin
        X2 = X1;
        XPower2 = XPower1 * X1;
    end
endmodule
Low latency
• Low-latency implementation is illustrated as follows:

Figure 1.3: Low-latency Implementation. The pipeline registers no longer exist, leaving two multiplier delays in the critical path


Low latency
• Performance of this low-latency design:
• Throughput = 8 bits/clock (assuming one new input per clock)
• Latency = Between one and two multiplier delays, 0 clocks
• Timing = Two multiplier delays in the critical path

• Latency can be reduced by removing pipeline registers

• Penalty: increase in combinatorial delay


4. Timing
Timing
• Timing refers to the clock speed of a design
• Maximum delay between any two sequential elements in a design will
determine the max clock speed.
• Maximum speed, or maximum frequency, is defined as follows:

Fmax = 1 / (Tclk-q + Tlogic + Trouting + Tsetup - Tskew)

• where
Fmax : maximum allowable frequency for the clock
Tclk-q : time from clock arrival until data arrives at Q
Tlogic : propagation delay through the logic between flip-flops
Trouting : routing delay between flip-flops
Tsetup : setup time of the capturing flip-flop
Tskew : propagation delay of the clock between flip-flops
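
• As a worked numeric illustration (delay values are hypothetical, not from the original slides): with Tclk-q = 0.5 ns, Tlogic = 2.5 ns, Trouting = 1.0 ns, Tsetup = 0.5 ns and Tskew = 0 ns, the total path delay is 4.5 ns, so Fmax = 1 / 4.5 ns ≈ 222 MHz
• Halving Tlogic to 1.25 ns (e.g., by inserting a register layer, as in the next slides) would give Fmax = 1 / 3.25 ns ≈ 308 MHz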
Timing – Add Register Layers
• Adding intermediate layers of registers to the critical path is the
first strategy for architectural timing improvement
• This technique should be used in highly pipelined designs, where:
• An additional clock cycle latency does not violate the design
specifications
• The overall functionality will not be affected by the further addition of
registers

• For example: assume the architecture for the following FIR


implementation does not meet timing:
Timing – Add Register Layers
module fir(
    output [7:0] Y,
    input [7:0] A, B, C, X,
    input clk,
    input validsample);

    reg [7:0] X1, X2, Y;

    always @(posedge clk)
        if (validsample) begin
            X1 <= X;
            X2 <= X1;
            Y <= A*X + B*X1 + C*X2;
        end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:
Figure 1.4: MAC with long path. One adder delay plus one multiplier delay in the critical path is greater than the minimum clock period requirement, i.e., timing is not met


• In this implementation, all multiply/add operations occur in one clock cycle
Timing – Add Register Layers

• To resolve the issue, we can further pipeline this design


• By adding extra registers intermediate to the multipliers
• Under the assumption that the latency requirement is not fixed at one clock cycle

• We just have to add a pipeline layer between the multipliers and the
adder
Timing – Add Register Layers
module fir(
    output [7:0] Y,
    input [7:0] A, B, C, X,
    input clk,
    input validsample);

    reg [7:0] X1, X2, Y;
    reg [7:0] prod1, prod2, prod3;   // pipeline registers added between multipliers and adder

    always @(posedge clk) begin
        if (validsample) begin
            X1 <= X;
            X2 <= X1;
            prod1 <= A * X;
            prod2 <= B * X1;
            prod3 <= C * X2;
        end
        Y <= prod1 + prod2 + prod3;
    end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:
Figure 1.5: Pipeline registers added. Intermediate registers are inserted between the multipliers and the adder, leaving only one multiplier delay (or one adder delay) in the critical path; the shorter delay meets the timing requirement


Timing – Add Register Layers

• Adding register layers improves timing by dividing the critical path into
two paths of smaller delay.

• Penalty: increase in latency and area


Timing – Parallel Structures
• Reorganizing the critical path such that logic structures are
implemented in parallel is the second strategy for architectural
timing improvement
• This technique should be used whenever:
• A function currently evaluates through a serial string of logic
• But that function can be broken up and evaluated in parallel

• For instance, assume that the standard pipelined power-of-3 design


discussed above does not meet timing
• To create parallel structures:
• Break up the multipliers into independent operations
• Recombine them to get results
Timing – Parallel Structures
• For example:
• To compute the square of an 8-bit binary number X, i.e., X², we first write
X = {A, B}
where A and B are the most and the least significant nibbles, respectively
• Then, the multiply operation can be reorganized as follows:
X * X = {A, B} * {A, B} = (A*A << 8) + (2*A*B << 4) + (B*B)
• This reduces the problem to a series of 4-bit multiplications that can be handled in parallel
• Recombining the partial products gives the final result of the 8-bit multiplication, as checked in the example below
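
• A quick arithmetic check (the value X = 23 is illustrative, not from the original slides): for X = 23 = 8'b0001_0111, the nibbles are A = 1 and B = 7, so (A*A << 8) + (2*A*B << 4) + (B*B) = 256 + 224 + 49 = 529 = 23², matching the full 8-bit multiplication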
Timing – Parallel Structures
• Applying this principle, we can re-implement the power-of-3 design as
follows:

module power3(
    output [7:0] XPower,
    input [7:0] X,
    input clk);

    reg [7:0] XPower1;
    // partial product registers (a 4-bit x 4-bit product needs 8 bits)
    reg [7:0] XPower2_ppAA, XPower2_ppAB, XPower2_ppBB;
    reg [7:0] XPower3_ppAA, XPower3_ppAB, XPower3_ppBB;
    reg [7:0] X1, X2;
    wire [7:0] XPower2;
Timing – Parallel Structures
    // nibbles for partial products (A: MS nibble, B: LS nibble)
    wire [3:0] XPower1_A = XPower1[7:4];
    wire [3:0] XPower1_B = XPower1[3:0];
    wire [3:0] X1_A = X1[7:4];
    wire [3:0] X1_B = X1[3:0];
    wire [3:0] XPower2_A = XPower2[7:4];
    wire [3:0] XPower2_B = XPower2[3:0];
    wire [3:0] X2_A = X2[7:4];
    wire [3:0] X2_B = X2[3:0];

    // assemble the partial products to combine them into the 8-bit results
    assign XPower2 = (XPower2_ppAA << 8) + (2*XPower2_ppAB << 4)
                   + XPower2_ppBB;
    assign XPower  = (XPower3_ppAA << 8) + (2*XPower3_ppAB << 4)
                   + XPower3_ppBB;
Timing – Parallel Structures
    always @(posedge clk) begin
        // Pipeline stage 1
        X1 <= X;
        XPower1 <= X;
        // Pipeline stage 2: create partial products (the 4-bit products are handled in parallel)
        X2 <= X1;
        XPower2_ppAA <= XPower1_A * X1_A;
        XPower2_ppAB <= XPower1_A * X1_B;
        XPower2_ppBB <= XPower1_B * X1_B;
        // Pipeline stage 3: create partial products
        XPower3_ppAA <= XPower2_A * X2_A;
        XPower3_ppAB <= XPower2_A * X2_B;
        XPower3_ppBB <= XPower2_B * X2_B;
    end
endmodule
Timing – Flatten Logic Structures
• Flattening logic structures is the third strategy for architectural
timing improvement
• This technique is closely related to the idea of parallel structures, but
applies specifically to logic that is chained due to priority encoding.

• Normally, synthesis and layout tools are not smart enough to break up
logic structures coded in a serial fashion
• They also do not have enough information relating to the priority
requirements of the design
• Therefore, it is possible that they are unable to give us a configuration
with our desired timing
Timing – Flatten Logic Structures
• For example:
• Consider the following control signals coming from an address decode
that are used to write 4 registers:
module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);

    // control signals are coded with a priority relative to one another
    always @(posedge clk)
        if (ctrl[0])      rout[0] <= in;
        else if (ctrl[1]) rout[1] <= in;
        else if (ctrl[2]) rout[2] <= in;
        else if (ctrl[3]) rout[3] <= in;
endmodule
Timing – Flatten Logic Structures
• The above type of priority encoding is implemented as follows:

Figure 1.6: Priority encoding. Combinatorial logic gates are required because of the priority encoding, introducing delay


Timing – Flatten Logic Structures
• Time delay is required for the priority logic → timing is not met
• Solution is to remove the priority and thereby flatten the logic

module regwrite(
    output reg [3:0] rout,
    input clk, in,
    input [3:0] ctrl);

    // priority is removed: the control signals act independently
    always @(posedge clk) begin
        if (ctrl[0]) rout[0] <= in;
        if (ctrl[1]) rout[1] <= in;
        if (ctrl[2]) rout[2] <= in;
        if (ctrl[3]) rout[3] <= in;
    end
endmodule
Timing – Flatten Logic Structures
Figure 1.7: No priority encoding. The gates are removed; each control signal acts independently and drives its corresponding rout bit independently


Timing – Flatten Logic Structures

• By removing priority encodings where they are not needed, the logic
structure is flattened and the path delay is reduced
Timing – Register Balancing
• Register balancing is the fourth strategy for architectural timing
improvement
• Idea of register balancing: redistribute logic evenly between registers to minimize the worst-case delay between any two registers
• This technique should be used whenever:
• Logic is highly imbalanced between the critical path and an adjacent path

• Although many synthesis tools offer a so-called register balancing optimization, they can handle only simple cases in a predetermined fashion
• Thus, designers must be able to redistribute logic in their own way to reduce the worst-case delay
Timing – Register Balancing
• Consider the following code for an adder that adds three 8-bit inputs:

module adder(
    output reg [7:0] Sum,
    input [7:0] A, B, C,
    input clk);

    reg [7:0] rA, rB, rC;

    always @(posedge clk) begin
        rA <= A;
        rB <= B;
        rC <= C;
        Sum <= rA + rB + rC;
    end
endmodule
Timing – Register Balancing

Figure 1.8: Registered adder. There is no logic in the first register stage, while the chained additions that produce Sum in the second stage determine the worst-case delay
Timing – Register Balancing
• Some of the logic in the critical path can be moved back a stage, thereby
balancing the logic load between two register stages.

module adder(
    output reg [7:0] Sum,
    input [7:0] A, B, C,
    input clk);

    reg [7:0] rABSum, rC;

    always @(posedge clk) begin
        rABSum <= A + B;   // part of the Sum computation is moved back a stage
        rC <= C;
        Sum <= rABSum + rC;
    end
endmodule
Timing – Register Balancing

Figure 1.9: Registers balanced. One of the add operations is moved back a stage, balancing the logic between the two register stages and reducing the delay in the critical path
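
• A rough numeric illustration (hypothetical delays, not from the original slides): if each 8-bit add costs about 2 ns, the original design has 0 ns of logic in the first stage and about 4 ns (two chained adds) in the second, so the critical path is roughly 4 ns; after balancing, each stage carries one add of about 2 ns, roughly halving the critical path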


Timing – Reorder Paths
• Reordering the paths in the data flow to minimize the critical
path is the fifth strategy

• This technique should be used whenever:


• Multiple paths combine with the critical path
• The combined path can be reordered such that the critical path
becomes shorter
Timing – Reorder Paths
• Consider the following module:

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);

    always @(posedge clk)
        if (Cond1)
            Out <= A;
        else if (Cond2 && (C < 8))
            Out <= B;
        else
            Out <= C;
endmodule
Timing – Reorder Paths

Figure 1.10: Long critical path. The critical path runs from C to Out: the comparator in series with two gates


Timing – Reorder Paths
• We can modify the code to reorder the paths so that the long comparator delay is moved off the critical path:

module randomlogic(
    output reg [7:0] Out,
    input [7:0] A, B, C,
    input clk,
    input Cond1, Cond2);

    wire CondB = (Cond2 & !Cond1);

    always @(posedge clk)
        if (CondB && (C < 8))
            Out <= B;
        else if (Cond1)
            Out <= A;
        else
            Out <= C;
endmodule
Timing – Reorder Paths

Figure 1.11: Logic reordered to reduce critical path. The gate is moved out of the critical path, giving a shorter delay and meeting the timing requirement

• The reordered code is logically equivalent: the assignment of A still takes priority because CondB is forced low whenever Cond1 is asserted


5. Summary
Summary
• A high-throughput architecture: maximizes the number of
bits per second that can be processed by a design
• Unrolling an iterative loop increases throughput
• Penalty: a proportional increase in area
• A low-latency architecture: minimizes the delay from the input
of a module to the output
• Latency can be reduced by removing pipeline registers
• Penalty: an increase in combinatorial delay between registers
• Timing refers to the clock speed of a design
Summary
• A design meets timing: maximum delay between any two
sequential elements < minimum clock period
• Timing can be improved by:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
