WiCHIP Technologies
Engineering your ideas
Advanced FPGA Design
Architecting Speed
Vu-Duc Ngo
[email protected]Contents
• 1. Introduction to Architecting Speed
• 2. High Throughput
• 3. Low Latency
• 4. Timing
• 5. Summary
1. Introduction to Architecting Speed
Introduction to Architecting Speed
• When dealing with digital designs, optimizations are required
to meet design constraints.
• One of the three primary physical characteristics of a digital design
is speed
• Three primary definitions of speed:
• Throughput: the amount of data processed per clock cycle (e.g., bits per second)
• Latency: the time between a data input and its corresponding processed output
(measured in time or clock cycles)
• Timing: the logic delays between sequential elements (clock period,
frequency)
Introduction to Architecting Speed
• To improve speed, we need different architectures:
• High-throughput architectures: maximizing number of processed
bits/second
• Low-latency architectures: minimizing the delay from input to output
• Timing optimizations: reducing combinatorial delay of the critical
path
• Techniques for timing optimization:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
2. High Throughput
High Throughput
• High-throughput design is concerned:
• More with the steady-state data rate
• Less with latency (i.e., the time a specific piece of data takes to
propagate through the design)
• High-throughput design typically employs a pipelined architecture
• New data can begin processing before the prior data has finished,
which improves system throughput
• Let us consider some examples to illustrate the beauty of
pipeline architecture
High Throughput
• Consider the following piece of code:
XPower = 1;
for (i=0; i < 3; i++)
XPower = X * XPower;
• This code is an iterative algorithm:
• The same variables and addresses are accessed repeatedly until the
computation completes
• We can also implement the above code in hardware using
Verilog as follows:
High Throughput
module power3(
  output reg [7:0] XPower,
  output finished,
  input [7:0] X,
  input clk, start); // start is asserted for a single clock
  reg [7:0] ncount;  // remaining multiply iterations
  assign finished = (ncount == 0);
  always @(posedge clk)
    if(start) begin
      XPower <= X;   // load the operand
      ncount <= 2;   // two more multiplies give X**3
    end
    else if(!finished) begin
      ncount <= ncount - 1;
      XPower <= XPower * X; // the same multiplier is reused each cycle
    end
endmodule
High Throughput
The same register and computational resources are reused until the completion of the computation
Figure 1.1: Iterative Implementation
• No new computations can begin until the previous computation has
completed
High Throughput
• Handshaking signals are required: start, finished
• Performance of the above Verilog implementation:
• Throughput = 8 bits / 3 clocks ≈ 2.7 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path
• Let us have a look at a pipelined version of the same algorithm
High Throughput
module power3(
output reg [7:0] XPower,
input clk,
input [7:0] X
);
reg [7:0] XPower1, XPower2;
reg [7:0] X1, X2;
always @(posedge clk) begin
// Pipeline stage 1
X1 <= X;
XPower1 <= X;
// Pipeline stage 2
X2 <= X1;
XPower2 <= XPower1 * X1;
// Pipeline stage 3
XPower <= XPower2 * X2;
end
endmodule
High Throughput
• Pipelined implementation can be described by the following figure:
One multiplier delay in the critical path
1st clock: compute X²
2nd clock: compute X*X²; compute the next X²
3rd clock: output X³; compute the next X*X²; compute the next-next X²
Figure 1.2: Pipelined Implementation
High Throughput
• Performance of the pipelined implementation:
• Throughput = 8 bits / 1 clock = 8 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path
• Clearly, this implementation has improved throughput by a factor of 3;
the simulation sketch below illustrates the one-result-per-clock behavior
• Unrolling an iterative loop increases throughput
• In general, unrolling an iterative loop with n iterations into a
pipelined implementation increases throughput by a factor of n
• Penalty: an increase in area (more registers and more multipliers)
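As an illustration (not part of the original slides), the following minimal testbench sketch assumes the pipelined power3 above is the module compiled and drives it with a new operand on every clock. After the initial three-clock latency, a new (truncated, 8-bit) cube appears on every clock edge, matching the 8 bits/clock throughput figure. The instance and testbench names are hypothetical.
module tb_power3_pipelined;
  reg        clk = 0;
  reg  [7:0] X   = 0;
  wire [7:0] XPower;
  // hypothetical instance; ports match the pipelined power3 above
  power3 dut (.XPower(XPower), .clk(clk), .X(X));
  always #5 clk = ~clk;          // free-running clock, period of 10 time units
  always @(posedge clk) begin
    X <= X + 1;                  // feed a new operand every clock
    $display("t=%0t X=%0d XPower=%0d", $time, X, XPower);
  end
  initial #120 $finish;
endmodule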
3. Low Latency
Low latency
• Low-latency design: passes data from the input to the output as quickly
as possible
• To achieve this goal, intermediate processing delays must be minimized
• Often, a low-latency design requires:
• Parallelism
• Removal of pipelining
• Logical shortcuts
• Observing the previous pipelined implementation, we can see the
possibility of reducing latency:
• At each pipeline stage, the product of each multiply must wait for the
next clock edge to propagate to the next stage
• Therefore, we expect that removing the pipeline registers will reduce
latency
Low latency
module power3(
output [7:0] XPower,
input [7:0] X
);
reg [7:0] XPower1, XPower2;
reg [7:0] X1, X2;
assign XPower = XPower2 * X2;
always @* begin
X1 = X;
XPower1 = X;
end
always @* begin
X2 = X1;
XPower2 = XPower1*X1;
end
endmodule
Low latency
• Low-latency implementation is illustrated as follows:
Two multiplier delays in the critical path
Removal of pipelining: the registers no longer exist
Figure 1.3: Low-latency Implementation
Low latency
• Performance of this low-latency design:
• Throughput = 8 bits/clock (assuming one new input per clock)
• Latency = Between one and two multiplier delays, 0 clocks
• Timing = Two multiplier delays in the critical path
• Latency can be reduced by removing pipeline registers, as the simulation
sketch below illustrates
• Penalty: an increase in combinatorial delay between registers
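As an illustration (not part of the original slides), a minimal testbench sketch assuming the combinational power3 above is the module compiled: since there is no clock, the output settles after only the two multiplier delays, i.e., with zero clock cycles of latency. The instance and testbench names are hypothetical.
module tb_power3_comb;
  reg  [7:0] X;
  wire [7:0] XPower;
  // hypothetical instance; ports match the low-latency power3 above
  power3 dut (.XPower(XPower), .X(X));
  initial begin
    X = 8'd3; #1 $display("X=%0d XPower=%0d", X, XPower); // expect 27
    X = 8'd5; #1 $display("X=%0d XPower=%0d", X, XPower); // expect 125
    $finish;
  end
endmodule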
4. Timing
Timing
• Timing refers to the clock speed of a design
• The maximum delay between any two sequential elements in a design
determines the maximum clock speed.
• Maximum speed, or maximum frequency, is defined as follows:
Fmax = 1 / (Tclk-q + Tlogic + Trouting + Tsetup - Tskew)
• where:
• Fmax: maximum allowable clock frequency
• Tclk-q: time from clock arrival until data arrives at Q
• Tlogic: propagation delay through the logic between flip-flops
• Trouting: routing delay between flip-flops
• Tsetup: setup time of the capture flip-flop
• Tskew: propagation delay (skew) of the clock between the two flip-flops
• A worked example with hypothetical delay values is shown below
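As a purely illustrative example with hypothetical delay values (not taken from the slides): suppose Tclk-q = 0.5 ns, Tlogic = 2.0 ns, Trouting = 1.0 ns, Tsetup = 0.5 ns, and Tskew = 0 ns. Then:
Fmax = 1 / (0.5 + 2.0 + 1.0 + 0.5 - 0) ns = 1 / 4.0 ns = 250 MHz
Halving Tlogic to 1.0 ns (for example, by adding a register layer, as discussed next) would raise Fmax to 1 / 3.0 ns ≈ 333 MHz.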
Timing – Add Register Layers
• Adding intermediate layers of registers to the critical path is the
first strategy for architectural timing improvement
• This technique should be used in highly pipelined designs, where:
• An additional clock cycle latency does not violate the design
specifications
• The overall functionality will not be affected by the further addition of
registers
• For example: assume the architecture for the following FIR
implementation does not meet timing:
Timing – Add Register Layers
module fir(
  output reg [7:0] Y,
  input [7:0] A, B, C, X,
  input clk,
  input validsample);
  reg [7:0] X1, X2; // delayed input samples
  always @(posedge clk)
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      Y <= A*X + B*X1 + C*X2; // all multiplies and adds in one clock cycle
    end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:
One adder delay + one multiplier delay in the critical path is greater than the minimum clock period requirement, i.e., timing is not met
Figure 1.4: MAC with long path
• In this implementation, all multiply/add operations occur in one clock
cycle
Timing – Add Register Layers
• To resolve the issue, we can further pipeline this design
• By adding intermediate registers on the multiplier outputs
• Under the assumption that the latency requirement is not fixed at one clock cycle
• We just have to add one pipeline layer between the multipliers and the
adder
Timing – Add Register Layers
module fir(
  output reg [7:0] Y,
  input [7:0] A, B, C, X,
  input clk,
  input validsample);
  reg [7:0] X1, X2;
  reg [7:0] prod1, prod2, prod3; // pipeline registers added for the products
  always @(posedge clk) begin
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      prod1 <= A * X;
      prod2 <= B * X1;
      prod3 <= C * X2;
    end
    Y <= prod1 + prod2 + prod3;
  end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:
Intermediate registers added between the multipliers and the adder
One multiplier delay or one adder delay (not both) in the critical path: the shorter delay meets the timing requirement
Figure 1.5: Pipeline registers added
Timing – Add Register Layers
• Adding register layers improves timing by dividing the critical path into
two paths of smaller delay.
• Penalty: increase in latency and area
Timing – Parallel Structures
• Reorganizing the critical path such that logic structures are
implemented in parallel is the second strategy for architectural
timing improvement
• This technique should be used whenever:
• A function currently evaluates through a serial string of logic
• But it can be broken up and evaluated in parallel
• For instance, assume that the standard pipelined power-of-3 design
discussed above does not meet timing
• To create parallel structures:
• Break up the multipliers into independent operations
• Recombine them to get results
Timing – Parallel Structures
• For example:
• To compute the square of an 8-bit binary number X, i.e., X², we first
represent:
X = {A, B}
where A and B are the most and the least significant nibbles, respectively
• Then, the multiply operation can be reorganized as follows:
X * X = {A, B} * {A, B} = (A*A << 8) + (2*A*B << 4) + (B*B)
• By doing this, we reduce the problem to a series of 4-bit
multiplications that can be handled in parallel
• Recombining the partial products gives the final result of the 8-bit
multiplication; a quick numeric check is shown below
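As a quick numeric check with an arbitrarily chosen value (not from the slides), let X = 53 = 8'h35, so A = 3 and B = 5:
A*A = 9, A*B = 15, B*B = 25
X * X = (9 << 8) + (2*15 << 4) + 25 = 2304 + 480 + 25 = 2809 = 53²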
Timing – Parallel Structures
• Applying this principle, we can re-implement the power-of-3 design as
follows:
module power3(
output [7:0] XPower,
input [7:0] X,
input clk);
reg [7:0] XPower1;
// partial product registers
reg [3:0] XPower2_ppAA, XPower2_ppAB, XPower2_ppBB;
reg [3:0] XPower3_ppAA, XPower3_ppAB, XPower3_ppBB;
reg [7:0] X1, X2;
wire [7:0] XPower2;
Timing – Parallel Structures
// nibbles for partial products (A: MS nibble, B: LS nibble)
wire [3:0] XPower1_A = XPower1[7:4];
wire [3:0] XPower1_B = XPower1[3:0];
wire [3:0] X1_A = X1[7:4];
wire [3:0] X1_B = X1[3:0];
wire [3:0] XPower2_A = XPower2[7:4];
wire [3:0] XPower2_B = XPower2[3:0];
wire [3:0] X2_A = X2[7:4];
wire [3:0] X2_B = X2[3:0];
// assemble the partial products (combine to get the 8-bit results;
// register widths follow the slides, so bit growth is ignored throughout)
assign XPower2 = (XPower2_ppAA << 8) + (2*XPower2_ppAB << 4)
+ XPower2_ppBB;
assign XPower = (XPower3_ppAA << 8) + (2*XPower3_ppAB << 4)
+ XPower3_ppBB;
Timing – Parallel Structures
always @(posedge clk) begin
// Pipeline stage 1
X1 <= X;
XPower1 <= X;
// Pipeline stage 2
X2 <= X1;
// create partial products (the 4-bit multiplies are handled in parallel)
XPower2_ppAA <= XPower1_A * X1_A;
XPower2_ppAB <= XPower1_A * X1_B;
XPower2_ppBB <= XPower1_B * X1_B;
// Pipeline stage 3
// create partial products
XPower3_ppAA <= XPower2_A * X2_A;
XPower3_ppAB <= XPower2_A * X2_B;
XPower3_ppBB <= XPower2_B * X2_B;
end
endmodule
Timing – Flatten Logic Structures
• Flattening logic structures is the third strategy for architectural
timing improvement
• This technique is closely related to the idea of parallel structures, but
applies specifically to logic that is chained due to priority encoding.
• Normally, synthesis and layout tools are not smart enough to break up
logic structures coded in a serial fashion
• They also do not have enough information relating to the priority
requirements of the design
• Therefore, they may be unable to produce a configuration that meets
our timing requirements
Timing – Flatten Logic Structures
• For example:
• Consider the following control signals coming from an address decode
that are used to write 4 registers:
module regwrite(
output reg [3:0] rout,
input clk, in,
input [3:0] ctrl); // control signals are decoded with priority relative to one another
always @(posedge clk)
if(ctrl[0]) rout[0] <= in;
else if(ctrl[1]) rout[1] <= in;
else if(ctrl[2]) rout[2] <= in;
else if(ctrl[3]) rout[3] <= in;
endmodule
Timing – Flatten Logic Structures
• The above type of priority encoding is implemented as follows:
Combinatorial logic gates are inserted due to the priority encoding, introducing delay
Figure 1.6: Priority encoding
Timing – Flatten Logic Structures
• Extra delay is required for the priority logic, so timing is not met
• The solution is to remove the priority encoding and thereby flatten the logic
module regwrite(
output reg [3:0] rout,
input clk, in,
input [3:0] ctrl); // priority is removed: the control signals act independently
always @(posedge clk) begin
if(ctrl[0]) rout[0] <= in;
if(ctrl[1]) rout[1] <= in;
if(ctrl[2]) rout[2] <= in;
if(ctrl[3]) rout[3] <= in;
end
endmodule
Timing – Flatten Logic Structures
Gates are removed: each control signal acts independently and controls its corresponding rout bit
Figure 1.7: No priority encoding
Timing – Flatten Logic Structures
• By removing priority encodings where they are not needed, the logic
structure is flattened and the path delay is reduced
Timing – Register Balancing
• Register balancing is the fourth strategy for architectural timing
improvement
• Idea of register balancing: redistribute logic evenly between
registers to minimize the worst-case delay between any two registers
• This technique should be used whenever:
• Logic is highly imbalanced between the critical path and an adjacent
path
• Although many synthesis tools offer a register-balancing
optimization, they can handle only simple cases in a predetermined
fashion
• Thus, designers must be able to redistribute logic themselves to
reduce the worst-case delay
Timing – Register Balancing
• Consider the following code for an adder that adds three 8-bit inputs:
module adder(
output reg [7:0] Sum,
input [7:0] A, B, C,
input clk);
reg [7:0] rA, rB, rC;
always @(posedge clk) begin
rA <= A;
rB <= B;
rC <= C;
Sum <= rA + rB + rC;
end
endmodule
Timing – Register Balancing
Both additions for Sum occur in the second register stage and determine the worst-case delay; there is no logic in the first register stage
Figure 1.8: Registered adder
Timing – Register Balancing
• Some of the logic in the critical path can be moved back a stage, thereby
balancing the logic load between two register stages.
module adder(
output reg [7:0] Sum,
input [7:0] A, B, C,
input clk);
reg [7:0] rABSum, rC;
always @(posedge clk) begin
rABSum <= A + B; // one add operation is moved back a stage
rC <= C;
Sum <= rABSum + rC;
end
endmodule
Timing – Register Balancing
One of the add operations is moved back a stage, balancing the logic between stages and reducing the delay in the critical path
Figure 1.9: Registers balanced
Timing – Reorder Paths
• Reordering the paths in the data flow to minimize the critical
path is the fifth strategy
• This technique should be used whenever:
• Multiple paths combine with the critical path
• The combined path can be reordered such that the critical path
becomes shorter
Timing – Reorder Paths
• Consider the following module:
module randomlogic(
output reg [7:0] Out,
input [7:0] A, B, C,
input clk,
input Cond1, Cond2);
always @(posedge clk)
if(Cond1)
Out <= A;
else if(Cond2 && (C < 8))
Out <= B;
else
Out <= C;
endmodule
Timing – Reorder Paths
Critical path is between C and Out: comparator in series with two gates
Figure 1.10: Long critical path
Timing – Reorder Paths
• We can modify the code to reorder the paths so that the comparator's long delay is no longer in series with both gates:
module randomlogic(
output reg [7:0] Out,
input [7:0] A, B, C,
input clk,
input Cond1, Cond2);
wire CondB = (Cond2 & !Cond1);
always @(posedge clk)
if(CondB && (C < 8))
Out <= B;
else if(Cond1)
Out <= A;
else
Out <= C;
endmodule
Timing – Reorder Paths
The gate is moved out of the critical path: shorter delay, so the timing requirement is met
Figure 1.11: Logic reordered to reduce critical path
5. Summary
Summary
• A high-throughput architecture: maximizes the number of
bits per second that can be processed by a design
• Unrolling an iterative loop increases throughput
• Penalty: a proportional increase in area
• A low-latency architecture: minimizes the delay from the input
of a module to the output
• Latency can be reduced by removing pipeline registers
• Penalty: an increase in combinatorial delay between registers
• Timing refers to the clock speed of a design
Summary
• A design meets timing when the maximum delay between any two
sequential elements is less than the minimum required clock period
• Timing can be improved by:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths