WiCHIP Technologies
Engineering your ideas
Advanced FPGA Design
Architecting Speed
Vu-Duc Ngo
[email protected]Contents
• 1. Introduction to Architecting Speed
• 2. High Throughput
• 3. Low Latency
• 4. Timing
• 5. Summary
1. Introduction to Architecting Speed
Introduction to Architecting Speed
• When dealing with digital designs, optimizations are required
to meet design constraints.
• One of the three primary physical characteristics of a digital design
is speed
• Three primary definitions of speed:
• Throughput: the amount of data processed per clock cycle (e.g., bits per second)
• Latency: the time between a data input and its corresponding processed output
(measured in time or clock cycles)
• Timing: the logic delays between sequential elements (clock period,
frequency)
Introduction to Architecting Speed
• To improve speed, we need different architectures:
• High-throughput architectures: maximizing number of processed
bits/second
• Low-latency architectures: minimizing the delay from input to output
• Timing optimizations: reducing combinatorial delay of the critical
path
• Techniques for timing optimization:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths
2. High Throughput
High Throughput
• High-throughput design is concerned:
• More with the steady-state data rate
• Less with latency (i.e., the time a specific piece of data takes to
propagate through the design)
• High-throughput design typically employs a pipelined architecture
• New data can begin processing before the prior data has finished,
which improves system throughput
• Let us consider some examples to illustrate the beauty of
pipeline architecture
High Throughput
• Consider the following piece of code:
XPower = 1;
for (i=0; i < 3; i++)
XPower = X * XPower;
• This code is an iterative algorithm:
• The same variables and addresses are accessed repeatedly until the
computation completes
• We can also implement the above code in hardware using
Verilog as follows:
High Throughput
module power3(
  output reg [7:0] XPower,
  output finished,
  input [7:0] X,
  input clk, start); // start is asserted for a single clock
  reg [7:0] ncount;  // remaining multiply iterations
  assign finished = (ncount == 0);
  always @(posedge clk)
    if(start) begin
      XPower <= X;   // load the operand
      ncount <= 2;   // two more multiplies give X**3
    end
    else if(!finished) begin
      ncount <= ncount - 1;
      XPower <= XPower * X; // the same multiplier is reused each cycle
    end
endmodule
High Throughput
The same register and computational resources are reused until the completion of the computation
Figure 1.1: Iterative Implementation
• No new computations can begin until the previous computation has
completed
High Throughput
• Handshaking signals are required: start, finished
• Performance of the above Verilog implementation:
• Throughput = 8 bits / 3 clocks ≈ 2.7 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path
• Let us have a look at a pipelined version of the same algorithm
High Throughput
module power3(
output reg [7:0] XPower,
input clk,
input [7:0] X
);
reg [7:0] XPower1, XPower2;
reg [7:0] X1, X2;
always @(posedge clk) begin
// Pipeline stage 1
X1 <= X;
XPower1 <= X;
// Pipeline stage 2
X2 <= X1;
XPower2 <= XPower1 * X1;
// Pipeline stage 3
XPower <= XPower2 * X2;
end
endmodule
High Throughput
• Pipelined implementation can be described by the following figure:
One multiplier delay in the critical path
1st clock: compute X²
2nd clock: compute X*X²; compute the next X²
3rd clock: output X³; compute the next X*X²; compute the next-next X²
Figure 1.2: Pipelined Implementation
High Throughput
• Performance of the pipelined implementation:
• Throughput = 8 bits / 1 clock = 8 bits/clock
• Latency = 3 clocks
• Timing = One multiplier delay in the critical path
• Clearly, this implementation has improved throughput by a factor of 3;
the simulation sketch below illustrates the one-result-per-clock behavior
• Unrolling an iterative loop increases throughput
• In general, unrolling an iterative loop with n iterations into a
pipelined implementation increases throughput by a factor of n
• Penalty: an increase in area (more registers and more multipliers)
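As an illustration (not part of the original slides), the following minimal testbench sketch assumes the pipelined power3 above is the module compiled and drives it with a new operand on every clock. After the initial three-clock latency, a new (truncated, 8-bit) cube appears on every clock edge, matching the 8 bits/clock throughput figure. The instance and testbench names are hypothetical.
module tb_power3_pipelined;
  reg        clk = 0;
  reg  [7:0] X   = 0;
  wire [7:0] XPower;
  // hypothetical instance; ports match the pipelined power3 above
  power3 dut (.XPower(XPower), .clk(clk), .X(X));
  always #5 clk = ~clk;          // free-running clock, period of 10 time units
  always @(posedge clk) begin
    X <= X + 1;                  // feed a new operand every clock
    $display("t=%0t X=%0d XPower=%0d", $time, X, XPower);
  end
  initial #120 $finish;
endmodule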
3. Low Latency
Low latency
• Low-latency design: passes data from the input to the output as quickly
as possible
• To achieve this goal, intermediate processing delays must be minimized
• Often, a low-latency design requires:
• Parallelism
• Removal of pipelining
• Logical shortcuts
• Observing the previous pipelined implementation, we can see the
possibility of reducing latency:
• At each pipeline stage, the product of each multiply must wait for the
next clock edge to propagate to the next stage
• Therefore, we expect that removing the pipeline registers will reduce
latency
Low latency
module power3(
output [7:0] XPower,
input [7:0] X
);
reg [7:0] XPower1, XPower2;
reg [7:0] X1, X2;
assign XPower = XPower2 * X2;
always @* begin
X1 = X;
XPower1 = X;
end
always @* begin
X2 = X1;
XPower2 = XPower1*X1;
end
endmodule
Low latency
• Low-latency implementation is illustrated as follows:
Two multiplier delays in the critical path
Removal of pipelining: the registers no longer exist
Figure 1.3: Low-latency Implementation
Low latency
• Performance of this low-latency design:
• Throughput = 8 bits/clock (assuming one new input per clock)
• Latency = Between one and two multiplier delays, 0 clocks
• Timing = Two multiplier delays in the critical path
• Latency can be reduced by removing pipeline registers, as the simulation
sketch below illustrates
• Penalty: an increase in combinatorial delay between registers
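As an illustration (not part of the original slides), a minimal testbench sketch assuming the combinational power3 above is the module compiled: since there is no clock, the output settles after only the two multiplier delays, i.e., with zero clock cycles of latency. The instance and testbench names are hypothetical.
module tb_power3_comb;
  reg  [7:0] X;
  wire [7:0] XPower;
  // hypothetical instance; ports match the low-latency power3 above
  power3 dut (.XPower(XPower), .X(X));
  initial begin
    X = 8'd3; #1 $display("X=%0d XPower=%0d", X, XPower); // expect 27
    X = 8'd5; #1 $display("X=%0d XPower=%0d", X, XPower); // expect 125
    $finish;
  end
endmodule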
4. Timing
Timing
• Timing refers to the clock speed of a design
• The maximum delay between any two sequential elements in a design
determines the maximum clock speed.
• Maximum speed, or maximum frequency, is defined as follows:
Fmax = 1 / (Tclk-q + Tlogic + Trouting + Tsetup - Tskew)
• where:
• Fmax: maximum allowable clock frequency
• Tclk-q: time from clock arrival until data arrives at Q
• Tlogic: propagation delay through the logic between flip-flops
• Trouting: routing delay between flip-flops
• Tsetup: setup time of the capture flip-flop
• Tskew: propagation delay (skew) of the clock between the two flip-flops
• A worked example with hypothetical delay values is shown below
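As a purely illustrative example with hypothetical delay values (not taken from the slides): suppose Tclk-q = 0.5 ns, Tlogic = 2.0 ns, Trouting = 1.0 ns, Tsetup = 0.5 ns, and Tskew = 0 ns. Then:
Fmax = 1 / (0.5 + 2.0 + 1.0 + 0.5 - 0) ns = 1 / 4.0 ns = 250 MHz
Halving Tlogic to 1.0 ns (for example, by adding a register layer, as discussed next) would raise Fmax to 1 / 3.0 ns ≈ 333 MHz.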
Timing – Add Register Layers
• Adding intermediate layers of registers to the critical path is the
first strategy for architectural timing improvement
• This technique should be used in highly pipelined designs, where:
• An additional clock cycle latency does not violate the design
specifications
• The overall functionality will not be affected by the further addition of
registers
• For example: assume the architecture for the following FIR
implementation does not meet timing:
Timing – Add Register Layers
module fir(
  output reg [7:0] Y,
  input [7:0] A, B, C, X,
  input clk,
  input validsample);
  reg [7:0] X1, X2; // delayed input samples
  always @(posedge clk)
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      Y <= A*X + B*X1 + C*X2; // all multiplies and adds in one clock cycle
    end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:
One adder delay + one multiplier delay in the critical path is greater than the minimum clock period requirement, i.e., timing is not met
Figure 1.4: MAC with long path
• In this implementation, all multiply/add operations occur in one clock
cycle
Timing – Add Register Layers
• To resolve the issue, we can further pipeline this design
• By adding intermediate registers on the multiplier outputs
• Under the assumption that the latency requirement is not fixed at one clock cycle
• We just have to add one pipeline layer between the multipliers and the
adder
Timing – Add Register Layers
module fir(
  output reg [7:0] Y,
  input [7:0] A, B, C, X,
  input clk,
  input validsample);
  reg [7:0] X1, X2;
  reg [7:0] prod1, prod2, prod3; // pipeline registers added for the products
  always @(posedge clk) begin
    if(validsample) begin
      X1 <= X;
      X2 <= X1;
      prod1 <= A * X;
      prod2 <= B * X1;
      prod3 <= C * X2;
    end
    Y <= prod1 + prod2 + prod3;
  end
endmodule
Timing – Add Register Layers
• The above implementation is described by the following figure:
Intermediate registers added between the multipliers and the adder
One multiplier delay or one adder delay (not both) in the critical path: the shorter delay meets the timing requirement
Figure 1.5: Pipeline registers added
Timing – Add Register Layers
• Adding register layers improves timing by dividing the critical path into
two paths of smaller delay.
• Penalty: increase in latency and area
Timing – Parallel Structures
• Reorganizing the critical path such that logic structures are
implemented in parallel is the second strategy for architectural
timing improvement
• This technique should be used whenever:
• A function currently evaluates through a serial string of logic
• But it can be broken up and evaluated in parallel
• For instance, assume that the standard pipelined power-of-3 design
discussed above does not meet timing
• To create parallel structures:
• Break up the multipliers into independent operations
• Recombine them to get results
Timing – Parallel Structures
• For example:
• To compute the square of an 8-bit binary number X, i.e., X², we first
represent:
X = {A, B}
where A and B are the most and the least significant nibbles, respectively
• Then, the multiply operation can be reorganized as follows:
X * X = {A, B} * {A, B} = (A*A << 8) + (2*A*B << 4) + (B*B)
• By doing this, we reduce the problem to a series of 4-bit
multiplications that can be handled in parallel
• Recombining the partial products gives the final result of the 8-bit
multiplication; a quick numeric check is shown below
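As a quick numeric check with an arbitrarily chosen value (not from the slides), let X = 53 = 8'h35, so A = 3 and B = 5:
A*A = 9, A*B = 15, B*B = 25
X * X = (9 << 8) + (2*15 << 4) + 25 = 2304 + 480 + 25 = 2809 = 53²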
Timing – Parallel Structures
• Applying this principle, we can re-implement the power-of-3 design as
follows:
module power3(
output [7:0] XPower,
input [7:0] X,
input clk);
reg [7:0] XPower1;
// partial product registers
reg [3:0] XPower2_ppAA, XPower2_ppAB, XPower2_ppBB;
reg [3:0] XPower3_ppAA, XPower3_ppAB, XPower3_ppBB;
reg [7:0] X1, X2;
wire [7:0] XPower2;
Timing – Parallel Structures
// nibbles for partial products (A: MS nibble, B: LS nibble)
wire [3:0] XPower1_A = XPower1[7:4];
wire [3:0] XPower1_B = XPower1[3:0];
wire [3:0] X1_A = X1[7:4];
wire [3:0] X1_B = X1[3:0];
wire [3:0] XPower2_A = XPower2[7:4];
wire [3:0] XPower2_B = XPower2[3:0];
wire [3:0] X2_A = X2[7:4];
wire [3:0] X2_B = X2[3:0];
// assemble the partial products (combine to get the 8-bit results;
// register widths follow the slides, so bit growth is ignored throughout)
assign XPower2 = (XPower2_ppAA << 8) + (2*XPower2_ppAB << 4)
+ XPower2_ppBB;
assign XPower = (XPower3_ppAA << 8) + (2*XPower3_ppAB << 4)
+ XPower3_ppBB;
Timing – Parallel Structures
always @(posedge clk) begin
// Pipeline stage 1
X1 <= X;
XPower1 <= X;
// Pipeline stage 2
X2 <= X1;
// create partial products (the 4-bit multiplies are handled in parallel)
XPower2_ppAA <= XPower1_A * X1_A;
XPower2_ppAB <= XPower1_A * X1_B;
XPower2_ppBB <= XPower1_B * X1_B;
// Pipeline stage 3
// create partial products
XPower3_ppAA <= XPower2_A * X2_A;
XPower3_ppAB <= XPower2_A * X2_B;
XPower3_ppBB <= XPower2_B * X2_B;
end
endmodule
Timing – Flatten Logic Structures
• Flattening logic structures is the third strategy for architectural
timing improvement
• This technique is closely related to the idea of parallel structures, but
applies specifically to logic that is chained due to priority encoding.
• Normally, synthesis and layout tools are not smart enough to break up
logic structures coded in a serial fashion
• They also do not have enough information relating to the priority
requirements of the design
• Therefore, they may be unable to produce a configuration that meets
our timing requirements
Timing – Flatten Logic Structures
• For example:
• Consider the following control signals coming from an address decode
that are used to write 4 registers:
module regwrite(
output reg [3:0] rout,
input clk, in,
input [3:0] ctrl); // control signals are decoded with priority relative to one another
always @(posedge clk)
if(ctrl[0]) rout[0] <= in;
else if(ctrl[1]) rout[1] <= in;
else if(ctrl[2]) rout[2] <= in;
else if(ctrl[3]) rout[3] <= in;
endmodule
Timing – Flatten Logic Structures
• The above type of priority encoding is implemented as follows:
Combinatorial logic gates are inserted due to the priority encoding, introducing delay
Figure 1.6: Priority encoding
Timing – Flatten Logic Structures
• Extra delay is required for the priority logic, so timing is not met
• The solution is to remove the priority encoding and thereby flatten the logic
module regwrite(
output reg [3:0] rout,
input clk, in,
input [3:0] ctrl); // priority is removed: the control signals act independently
always @(posedge clk) begin
if(ctrl[0]) rout[0] <= in;
if(ctrl[1]) rout[1] <= in;
if(ctrl[2]) rout[2] <= in;
if(ctrl[3]) rout[3] <= in;
end
endmodule
Timing – Flatten Logic Structures
Gates are removed: each control signal acts independently and controls its corresponding rout bit
Figure 1.7: No priority encoding
Timing – Flatten Logic Structures
• By removing priority encodings where they are not needed, the logic
structure is flattened and the path delay is reduced
Timing – Register Balancing
• Register balancing is the fourth strategy for architectural timing
improvement
• Idea of register balancing: redistribute logic evenly between
registers to minimize the worst-case delay between any two registers
• This technique should be used whenever:
• Logic is highly imbalanced between the critical path and an adjacent
path
• Although many synthesis tools offer a register-balancing
optimization, they can handle only simple cases in a predetermined
fashion
• Thus, designers must be able to redistribute logic themselves to
reduce the worst-case delay
Timing – Register Balancing
• Consider the following code for an adder that adds three 8-bit inputs:
module adder(
output reg [7:0] Sum,
input [7:0] A, B, C,
input clk);
reg [7:0] rA, rB, rC;
always @(posedge clk) begin
rA <= A;
rB <= B;
rC <= C;
Sum <= rA + rB + rC;
end
endmodule
Timing – Register Balancing
Both additions for Sum occur in the second register stage and determine the worst-case delay; there is no logic in the first register stage
Figure 1.8: Registered adder
Timing – Register Balancing
• Some of the logic in the critical path can be moved back a stage, thereby
balancing the logic load between two register stages.
module adder(
output reg [7:0] Sum,
input [7:0] A, B, C,
input clk);
reg [7:0] rABSum, rC;
always @(posedge clk) begin
rABSum <= A + B; // one add operation is moved back a stage
rC <= C;
Sum <= rABSum + rC;
end
endmodule
Timing – Register Balancing
One of the add operations is moved back a stage, balancing the logic between stages and reducing the delay in the critical path
Figure 1.9: Registers balanced
Timing – Reorder Paths
• Reordering the paths in the data flow to minimize the critical
path is the fifth strategy
• This technique should be used whenever:
• Multiple paths combine with the critical path
• The combined path can be reordered such that the critical path
becomes shorter
Timing – Reorder Paths
• Consider the following module:
module randomlogic(
output reg [7:0] Out,
input [7:0] A, B, C,
input clk,
input Cond1, Cond2);
always @(posedge clk)
if(Cond1)
Out <= A;
else if(Cond2 && (C < 8))
Out <= B;
else
Out <= C;
endmodule
Timing – Reorder Paths
Critical path is between C and Out: comparator in series with two gates
Figure 1.10: Long critical path
Timing – Reorder Paths
• We can modify the code to reorder the paths so that the comparator's long delay is no longer in series with both gates:
module randomlogic(
output reg [7:0] Out,
input [7:0] A, B, C,
input clk,
input Cond1, Cond2);
wire CondB = (Cond2 & !Cond1);
always @(posedge clk)
if(CondB && (C < 8))
Out <= B;
else if(Cond1)
Out <= A;
else
Out <= C;
endmodule
Timing – Reorder Paths
The gate is moved out of the critical path: shorter delay, so the timing requirement is met
Figure 1.11: Logic reordered to reduce critical path
5. Summary
Summary
• A high-throughput architecture: maximizes the number of
bits per second that can be processed by a design
• Unrolling an iterative loop increases throughput
• Penalty: a proportional increase in area
• A low-latency architecture: minimizes the delay from the input
of a module to the output
• Latency can be reduced by removing pipeline registers
• Penalty: an increase in combinatorial delay between registers
• Timing refers to the clock speed of a design
Summary
• A design meets timing when the maximum delay between any two
sequential elements is less than the minimum required clock period
• Timing can be improved by:
• Adding register layers
• Using parallel structures
• Flattening logic structures
• Balancing registers
• Reordering paths