Contents
Timing optimization
Area optimization
Additional readings
Budapest University of Technology and Economics
RTL Optimization Techniques
Péter Horváth
Department of Electron Devices
August 7, 2014
Contents
timing optimization concepts and design techniques
  throughput, latency, local datapath delay
  loop unrolling, removing pipeline registers, register balancing
area optimization concepts and design techniques
  resource requirement metrics in standard cell ASIC and FPGA
  control-based logic reuse, priority encoders, considering technology primitives
additional readings
Timing optimization
Computation performance concepts
Three concepts describe computation performance.
throughput: The amount of data processed per unit of time (bits per
second, or bits per clock cycle).
latency: The time elapsed between data input and processed data
output (clock cycles).
local datapath delay: The delay of the logic between storage elements
(nanoseconds). It determines the maximum clock frequency.
Timing optimization techniques
High throughput: loop unrolling (pipeline)
In high-throughput optimization the time needed to process a single
data item is of secondary importance; the goal is to minimize the
time between two consecutive input reads.
Data item n+1 is read while data item n is still being processed.
architecture iterative of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (start = '1') then
        -- load the operand; two multiplications remain
        count <= 2;
        pow <= x;
      elsif (stop = '0') then
        -- one multiplication per clock cycle
        count <= count - 1;
        pow <= pow * x;
      end if;
    end if;
  end process;
  stop <= '1' when count = 0 else '0';
end architecture;
iterative: throughput: 8/3 = 2.7 bits/cycle; latency: 3 cycles

architecture pipelined of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 <= x;
      -- stage 2
      x2 <= x1;
      pow1 <= x1 * x1;
      -- stage 3
      pow <= pow1 * x2;
    end if;
  end process;
end architecture;
pipelined: throughput: 8/1 = 8 bits/cycle; latency: 3 cycles
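The pipelined fragment above omits the entity and the signal declarations. A minimal, self-contained sketch could look like the following. The generic WIDTH, the use of numeric_std unsigned operands and the full-precision 3*WIDTH output are assumptions made for illustration; they do not come from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pow3 is
  generic (WIDTH : positive := 8);  -- assumed operand width
  port (
    clk : in  std_logic;
    x   : in  unsigned(WIDTH-1 downto 0);
    pow : out unsigned(3*WIDTH-1 downto 0)  -- assumed full-precision result
  );
end entity;

architecture pipelined of pow3 is
  signal x1, x2 : unsigned(WIDTH-1 downto 0)   := (others => '0');
  signal pow1   : unsigned(2*WIDTH-1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: capture the input
      x1 <= x;
      -- stage 2: square the operand and delay it to stay aligned with pow1
      x2   <= x1;
      pow1 <= x1 * x1;
      -- stage 3: multiply the square by the delayed operand
      pow  <= pow1 * x2;
    end if;
  end process;
end architecture;
```

A new operand can be applied in every clock cycle; each result appears three clock edges after its operand was sampled.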
[Figure: block diagrams of the iterative and the pipelined pow3 datapaths with 32-bit x input and pow output. Iterative (register pow, start-controlled input selection): throughput: 8/3 = 2.7 bits/cycle; latency: 3 cycles. Pipelined (registers x1, x2, pow1, pow): throughput: 8/1 = 8 bits/cycle; latency: 3 cycles.]
Timing optimization techniques
Low latency: removing pipeline registers
The objective of the low-latency optimization is to pass the data
from the input to the output with minimal internal processing
delay.
A low-latency design uses parallelism and removes pipeline registers.
architecture async of pow3 is
begin
  process (x)
  begin
    x1 <= x;
  end process;
  process (x1)
  begin
    x2 <= x1;
    pow1 <= x1 * x1;
  end process;
  pow <= pow1 * x2;
end architecture;
async: latency: 1 cycle (with an additional output register)

architecture pipelined of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 <= x;
      -- stage 2
      x2 <= x1;
      pow1 <= x1 * x1;
      -- stage 3
      pow <= pow1 * x2;
    end if;
  end process;
end architecture;
pipelined: latency: 3 cycles
[Figure: block diagrams of the async and the pipelined pow3 datapaths with 32-bit x input and pow output. Async (single output register): latency: 1 cycle. Pipelined (registers x1, x2, pow1, pow): latency: 3 cycles.]
Timing optimization techniques
Minimizing logic delay: register layers
The logic between two sequential elements is called a local datapath.
The delay of the slowest local datapath determines the maximum
clock frequency.
The local datapath delay can be reduced by inserting additional
register layers.
architecture single_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 <= x;
        x2 <= x1;
        y <= A*x + B*x1 + C*x2;
      end if;
    end if;
  end process;
end architecture;

architecture multi_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 <= x;
        x2 <= x1;
        -- additional register layer after the multipliers
        prod1 <= A * x;
        prod2 <= B * x1;
        prod3 <= C * x2;
      end if;
    end if;
  end process;
  y <= prod1 + prod2 + prod3;
end architecture;
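In the multi_cycle fragment the final sum is combinational, so the adder delay becomes part of whatever local datapath follows the filter. One possible variant, sketched below, also registers the sum. The entity declaration, the generic WIDTH, the signed operand types, the result width and the architecture name registered_output are assumptions for illustration and are not taken from the slides.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fir is
  generic (WIDTH : positive := 16);  -- assumed sample and coefficient width
  port (
    clk, valid : in  std_logic;
    x, A, B, C : in  signed(WIDTH-1 downto 0);
    y          : out signed(2*WIDTH+1 downto 0)  -- two guard bits for the sum
  );
end entity;

architecture registered_output of fir is
  signal x1, x2              : signed(WIDTH-1 downto 0)   := (others => '0');
  signal prod1, prod2, prod3 : signed(2*WIDTH-1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if valid = '1' then
        -- delay line
        x1 <= x;
        x2 <= x1;
        -- register layer 1: multiplier outputs
        prod1 <= A * x;
        prod2 <= B * x1;
        prod3 <= C * x2;
        -- register layer 2: sum of the registered products
        y <= resize(prod1, y'length) + resize(prod2, y'length)
             + resize(prod3, y'length);
      end if;
    end if;
  end process;
end architecture;
```

The extra register costs one more cycle of latency; in exchange, every local datapath contains either a multiplier or the additions, never both.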
[Figure: block diagrams of the single_cycle and the multi_cycle FIR datapaths with 32-bit inputs x, A, B, C and output y. single_cycle (registers x1, x2, y): local datapaths: 1 adder and 1 multiplier. multi_cycle (registers x1, x2, prod1, prod2, prod3): local datapaths: 1 adder or 1 multiplier.]
Timing optimization techniques
Minimizing logic delay: register balancing
During register balancing the logic between registers is redistributed
in order to minimize the worst-case delay between any pair of registers.
architecture not_balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      reg_a <= in_a;
      reg_b <= in_b;
      reg_c <= in_c;
      -- two adders between the input registers and sum
      sum <= reg_a + reg_b + reg_c;
    end if;
  end process;
end architecture;

architecture balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- the first addition is moved in front of the first register layer
      reg_ab_sum <= in_a + in_b;
      reg_c <= in_c;
      sum <= reg_ab_sum + reg_c;
    end if;
  end process;
end architecture;
[Figure: block diagrams of the not_balanced and the balanced add3 datapaths with 32-bit inputs in_a, in_b, in_c and output sum. not_balanced (registers reg_a, reg_b, reg_c, sum): local datapaths: 2 adders. balanced (registers reg_ab_sum, reg_c, sum): local datapaths: 1 adder.]
Area optimization
Area concepts
The resource requirement is the number of basic functional
primitives required to implement the described functionality.
The basic functional primitives in standard cell ASICs are the
standard cells, which can be simple logic gates and flip-flops, but also
more complex arithmetic-logic functions or memories.
The basic logic element (BLE) of an FPGA consists of a logic
function (the number of inputs depends on the vendor and the
device family), a flip-flop and a multiplexer. There are special-purpose
resources as well, such as memory blocks and signal processing
elements (multipliers).
Area optimization techniques
Minimizing area: control-based logic reuse
Control-based logic reuse can be considered the opposite of loop
unrolling. A pipeline requires internal data storage resources and
additional logic to implement parallel operation. With logic reuse
the same resources are shared over several clock cycles at the cost
of reduced throughput.
[Figure: two 32-bit datapaths accumulating the inputs in1, in2, in3 and in4 into acc. Labels of the first version: plr1, plr2, acc, reset, clk. Labels of the second version: FSM, sel_input, zero, ce_acc, ce, a single adder, acc, reset, clk.]
Control-based logic reuse requires an FSM to generate control signals.
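A minimal sketch of control-based logic reuse is given below: four inputs are summed with a single adder over four clock cycles, and a small counter acts as the FSM that generates the multiplexer select and the accumulator "zero" control. The entity name add4, the done flag, the synchronous reset and the internal signal names are hypothetical; only in1..in4, acc, clk and reset are taken from the figure labels.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add4 is  -- hypothetical entity name
  port (
    clk, reset         : in  std_logic;
    in1, in2, in3, in4 : in  unsigned(31 downto 0);
    acc                : out unsigned(31 downto 0);
    done               : out std_logic  -- '1' while acc holds a complete sum
  );
end entity;

architecture reuse of add4 is
  signal sel     : natural range 0 to 3 := 0;  -- FSM state = input select
  signal acc_reg : unsigned(31 downto 0) := (others => '0');
  signal operand : unsigned(31 downto 0);
  signal feedback : unsigned(31 downto 0);
begin
  -- input multiplexer controlled by the FSM state
  with sel select operand <=
    in1 when 0,
    in2 when 1,
    in3 when 2,
    in4 when others;

  -- "zero" control: start a new sum in state 0
  feedback <= (others => '0') when sel = 0 else acc_reg;

  process (clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if reset = '1' then
        sel     <= 0;
        acc_reg <= (others => '0');
      else
        acc_reg <= feedback + operand;  -- the single shared adder
        if sel = 3 then
          sel  <= 0;
          done <= '1';                  -- in4 has just been added
        else
          sel <= sel + 1;
        end if;
      end if;
    end if;
  end process;

  acc <= acc_reg;
end architecture;
```

Compared with a fully parallel or pipelined adder structure, this version needs one adder instead of three, but it delivers one result every four clock cycles.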
Area optimization techniques
Minimizing area: priority encoders
The resource requirement can be reduced if the mutual exclusion of
the conditions is exploited. The elsif statement should be used only if
a priority encoder is required and the conditions are not mutually exclusive.
architecture not_priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- independent conditions: no priority logic is generated
      if (ctrl(0) = '1') then
        output(0) <= input;
      end if;
      if (ctrl(1) = '1') then
        output(1) <= input;
      end if;
      if (ctrl(2) = '1') then
        output(2) <= input;
      end if;
      if (ctrl(3) = '1') then
        output(3) <= input;
      end if;
    end if;
  end process;
end architecture;

architecture priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- elsif chain: each branch depends on all previous conditions
      if (ctrl(0) = '1') then
        output(0) <= input;
      elsif (ctrl(1) = '1') then
        output(1) <= input;
      elsif (ctrl(2) = '1') then
        output(2) <= input;
      elsif (ctrl(3) = '1') then
        output(3) <= input;
      end if;
    end if;
  end process;
end architecture;
[Figure: load-enable logic of the 32-bit registers output_a, output_b, output_c and output_d, driven by the 4-bit ctrl signal and the 32-bit input. Without exploiting mutual exclusion: the enables use ctrl[0]; ctrl[0], ctrl[1]; ctrl[0], ctrl[1], ctrl[2]; and ctrl[0] to ctrl[3]. With exploiting mutual exclusion: each enable uses a single ctrl bit.]
Area optimization techniques
Minimizing area: considering technology primitives
With an appropriate HDL coding style, more efficient logic
synthesis can be achieved. Synthesis tool vendors usually
provide coding guidelines that improve the resource
requirement or the timing parameters of the design. The recommended
coding style takes the unique characteristics of the technology
primitives into consideration.
utilizing block RAM modules in FPGAs: The memory array of a block
RAM typically cannot be reset, and its outputs are synchronous to a
clock signal. Only HDL models with these properties can be
implemented in block RAMs.
utilizing DSP units: The DSP slices in FPGAs have synchronous
outputs. This restriction has to be taken into account
when writing the HDL model.
architecture FFS of RAM is
begin
  -- reset is asynchronous, so it must appear in the sensitivity list
  process (clk, reset)
  begin
    if (reset = '1') then
      content <= (others=>(others=>'0'));
    elsif (rising_edge(clk)) then
      if (write = '1') then
        content(address) <= data_in;
      end if;
    end if;
  end process;
  -- asynchronous read
  data_out <= content(address);
end architecture;
Because of the asynchronous output this model cannot be
implemented in block RAM. The reset function hinders the
LUT implementation as well.

architecture BRAM of RAM is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (write = '1') then
        content(address) <= data_in;
      end if;
      -- synchronous read
      data_out <= content(address);
    end if;
  end process;
end architecture;
This model can be implemented as flip-flops, LUT RAM and
block RAM as well.
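The DSP restriction can be illustrated in the same way. The following multiply-accumulate model is only a sketch: it registers the operands, the product and the accumulator, so that every arithmetic result is synchronous. The entity name mac, the clr input and the 18-bit and 48-bit widths are assumptions chosen for illustration; whether a given synthesis tool maps the model onto a DSP slice depends on the tool and on the device family.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mac is  -- hypothetical entity name
  port (
    clk  : in  std_logic;
    clr  : in  std_logic;             -- synchronous accumulator clear
    a, b : in  signed(17 downto 0);   -- assumed operand width
    p    : out signed(47 downto 0)    -- assumed accumulator width
  );
end entity;

architecture dsp_friendly of mac is
  signal a_reg, b_reg : signed(17 downto 0) := (others => '0');
  signal m_reg        : signed(35 downto 0) := (others => '0');
  signal p_reg        : signed(47 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- input registers
      a_reg <= a;
      b_reg <= b;
      -- registered multiplier output
      m_reg <= a_reg * b_reg;
      -- registered accumulator with synchronous clear
      if clr = '1' then
        p_reg <= (others => '0');
      else
        p_reg <= p_reg + resize(m_reg, p_reg'length);
      end if;
    end if;
  end process;

  p <= p_reg;
end architecture;
```

All control is synchronous, in line with the restriction stated for the DSP units; an asynchronous reset on these registers would be the kind of construct that the vendor guidelines usually advise against.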
Additional readings
Steve Kilts: Advanced FPGA Design: Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris: Digital Design and Computer Architecture
Peter J. Ashenden: Digital Design: An Embedded Systems Approach Using VHDL
M. Morris Mano, Charles R. Kime: Logic and Computer Design Fundamentals
Pong P. Chu: RTL Hardware Design Using VHDL
Peter Wilson: Design Recipes for FPGAs