Lecture7 Fall 21

The document discusses synthesis of Verilog code, focusing on for loops, generate statements, and use of x. It states that for loops with a fixed length can be synthesized by unrolling the loop. Generate statements allow conditional instantiation at compile time. While x can be used as a "don't care" in case statements in Verilog, it cannot represent an actual hardware value since it is not synthesizable.

Uploaded by

ifire

ET5080E

Digital Design Using Verilog HDL
Fall '21
For Loops & Synthesis
Generate Statements
Use of X in Synthesis
Synthesis Pitfalls
Coding for Synthesis
For Loops & Synthesis
■ Can a For Loop be synthesized?
• Yes, if it is fixed length
• The loop is "unrolled"
reg [15:0] countmem [0:7];
integer x;
always @(posedge clk) begin
  for (x = 0; x < 8; x = x + 1) begin
    countmem[x] <= countmem[x] + 1;
  end
end
How do you think this code would synthesize?

2
For Loops & Synthesis

[Figure: eight 16-bit registers, countmem[0] through countmem[7], each clocked by clk and fed back through its own +1 incrementer]

■ These loops are unrolled when synthesized
• That's why they must be fixed in length!
• Loop index is type integer but it is not actually synthesized
• Example creates eight 16-bit incrementers.
■ What if loop upper limit was a parameter?
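A parameter is a constant by the time the design is elaborated, so a loop bounded by one is still fixed in length and can still be unrolled. The sketch below illustrates this under assumptions: the module name `count_bank`, the parameter `DEPTH`, the synchronous reset, and the `first` output are all hypothetical additions, not part of the lecture.

```verilog
// Hypothetical wrapper around the slide's countmem loop.
// DEPTH, rst_n and 'first' are illustrative assumptions.
module count_bank #(parameter DEPTH = 8) (
  input  wire        clk,
  input  wire        rst_n,
  output wire [15:0] first    // one entry exposed so the logic is observable
);
  reg [15:0] countmem [0:DEPTH-1];
  integer x;                  // loop index only; no hardware is built for it

  always @(posedge clk) begin
    // DEPTH is known at elaboration, so this unrolls into DEPTH incrementers
    for (x = 0; x < DEPTH; x = x + 1) begin
      if (!rst_n) countmem[x] <= 16'h0000;
      else        countmem[x] <= countmem[x] + 1'b1;
    end
  end

  assign first = countmem[0];
endmodule
```

Instantiating it as `count_bank #(.DEPTH(4)) u0 (...)` would give four incrementers instead of eight.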
3
Unnecessary Calculations
■ Expressions that are fixed in a for loop are replicated due to "loop unrolling."
■ Solution: Move fixed (unchanging) expressions outside of all loops.
for (x = 0; x < 8; x = x + 1) begin
  for (y = 0; y < 8; y = y + 1) begin
    index = x*8 + y;
    value = (a + b)*c;
    mem[index] = value;
  end
end
This is just basic common sense, and applies to any language (in a programming language you would be wasting time, not hardware). Yet this is a common mistake.
■ Which expressions should be moved?
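One possible answer, as a sketch: `(a + b)*c` depends on neither x nor y, so it can be computed once outside both loops, while `index` depends on both loop variables and has to stay inside. The module wrapper, the port widths, the clocked write, and the `probe` output are assumptions made only to keep the example self-contained.

```verilog
// Illustrative only: fill_mem, the widths and 'probe' are hypothetical.
module fill_mem (
  input  wire        clk,
  input  wire [7:0]  a, b, c,
  output wire [15:0] probe
);
  reg  [15:0] mem [0:63];
  integer x, y;

  // Loop-invariant expression hoisted out: one adder and one multiplier
  wire [15:0] value = (a + b) * c;

  always @(posedge clk) begin
    for (x = 0; x < 8; x = x + 1)
      for (y = 0; y < 8; y = y + 1)
        mem[x*8 + y] <= value;   // index still depends on x and y
  end

  assign probe = mem[0];
endmodule
```

A synthesis tool may well share the repeated expression on its own, but writing it once makes the intent explicit.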


4
More on Loops & Synthesis
■ A loop is static (data-independent) if the number of iterations is fixed at compile time
■ Loop Types
• Static without internal timing control
  ❑ Combinational logic
• Static with internal timing control (e.g. @(posedge clk))
  ❑ Sequential logic
• Non-static without internal timing control
  ❑ Not synthesizable
• Non-static with internal timing control (e.g. @(posedge clk))
  ❑ Sometimes synthesizable, Sequential logic
5
Static Loops w/o Internal Timing
■ Combinational logic results from "loop unrolling"
■ Example
always @(a) begin
  andval[0] = 1;
  for (i = 0; i < 4; i = i + 1)
    andval[i + 1] = andval[i] & a[i];
end
■ What would this look like?
■ For registered outputs:
• Replace 'a' in the sensitivity list with 'posedge clk'
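To make the unrolling concrete, here is a sketch of what the four loop iterations expand to: a chain of four AND gates. The module name `and_chain` and the port declarations are assumptions; the statements inside are the loop body written out for i = 0 to 3.

```verilog
// Hand-unrolled equivalent of the loop; and_chain is a hypothetical name.
module and_chain (
  input  wire [3:0] a,
  output reg  [4:0] andval
);
  always @(*) begin
    andval[0] = 1'b1;
    andval[1] = andval[0] & a[0];   // i = 0
    andval[2] = andval[1] & a[1];   // i = 1
    andval[3] = andval[2] & a[2];   // i = 2
    andval[4] = andval[3] & a[3];   // i = 3
  end
endmodule
```

Each `andval[i + 1]` is the AND of all lower bits of `a`, so `andval[4]` equals `&a`.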
6
Static Loops with Internal Timing
■ If a static loop contains an internal edge-sensitive event control expression, then activity is distributed over multiple cycles of the clock
always begin
  for (i = 0; i < 4; i = i + 1)
    @(posedge clk) sum <= sum + i;
end
■ What does this loop do?
■ Does it synthesize?...Yes, but…
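The loop adds 0, 1, 2, 3 to `sum` on four successive clock edges and then starts over. A sketch of the same behavior with the loop index written as an explicit 2-bit counter is shown below; the module name, the reset, and the 8-bit width of `sum` are assumptions, since the slide does not declare them.

```verilog
// Explicit-counter version of the timed loop; names and widths are assumed.
module sum_loop (
  input  wire       clk,
  input  wire       rst_n,
  output reg  [7:0] sum
);
  reg [1:0] i;   // stands in for the loop index, wraps 3 -> 0

  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      i   <= 2'd0;
      sum <= 8'd0;
    end else begin
      sum <= sum + i;      // one "iteration" per clock edge
      i   <= i + 2'd1;
    end
endmodule
```

Written this way there is no doubt about what hardware is implied: a 2-bit counter, an adder, and a register.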
7
Non-Static Loops w/o Internal Timing
■ Number of iterations is variable
• Not known at compile time
■ Can be simulated, but not synthesized!
■ Essentially an iterative combinational circuit of data-dependent size!
always @(a, n) begin
  andval[0] = 1;
  for (i = 0; i < n; i = i + 1)
    andval[i + 1] = andval[i] & a[i];
end
■ What if n is a parameter?
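If n is a parameter, the bound is a constant at elaboration and the loop becomes static again. When n really must be a run-time signal, one common workaround (not from the lecture) is to loop over a fixed maximum and guard the body with a comparison. The sketch below assumes a hypothetical bound `MAX`, a 3-bit `n`, and a pass-through for iterations at or beyond n so that every bit is always assigned.

```verilog
// Workaround sketch: static bound plus a data-dependent guard.
// MAX, the width of n, and the pass-through branch are assumptions.
module and_prefix_var #(parameter MAX = 4) (
  input  wire [MAX-1:0] a,
  input  wire [2:0]     n,        // run-time count, expected in 0..MAX
  output reg  [MAX:0]   andval
);
  integer i;
  always @(*) begin
    andval[0] = 1'b1;
    for (i = 0; i < MAX; i = i + 1)            // fixed bound, can be unrolled
      if (i < n) andval[i + 1] = andval[i] & a[i];
      else       andval[i + 1] = andval[i];    // carry result forward, no latch
  end
endmodule
```

The cost is hardware for all MAX iterations plus the comparators, regardless of the value of n at run time.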

8
Non-Static Loops with Internal Timing
■ Number of iterations determined by
• Variable modified within the loop
• Variable that can't be determined at compile time
■ Due to internal timing control:
• Distributed over multiple cycles
• Number of cycles determined by variable above
■ Variable must still be bounded
Can this be synthesized?
always begin
  continue = 1'b1;
  for (; continue; ) begin
    @(posedge clk) sum = sum + in;
    if (sum > 8'd42) continue = 1'b0;
  end
end
What does it synthesize to? Who really cares! This is a stupid way to do it! Use a SM.

9
Any loop with internal timing can be done as a SM
module sum_till (clk, rst_n, sum, in);
  input clk, rst_n;
  input [5:0] in;
  output reg [5:0] sum;   // assigned in an always block, so it must be a reg
  wire en_sum;            // declared before it is used below

  always @(posedge clk or negedge rst_n)
    if (~rst_n)
      sum <= 6'h00;
    else if (en_sum)
      sum <= sum + in;

  assign en_sum = (sum < 6'd43) ? 1'b1 : 1'b0;
endmodule
Previous example didn't really even require a SM. This code leaves no question how it would synthesize.
RULE OF THUMB: If it takes you more than 15 seconds to conceptualize how a piece of code will synthesize, then it will probably confuse Synopsys too.

10
FSM Replacement for Loops
■ Not all loop structures supported by vendors
■ Can always implement a loop with internal timing using an FSM
• Can make a "while" loop easily
• Often use counters along with the FSM
[Figure: state diagram fragment with transitions labeled "condition" and a state labeled State3]
■ All synthesizers support FSMs!
■ Synopsys supports for-loops with a static number of iterations
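As an illustration of the "while loop as FSM plus counter" idea, here is a minimal sketch. Everything about it is an assumption made for the example: the module name, the `start`/`done` handshake, the data-dependent count `n`, and the widths. It accumulates `in` for `n` clock cycles and then pulses `done`.

```verilog
// Hypothetical example of a data-dependent loop built as a two-state FSM.
module loop_fsm (
  input  wire        clk, rst_n, start,
  input  wire [3:0]  n,       // number of iterations, known only at run time
  input  wire [7:0]  in,
  output reg  [11:0] sum,     // 12 bits holds up to 15 * 255
  output reg         done
);
  localparam IDLE = 1'b0, RUN = 1'b1;
  reg       state;
  reg [3:0] count;

  always @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      state <= IDLE; count <= 4'd0; sum <= 12'd0; done <= 1'b0;
    end else begin
      case (state)
        IDLE: begin
          done <= 1'b0;
          if (start) begin
            sum   <= 12'd0;
            count <= 4'd0;
            state <= RUN;
          end
        end
        RUN: begin
          if (count < n) begin          // the "while" condition
            sum   <= sum + in;
            count <= count + 4'd1;
          end else begin
            done  <= 1'b1;              // one-cycle pulse
            state <= IDLE;
          end
        end
      endcase
    end
endmodule
```

The loop condition becomes a state transition condition, and the loop variable becomes an ordinary counter register.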

11
Generated Instantiation
■ Generate statements → control over the instantiation/creation of:
  ❑ Modules
  ❑ UDPs & gate primitives
  ❑ Continuous assignments
  ❑ initial blocks & always blocks
■ Generate instantiations are resolved during "elaboration" (compile time)
  ❑ When module instantiations are linked to module definitions
  ❑ Before the design is simulated or synthesized: this is NOT dynamically created hardware

12
Generate-Loop
■ A generate-loop permits making one or more instantiations (pre-synthesis) using a for-loop.
module gray2bin1 (bin, gray);
  parameter SIZE = 8; // this module is parameterizable
  output [SIZE-1:0] bin;
  input [SIZE-1:0] gray;
  genvar i; // does not exist during simulation of design
  generate
    for (i=0; i<SIZE; i=i+1) begin: bit // typically name the generate block for reference
      assign bin[i] = ^gray[SIZE-1:i]; // Data Flow replication
    end
  endgenerate
endmodule

13
Generate Loop
■ Is really just a code replication method, so it can be used with any style of coding. Gets expanded prior to simulation.
module replication_struct(i0, i1, out);
  parameter N=32;
  input [N-1:0] i1, i0;
  output [N-1:0] out;
  genvar j;
  generate
    for (j=0; j<N; j=j+1)
    begin : xor_loop
      xor g1 (out[j], i0[j], i1[j]); // Structural replication (ports are i0/i1, as declared)
    end
  endgenerate
endmodule
Hierarchical reference to these instantiated gates will be:
…xor_loop[0].g1
…xor_loop[1].g1
.
.
.
…xor_loop[31].g1
14
Generate-Conditional
■ A generate-conditional allows conditional (pre-synthesis) instantiation using if-else-if constructs
module multiplier(a, b, product);
  parameter a_width = 8, b_width = 8;
  localparam product_width = a_width+b_width;
  input [a_width-1:0] a;
  input [b_width-1:0] b;
  output [product_width-1:0] product;
  generate
    if ((a_width < 8) || (b_width < 8))
      CLA_multiplier #(a_width,b_width) u1(a, b, product);
    else
      WALLACE_multiplier #(a_width,b_width) u1(a, b, product);
  endgenerate
endmodule
These are parameters, not variables!

15
Generate-Case
■ A generate-case allows conditional (pre-synthesis) instantiation using case constructs
■ See Standard 12.1.3 for more details
module adder (output co, sum, input a, b, ci);
  parameter WIDTH = 8;
  generate
    case (WIDTH)
      1: adder_1bit x1(co, sum, a, b, ci); // 1-bit adder implementation
      2: adder_2bit x1(co, sum, a, b, ci); // 2-bit adder implementation
      default: adder_cla #(WIDTH) x1(co, sum, a, b, ci);
    endcase
  endgenerate
endmodule
Of course the case selector has to be deterministic at elaboration time; it cannot be a variable. Usually it is a parameter.

16
Synthesis Of x And z
■ The only allowable use of x is as a "don't care", since x cannot actually exist in hardware
• in casex
• in defaults of conditionals such as:
  ❑ The else clause of an if statement
  ❑ The default selection of a case statement
■ Only allowable use of z:
• Constructs implying a 3-state output
  ❑ Of course it is helpful if your library supports this!
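A typical construct that implies a 3-state output is a conditional assignment whose inactive branch is z. The sketch below is illustrative; the module and port names are assumptions.

```verilog
// Tri-state driver sketch; tri_out, oe, data and bus are hypothetical names.
module tri_out (
  input  wire       oe,      // output enable
  input  wire [7:0] data,
  output wire [7:0] bus
);
  // Drives data when enabled, otherwise releases the bus (high impedance)
  assign bus = oe ? data : 8'bz;
endmodule
```

Whether this maps to real tri-state buffers depends on the target library, as the slide notes.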

17
Don’t Cares
■ x, ?, or z within case item expression in casex
• Does not actually output "don't cares"!
• Values for which input comparison is to be ignored
• Simplifies the case selection logic for the synthesis tool
casex (state)
  3'b0??: out = 1'b1;
  3'b10?: out = 1'b0;
  3'b11?: out = 1'b1;
endcase

           state[1:0]
state[2]   00  01  11  10
   0        1   1   1   1
   1        0   0   1   1

out = ~state[2] | state[1]

18
Use of Don’t Care in Outputs
■ Can really reduce area
case (state)
  3'b001: out = 1'b1;
  3'b100: out = 1'b0;
  3'b110: out = 1'b1;
  default: out = 1'b0;
endcase

           state[1:0]
state[2]   00  01  11  10
   0        0   1   0   0
   1        0   0   0   1

case (state)
  3'b001: out = 1'b1;
  3'b100: out = 1'b0;
  3'b110: out = 1'b1;
  default: out = 1'bx;
endcase

           state[1:0]
state[2]   00  01  11  10
   0        x   1   x   x
   1        0   x   x   1
19
Unintentional Latches
■ Avoid structural feedback in continuous assignments, combinational always
assign z = a | y;
assign y = b | z;
[Figure: gate schematic of the feedback loop, with a, z, and y labeled]
■ Avoid incomplete sensitivity lists in combinational always
■ For conditional assignments, either:
• Set default values before statement
• Make sure LHS has value in every branch/condition
■ For warning, set hdlin_check_no_latch true before compiling
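The "set default values before statement" option can be sketched as follows. The module, its state encoding, and the signal names are assumptions chosen only to show the pattern: every output gets a value at the top of the block, so branches that say nothing about an output do not infer a latch.

```verilog
// Default-assignment pattern; next_state_logic and its ports are hypothetical.
module next_state_logic (
  input  wire [1:0] state,
  input  wire       go,
  output reg  [1:0] nxt,
  output reg        busy
);
  always @(*) begin
    nxt  = state;     // defaults first: hold state, not busy
    busy = 1'b0;
    case (state)
      2'b00: if (go) nxt = 2'b01;
      2'b01: begin busy = 1'b1; nxt = 2'b10; end
      2'b10: begin busy = 1'b1; nxt = 2'b00; end
      // 2'b11 and the "go == 0" path fall back to the defaults above
    endcase
  end
endmodule
```

The alternative from the slide, assigning the LHS in every branch, gives the same result but is easier to get wrong as the number of outputs grows.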
20
Synthesis Example [1]
module Hmmm(input a, b, c, d, output reg out);
  always @(a, b, c, d) begin
    if (a) out = c | d;
    else if (b) out = c & d;
  end
endmodule
a|b enables latch
How will this synthesize?
Area = 44.02
Either c|d or c&d is passed through an inverting mux depending on state of a / b

21
Synthesis Example [2]
module Better(input a, b, c, d, output reg out);
  always @(a, b, c, d) begin
    if (a) out = c | d;
    else if (b) out = c & d;
    else out = 1'b0;
  end
endmodule
Perhaps what you meant was that if not a or b then out should be zero??
Area = 16.08
Does synthesize better…no latch!

22
Synthesis Example [3]
module BetterYet(input a, b, c, d, output reg out);
  always @(a, b, c, d) begin
    if (a) out = c | d;
    else if (b) out = c & d;
    else out = 1'bx;
  end
endmodule
Area = 12.99
But perhaps what you meant was if neither a nor b then you really don't care what out is.
Hey! Why is b not used?
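A likely explanation: when neither a nor b is set the output is a don't care, so the tool is free to produce c & d there too, and then b no longer affects the result. The sketch below shows one implementation the tool is allowed to pick under that reasoning; the module name is hypothetical and the actual netlist may differ.

```verilog
// One legal implementation of BetterYet once the don't care is exploited.
// b does not appear because "not a" can always select c & d.
module BetterYet_equiv (input a, c, d, output out);
  assign out = a ? (c | d) : (c & d);
endmodule
```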

23
Gated Clocks
■ Use only if necessary (e.g., for low-power)
• Becoming more necessary with demand for many low-power products
[Figure: clk and clk_en feeding a gate that produces the gated clock]
This clock arrives later than the system clock it was derived from. We just created a min-delay problem (race condition, shoot-through).
Min_delay_slack = clk2q - THold - Skew_Between_clks
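For reference, a widely used way to gate a clock without glitches is to capture the enable in a latch that is transparent while the clock is low, then AND it with the clock. This sketch is not taken from the slide, whose schematic was not preserved; the module and signal names are assumptions, and in practice a library clock-gating cell would normally be instantiated instead.

```verilog
// Latch-based clock gate sketch; clk_gate, en_lat and gclk are assumed names.
module clk_gate (
  input  wire clk,
  input  wire clk_en,
  output wire gclk
);
  reg en_lat;

  // Intentional latch: follows clk_en only while clk is low,
  // so the enable cannot change during the high phase
  always @(clk or clk_en)
    if (!clk) en_lat <= clk_en;

  assign gclk = clk & en_lat;
endmodule
```

The AND gate adds delay to `gclk`, which is the skew that the min-delay formula above accounts for.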

24
Gated Clocks
Gated clock domains can't be treated lightly:
1.) Skew between domains
2.) Loading of each domain. How much capacitance is on it? What are its rise/fall times?
3.) RC in route. Is it routed in APR like any old signal, or does it have priority?
Clocks are not signals…don't treat them as if they were.
1.) Use clock tree synthesis (CTS) within the APR tool to balance clock network (usually the job of a trained APR expert)
2.) Paranoid control freaks (like me) like to generate the gated domains in custom logic (like clock reset unit). Then let CTS do a balanced distribution in the APR tool.

Our guest lecturer will cover some of this material….

25
Chain Multiplier
module mult(output reg [31:0] out,
input [31:0] a, b, c, d);

always@(*) begin
out = ((a * b) * c) * d;
end

endmodule

Area: 47381
Delay: 8.37

26
Tree Multiplier
module multtree(output reg [31:0] out,
input [31:0] a, b, c, d);
always@(*) begin
out = (a * b) * (c * d);
end
endmodule

Area: 47590
Delay: 5.75 vs 8.37

27
Multi-Cycle Shared Multiplier
module multshare(output reg [31:0] out,
input [31:0] in, input clk, rst);
reg [31:0] multval;
reg [1:0] cycle;
always @(posedge clk) begin
if (rst) cycle <= 0;
else cycle <= cycle + 1;
out <= multval;
end

always @(*) begin


if (cycle == 2'b0) multval = in;
else multval = in * out;
end
endmodule
28
Multi-Cycle Shared Multiplier (results)

Area: 15990 vs 47500


Delay: 4*3.14

4 clocks, minimum period 3.14

29
Shared Conditional Multiplier
module multcond1(output reg [31:0] out,
                 input [31:0] a, b, c, d, input sel);
  always @(*) begin
    if (sel) out = a * b;
    else out = c * d; // mutually exclusive use of the multiply
  end
endmodule
Area: 15565
Delay: 3.14

30
Selected Conditional Multiplier [1]
module multcond2(output reg [31:0] out,
input [31:0] a, b, c, d, input sel);

wire [31:0] m1, m2;


assign m1 = a * b;
assign m2 = c * d;

always @(*) begin


if (sel) out = m1;
else out = m2;
end

endmodule

31
Selected Conditional Multiplier [2]
■ Area: 30764 vs. 15565
■ Delay: 3.02 vs. 3.14
■ Why is the area larger and delay lower?

32
Decoder Using Indexing

What does synthesis do?

Think of Karnaugh Map

33
Decoder Using Loop

For this implementation how are we directing synthesis to think?

Assign each bit to digital comparator result
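The code on the two decoder slides was not preserved in this copy. As a sketch of what the two styles usually look like, here is a 3-to-8 decoder written both ways; the module names, port names and widths are assumptions and may differ from what the lecture showed.

```verilog
// Indexing style: clear everything, then set the selected bit.
module dec_index (
  input  wire [2:0] sel,
  output reg  [7:0] out
);
  always @(*) begin
    out      = 8'b0;
    out[sel] = 1'b1;
  end
endmodule

// Loop style: each output bit is the result of its own comparator.
module dec_loop (
  input  wire [2:0] sel,
  output reg  [7:0] out
);
  integer i;
  always @(*)
    for (i = 0; i < 8; i = i + 1)
      out[i] = (sel == i);
endmodule
```

Both describe the same function; the difference is in the structure handed to the synthesis tool as a starting point.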

34
Decoder Verilog: Timing Comparison

Loop method starts to look advantageous

35
Decoder Verilog: Area Comparison

Loop method starts to look advantageous

36
Decoder Verilog: Compile Time Comparison

Holy Mackerel Batman!

What is synthesis doing? Why is it working so long and hard?

Looking for shared terms

37
Late-Arriving Signals
■ After synthesis, we will identify the critical path(s) that are limiting the overall circuit speed.
■ Often one signal to a datapath block arrives late.
■ This signal causes the critical path. How to mitigate?
• Circuit reorganization
  ❑ Rewrite the code to restructure the circuit in a way that minimizes the delay with respect to the late-arriving signal
• Logic duplication
  ❑ This is the classic speed-area trade-off. By duplicating logic, we can move signal dependencies ahead in the logic chain.

38
Logic Reorganization Example [1]

39
Logic Reorganization Example [2]

What can we do if A is the late-arriving signal?

40
Logic Reorganization Example [3]

That's right! We have to do the math, and re-arrange the equation so the comparison does not involve an arithmetic operation on the late-arriving signal.
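The slide's own code was not preserved, so the following is only a sketch of the kind of rearrangement described. It assumes a hypothetical comparison `A + B < 24` with 8-bit unsigned operands where A arrives late. Moving B to the other side of the inequality takes the adder off A's path; the extra `B < 24` term keeps the unsigned subtraction from wrapping.

```verilog
// Hypothetical reorganization example; A, B and the constant 24 are assumed.
module late_compare (
  input  wire [7:0] A,        // late-arriving
  input  wire [7:0] B,        // early
  output wire       before,
  output wire       after
);
  // Original form: A goes through an adder, then a comparator
  assign before = ({1'b0, A} + {1'b0, B}) < 9'd24;

  // Reorganized: the arithmetic touches only the early signal B,
  // so A sees just one comparator and an AND gate
  wire [7:0] limit = 8'd24 - B;            // meaningful only when B < 24
  assign after = (B < 8'd24) && (A < limit);
endmodule
```

Both outputs compute the same function; `after` simply does the subtraction before A shows up.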

41
Logic Reorganization Example [4]
Why the area improvement?

Synopsys didn't spend so much effort upsizing gates to try to make transitions faster.
This new design is faster, lower area, and lower power

42
Logic Duplication Example [1]

43
Logic Duplication Example [2]

What if control is the late arriving signal?

44
Logic Duplication Example [3]

45
Logic Duplication Example [4]

46
Exercise
■ Assume we are implementing the code below, and cin is the late-arriving signal. How can we optimize the resulting hardware for speed? At what cost?

reg [3:0] a, b;
reg [4:0] y;
reg cin;

y = a + b + cin;
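One possible answer, sketched under assumptions (the module wrapper and the names `sum0` and `sum1` are not from the lecture): apply logic duplication. Compute the sum for both values of cin ahead of time and let the late signal only drive a final mux. The cost is the area of the second adder.

```verilog
// Logic-duplication sketch for a late carry-in; names are hypothetical.
module add_late_cin (
  input  wire [3:0] a, b,
  input  wire       cin,      // late-arriving
  output wire [4:0] y
);
  wire [4:0] sum0 = a + b;           // result if cin = 0
  wire [4:0] sum1 = a + b + 5'd1;    // result if cin = 1

  // cin now passes through a single 2:1 mux instead of the carry chain
  assign y = cin ? sum1 : sum0;
endmodule
```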

47
Exercise
■ Revise to maximize performance wrt late

reg [3:0] state;
reg late, y, x1, x2, x3;

case(state)
  SOME_STATE:
    if (late) y = x1;
    else y = x2;
  default:
    if (late) y = x1;
    else y = x3;
endcase

Actually, there is nothing you can really do here. This is simple boolean logic, and Synopsys already does a good job optimizing for late-arriving signals.
Coding optimizations often apply more to larger functions (like arithmetic operations and comparisons) than to boolean logic.
48
Mixing Flip-Flop Styles (1)
■ What will this synthesize to?
module badFFstyle (output reg q2, input d, clk, rst_n);
  reg q1;

  always @(posedge clk)
    if (!rst_n)
      q1 <= 1'b0;
    else begin
      q1 <= d;
      q2 <= q1;
    end
endmodule
If !rst_n then q2 is not assigned… It has to keep its prior value
49
Flip-Flop Synthesis (1)
■ Area = 59.0

Note: q2 uses an enable flop (has mux built inside) enabled off rst_n

50
Mixing Flip-Flop Styles (2)

module goodFFstyle (output reg q2, input d, clk, rst_n);
  reg q1;

  always @(posedge clk)
    if (!rst_n) q1 <= 1'b0;
    else q1 <= d;

  always @(posedge clk)
    q2 <= q1;
endmodule
Only combine like flops (same reset structure) in a single always block. If their reset structure differs, split into separate always blocks as shown here.

51
Flip-Flop Synthesis (2)
■ Area = 50.2 (85% of original area)

Note: q2 is now just a simple flop as intended

52
Flip-Flop Synthesis (3)
■ Using asynchronous reset instead
• Bad (same always block): Area = 58.0
• Good (separate always block): Area = 49.1
Note: asynch area is less than synch, and cell count is less (less interconnect)
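The asynchronous-reset code itself is not shown on the slide. A sketch of what the "good" variant presumably looks like follows: the same split into two always blocks, with the reset added to the sensitivity list of the q1 block only. The module name is an assumption.

```verilog
// Assumed async-reset version of goodFFstyle; the name is hypothetical.
module goodFFstyle_async (output reg q2, input d, clk, rst_n);
  reg q1;

  // q1: flop with asynchronous active-low reset
  always @(posedge clk or negedge rst_n)
    if (!rst_n) q1 <= 1'b0;
    else        q1 <= d;

  // q2: plain flop, no reset, kept in its own always block
  always @(posedge clk)
    q2 <= q1;
endmodule
```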

53
