Lecture RC
[Figure: a von Neumann computer (slow and power hungry) runs a loop over A[0] ... A[7] and Q[0] ... Q[7] from an instruction stream and a data stream:
...
LDI reg_i,0
L1:ANDI r_tmp,$i,xF
...
BLI reg_i,L1
Loop unrolling, pipelining and loop transformations turn the loop into parallel hardware operating on A[1][3..0] ... A[7][3..0] and Q[1] ... Q[7] with the constants 42 and 24.]
• fast parallel processing
• no instr. fetch (no extra memory access)
• no instr. decode
• possibility of dedicated instr. (e.g., MAC)
• lower power
Introduction: Example (Benefits)
Reconfigurable computing permits trading off performance
(speed and/or latency) against area (number of used primitives) of the
reconfigurable architecture. This requires solving the following steps:
What we should know about FPGAs
Slow (~300 MHz), but highly parallel execution (>1000 operations in parallel)
Moderate I/O throughput, but >1 MB of on-chip memory at >1 TB/s
Difficult VHDL programming, but C++ is coming up
PR Advantages: Area Saving
Networking:
Adapt to changing protocols over time (source: www.caida.org)
Encapsulated design of the processing modules
[Figure: FPGA network processor with a dispatcher, a configuration repository, and processing modules for HTTP, VoIP, SSH, FTP]
PR Advantages: Area Saving
Economics of ASIC- and FPGA designs
[Figure: timing diagrams over time t for the modules A, B, C on the stages S0, S1, S2, with the latency marked]
May alternatively allow reducing the clock frequency (and power)
Lower latency might reduce buffer sizes
Example: TLS/SSL, sorting (database acceleration)
May also increase throughput
PR Advantages: Faster Configuration
Full FPGA bitstream can currently be > 20 MB
Flash memory performance 10-20 MB/s
(special high-speed Flash memories reach up to 100 MB/s)
Full initial configuration ~ 1-2 seconds in practice
→ an order of magnitude too slow for PCIe (setup within 100 ms)
Solution: Bootstrapping using PR
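The numbers on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (20 MB bitstream, 10-20 MB/s flash, 100 MB/s high-speed flash, 100 ms PCIe setup time); the function names are hypothetical and protocol overhead is ignored.

```python
PCIE_SETUP_LIMIT_S = 0.100  # PCIe setup within 100 ms (from the slide)


def config_time_s(bitstream_mb: float, flash_mb_per_s: float) -> float:
    """Time to load a bitstream of the given size at the given flash bandwidth."""
    return bitstream_mb / flash_mb_per_s


def max_bootstrap_mb(flash_mb_per_s: float, limit_s: float = PCIE_SETUP_LIMIT_S) -> float:
    """Largest initial (bootstrap) bitstream that still loads within the limit."""
    return flash_mb_per_s * limit_s


for bandwidth in (10.0, 20.0, 100.0):
    t = config_time_s(20.0, bandwidth)
    print(f"{bandwidth:5.0f} MB/s: {t:.2f} s, within PCIe limit: {t <= PCIE_SETUP_LIMIT_S}")

print(f"bootstrap budget at 20 MB/s: {max_bootstrap_mb(20.0):.1f} MB")
```

Under these assumptions even the fastest flash is too slow for a full bitstream, which is why only a small initial configuration is loaded first and the rest follows by partial reconfiguration.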
PR Advantages: IP Reuse
[Figure: design gap, 1980 to 2010, logarithmic scale from 1 to 10 000 000: logic transistors grow by +58% / year, productivity in tr. per man-month by +21% / year. Source: International Technology Roadmap for Semiconductors]
[Figure: two plots over the years 2000 to 2010 and the process nodes 130 nm, 90 nm, 65 nm, 40 nm, 28 nm: device capacity (100 K to 600 K) and bitstream size (5 MB to 30 MB) for Virtex-II, Virtex-II Pro, Virtex-4, Virtex-5, Virtex-6, Virtex-7 and Stratix, Stratix-II, Stratix-III, Stratix-IV, Stratix-V]
*Website: http://www.matnat.uio.no/forskning/prosjekter/crc
Baseline Model of Partial Reconfiguration
The time-multiplex model:
[Figure: configuration data (bitstream) is written through the internal configuration logic, so that the FPGA implements phone, video, MP3, ... one after another instead of all at the same time]
switching have to be mapped on a 2D chip
→ longer routing paths
Difficult tools (manual manipulation or simulation)
[Figure: adder for a31..a28 and b31..b28 built from full adder cells (VA) with the sum outputs s31..s28, unrolled over time; source: http://www.tabula.com]
PR Time-Granularity (single-cycle)
Multi-context FPGAs
originally proposed by Scalera & Trimberger
single-cycle configuration swapping
idea: duplicating all configuration bits for each “plane” and multiplexing between planes
Problem: extra multiplexer required for each configuration bit
All planes have to be mapped on a 2D chip (3D → 2D mapping)
→ longer routing between the primitives <=> lower performance
Bad idea for FPGAs (most of the FPGA die area is spent on configuration SRAM cells) → useful only for coarse-grained architectures
Better: multiplexing between different areas on the FPGA
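The idea of duplicated configuration bits can be illustrated with a small behavioural model. This is a sketch, not a description of a real device: the class name, the 4-input LUT and the two example truth tables are assumptions chosen for illustration.

```python
class MultiContextLUT:
    """4-input LUT whose 16 configuration bits exist once per plane."""

    def __init__(self, planes):
        # planes: one 16-bit truth table per configuration plane
        self.planes = list(planes)
        self.active = 0

    def switch(self, plane: int) -> None:
        """Single-cycle swap: only the plane-select multiplexers change."""
        self.active = plane

    def evaluate(self, inputs: int) -> int:
        """Look up the output bit for the 4-bit input value."""
        table = self.planes[self.active]
        return (table >> (inputs & 0xF)) & 1

    def config_bits(self) -> int:
        """Storage cost: every plane duplicates all 16 configuration bits."""
        return 16 * len(self.planes)


AND4 = 0x8000  # output 1 only for input 1111
OR4 = 0xFFFE   # output 0 only for input 0000

lut = MultiContextLUT([AND4, OR4])
print(lut.evaluate(0b0001))  # plane 0 (AND)
lut.switch(1)
print(lut.evaluate(0b0001))  # plane 1 (OR)
print(lut.config_bits())
```

The model shows the cost named on the slide: two planes need twice the configuration memory plus one select multiplexer per bit, even though only one plane is active at a time.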
PR Time-Granularity (multi-cycle)
Configuration by writing a new configuration bitstream to the device
normal case for all FPGAs from Xilinx and Altera (starting with
the Stratix-V family)
PR in Time and Space
So far, we have only considered one module placed exclusively in a reconfigurable region (temporal partial reconfiguration)
→ extension to multi-module placement of partially reconfigurable modules (spatial partial reconfiguration)
Possibilities for tiling the reconfigurable area into resource slots:
a) island style
b) slot style
c) grid style
[Figure: modules m1 to m4 placed in each of the three styles; legend: static part of the system, unused reconfigurable area, different modules]
The smaller the slots, the lower the internal fragmentation (the waste
of logic resulting from fitting any sized module into a tile-grid, i.e.,
clustering the FPGA area into regular groups of resources)
PR in Time and Space: Efficiency
PR paradox: Runtime reconfiguration is brilliant, but not used!
[Figure: modules m1 to m6 placed in large and in small resource slots; the overhead of each module M consists of internal fragmentation and a constant communication cost c (cconst)]
Internal fragmentation is dominating the overhead
Can be optimized with small slots → 2D placement
(but might result in additional cost for the communication)
2D enhances BRAM/DSP utilization
2D is obligatory for newer FPGA architectures (Virtex-5/6)
Requires adequate on-FPGA communication architectures
Buses
Point-to-point connections
Optimal Resource Slot Size
Internal fragmentation results from fitting modules into a grid of
fixed resource slots.
Analogy: storing files in a filesystem with fixed clusters
Average overhead of a set of n modules:
$$\bar{O} = \frac{1}{n} \sum_{i=1}^{n} \frac{\lceil m_i / (l - c) \rceil \cdot l}{m_i}$$
l : resources in a slot
c : resources for communication in a slot
m_i : resources of module i
[Plot: overhead over the resource slot size in terms of LUTs]
Result: optimal slot size ~200–300 LUTs or ~25–40 CLBs
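The trade-off between internal fragmentation and communication cost can be explored numerically. The sketch below assumes that a module of m_i resources occupies ceil(m_i / (l - c)) slots of l resources each, which is consistent with the limit l/(l-c) for large modules; the function names are hypothetical and any module sizes passed in are illustrative values, not measured data.

```python
import math


def module_overhead(m: int, l: int, c: int) -> float:
    """Occupied resources divided by required resources for one module.

    m: resources of the module, l: resources in a slot,
    c: resources per slot spent on communication (requires l > c).
    """
    slots = math.ceil(m / (l - c))
    return slots * l / m


def average_overhead(modules, l: int, c: int) -> float:
    """Mean overhead factor over a set of modules."""
    return sum(module_overhead(m, l, c) for m in modules) / len(modules)


def best_slot_size(modules, c: int, candidates):
    """Candidate slot size with the lowest average overhead."""
    return min(candidates, key=lambda l: average_overhead(modules, l, c))
```

Calling `best_slot_size(module_sizes, c, range(c + 1, 3000))` with the module sizes of a concrete system sweeps the slot size in the same way as the plot: very small slots pay mostly for communication, very large slots mostly for fragmentation.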
Optimal Resource Slot Size
l : resources in a slot
c : communication
mi : resources of module i
Discussion:
If m_i >> l the overhead converges to l/(l-c), meaning that for
large modules (with many resource slots) the internal
fragmentation becomes negligible.
The optimal slot size can be computed by differentiating the
average module overhead with respect to the slot size l.
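As the ceiling function is discontinuous, its bounds are
considered:
$$\frac{m_i}{l-c} \le \left\lceil \frac{m_i}{l-c} \right\rceil < \frac{m_i}{l-c} + 1$$
[Equations for the worst case and for the average case; the average case is achievable only with 2D grid style placement]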
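A worked sketch of the differentiation for a single module size m. Two assumptions are made here for illustration: the worst case replaces the ceiling by its upper bound (one additional slot), and the average case adds half a slot.

```latex
\begin{align*}
O_{\text{worst}}(l) &= \left(\frac{m}{l-c} + 1\right)\frac{l}{m}
                     = \frac{l}{l-c} + \frac{l}{m} \\
\frac{dO_{\text{worst}}}{dl} &= -\frac{c}{(l-c)^2} + \frac{1}{m} = 0
  \quad\Rightarrow\quad l_{\text{opt}} = c + \sqrt{c\,m} \\
O_{\text{avg}}(l) &= \left(\frac{m}{l-c} + \frac{1}{2}\right)\frac{l}{m}
                   = \frac{l}{l-c} + \frac{l}{2m} \\
\frac{dO_{\text{avg}}}{dl} &= -\frac{c}{(l-c)^2} + \frac{1}{2m} = 0
  \quad\Rightarrow\quad l_{\text{opt}} = c + \sqrt{2\,c\,m}
\end{align*}
```

In both cases the first term l/(l-c) is the limit for large modules, and the second term is the fragmentation that grows with the slot size.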
Optimal Resource Slot Size
One of the best published solutions:
Hagemeyer et al., Design of
Homogeneous Communication
Infrastructures for Partially
Reconfigurable FPGAs
ERSA, USA 2007.
Master and slave support (32 bit)
16 sockets (XC2V4000)
Communication cost: 8554 LUTs
(~three 32-bit CPU-cores)
No I/O support
Resource slot size: 2560 LUTs
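[Figure: 4-input LUT with the inputs A0..A3 and the output F, followed by a slice flip-flop (FF); table of LUT values for the inputs 0111 (7) to 1111 (F) with two example columns: the first is 1 only for input 1111, the second is 1 for all listed inputs]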
PR Space-Granularity (small modules)
Sometimes, even small modules can substantially speed up a system.
Example: reconfigurable customized instruction set extensions
(e.g., with instructions for CRC, DES round, bit swapping)
[Figure: a) and b) two datapaths in which the operands OP_A and OP_B come from the register file, the instruction (instr.) selects a configuration (conf.), and the result is written back]
Difficulty: modules have different resource requirements
Logic
Memory
Multipliers
Placement restrictions (string matching problem)
[Figure: reconfigurable FPGA region with the resource columns L L L L M L L L L L L L L M L L (L: logic, M: memory) and the placement options of a module with the columns L L M L]
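The placement restriction can be treated literally as string matching. The sketch below uses the column strings from the figure (region and module); the function name is hypothetical, and real placers also have to consider routing and the vertical dimension.

```python
def placement_options(region: str, module: str) -> list[int]:
    """Start columns at which the module's resource string fits the region."""
    n, k = len(region), len(module)
    return [start for start in range(n - k + 1) if region[start:start + k] == module]


REGION = "LLLLMLLLLLLLLMLL"  # L: logic column, M: memory column
MODULE = "LLML"

print(placement_options(REGION, MODULE))  # [2, 11]
```

Although the region is 16 columns wide, the module fits at only two positions, because its memory column has to land on one of the two memory columns of the fabric.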
PR Space-Granularity (module coupling)
[Figure: where reconfigurable HW (RHW) attaches to a processor, on a scale from tightly coupled to loosely coupled: register file, cache, system memory; second axis: complexity (size). The processor is drawn with ALU, cache and bus. Systems placed on the scale: [1] DISC, NIOS, NIOS II, [2] GARP, M.Blaze, PPC V4]
Module may contain own register file
Decoupled by FIFO channels (FSL-Fifo)
Parallel execution
Common FPGA-based approaches require an interface
[2] Hauser and Wawrzynek (FCCM 97): GARP: A MIPS Processor with a Reconfigurable Coprocessor
On-FPGA Communication
Goal: an efficient on-FPGA communication architecture that
supports the grid-style module placement.
Classification of different on-chip communication architectures:
[Figure: classification tree with the root On-Chip Communication; recoverable labels: Custom, Uniform, shared Bus, Split bus, Hierarchical, Homogeneous, Heterogeneous]
On-FPGA Communication (History)
Progress in Partial reconfiguration (physical implementation)
using the Xilinx tools over the last decade:
Fundamental problem: binding of the partial module entity signals
to fixed routing resources of the FPGA fabric → "module plug"
[Figure: implementations of the border between PR region and static system, built from gates (NAND, OR) with constant '0' and '1' inputs]
"PR links"
Binding entity signals to the wires crossing the border to
a reconfigurable module.
No logic overhead, cleaner design flow, supports Spartan-6 (Virtex-5, Virtex-6)
On-FPGA Communication: Buses
Bus macros are best suited to integrate modules into islands!
The following slides present structured communication architectures
for slot-based (1D) or grid-style (2D) module placement
ReCoBus Communication
All bus protocols can be implemented by the use of
four signal classes:
shared write (shared master write signals, e.g., address, write_data, R/W)
shared read (e.g., read_data)
dedicated write
dedicated read (dedicated master read signals, e.g., interrupt_1, interrupt_2)
[Figure: the master and the resource slots; read signals are collected by chains of AND (&) and OR (≥1) gates that start with a constant '0' behind a dummy, with one chain each for D7..0, D15..8, D23..16, D31..24]
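How a chain of AND and OR gates acts as a distributed read multiplexer can be shown with a short behavioural model. This is a sketch under the assumption that each slot gates its data with an enable signal and ORs it onto the value coming from the previous slot; the function name and the tuple layout are hypothetical.

```python
def read_chain(slots) -> int:
    """Evaluate a distributed read multiplexer chain.

    slots: list of (enable, data_out) per slot, ordered along the chain.
    Each stage computes (enable AND data_out) OR incoming value.
    """
    value = 0x00  # constant '0' fed into the far end of the chain
    for enable, data_out in slots:
        gated = data_out if enable else 0x00  # AND gate per slot
        value |= gated                        # OR gate per slot
    return value


# Only the selected slot contributes to the value seen by the master.
print(hex(read_chain([(0, 0x12), (1, 0xAB), (0, 0xFF)])))  # 0xab
```

Because a slot with a deasserted enable contributes only zeros, an empty or half-configured slot cannot disturb the bus, provided that at most one enable is active at a time.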
ReCoBus: Interleaving
Problem: the structure of a distributed read multiplexer chain is
unsuitable for very fine-grained resource slot layouts:
[Figure: reconfigurable area divided into Slot 1 to Slot 6 and, more fine-grained, into S1 to S12, with the chains D7..0, D15..8, D23..16, D31..24 running through the slots, each starting with a constant '0']
ReCoBus: Interleaving
Solution: multiple interleaved read multiplexer chains
[Figure: reconfigurable area with the slots S1 to S12; the chains D7..0, D15..8, D23..16, D31..24 are interleaved across the slots, each starting with a constant '0']
ReCoBus: Signal Alignment (1D)
Example system:
[Figure: a CPU and two modules (Module 0, Module 1) placed in the slots S1 to S8; each slot gates its dout with en (&) onto one of four interleaved OR chains (≥1) for D31...D24, D23...D16, D15...D8, D7...D0; alignment multiplexers with the inputs 0 1 2 3 restore the byte order. Legend: start point & mux select value, used connection, unused connection]
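Interleaving and alignment can be sketched together. The model below assumes four byte lanes, that slot s drives lane s mod 4, and that the alignment multiplexer select value equals the start slot mod 4; the slide's figure gives the principle but not this exact mapping, so all names and the mapping are illustrative.

```python
LANES = 4  # D7..0, D15..8, D23..16, D31..24


def lane_of_slot(slot: int) -> int:
    """Interleaving: consecutive slots drive consecutive byte lanes."""
    return slot % LANES


def bus_read(start_slot: int, module_bytes) -> list[int]:
    """Byte lanes as seen on the bus for a module starting at start_slot."""
    lanes = [0] * LANES
    for k, byte in enumerate(module_bytes):
        lanes[lane_of_slot(start_slot + k)] |= byte
    return lanes


def align(lanes, select: int) -> list[int]:
    """Alignment multiplexer: rotate so that module byte 0 comes first."""
    return [lanes[(select + k) % LANES] for k in range(LANES)]


word = [0x11, 0x22, 0x33, 0x44]
raw = bus_read(6, word)               # module placed at slot 6
print([hex(b) for b in raw])          # bytes arrive rotated
print([hex(b) for b in align(raw, 6 % LANES)])
```

The rotation depends only on the start point of the module, which is why a single select value per module is enough to place it at any slot.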
ReCoBus: Signal Alignment (2D)
The signal interleaving scheme can be extended to implement
buses that integrate modules in a 2D grid style.
slot indexing: slot y,x (table entries: start point & mux select value)
       x = 0 1 2 3 4 5 6 7
y = 0      0 1 2 3 0 1 2 3
y = 1      1 2 3 0 1 2 3 0
y = 2      2 3 0 1 2 3 0 1
y = 3      3 0 1 2 3 0 1 2
[Figure: modules m1 to m4 placed in the grid, slot 3,6 marked; four OR chains (≥1) for D31...D24, D23...D16, D15...D8, D7...D0 with alignment multiplexers (inputs 0 1 2 3). Legend: start point & mux select value, used connection, unused connection]
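The slot indexing table of the 2D scheme follows a simple rule that can be generated in code. The rule (x + y) mod 4 is inferred from the table entries, not stated on the slide, and the function name is hypothetical.

```python
def select_value(y: int, x: int, lanes: int = 4) -> int:
    """Start point / mux select value of the slot in row y, column x."""
    return (x + y) % lanes


table = [[select_value(y, x) for x in range(8)] for y in range(4)]
for y, row in enumerate(table):
    print(y, *row)

print(select_value(3, 6))  # value of slot 3,6
```

Shifting the pattern by one per row means that a module spanning several rows still meets every byte lane once within any four neighbouring slots.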
ReCoBus: Dedicated Write Signals
LUTs can be used to decode an address within the bus value
(the table then contains a one-hot value, e.g. for addr. 0xA)
LUT values: 0 for the entries 0 to 9, 1 for entry A, 0 for the entries B to F
[Figure: a 4-input LUT (inputs A0..A3, output F, slice FF) next to the same LUT in SRL16 mode with the shift input Sin and the shift output Sout]
For setting an address, LUT values can be exchanged:
Using the configuration port,
[Figure: shift register LUT (Din, Q0 ... Q15) with the signals config_clock, bus_enable (4 bit), bus_read, module_select, module_reset, module_read]
Two-stage reconfiguration:
1. FPGA: initialize the shift register with 0xFFFF
2. Logic: configure address comparator and activate module
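The behaviour of a LUT used as a 16-bit shift register and address comparator can be modelled in a few lines. This is a behavioural sketch only: the class and function names are hypothetical, and the mapping to the signals config_clock, Din and bus_enable is an assumption.

```python
class SRL16:
    """16-bit shift register whose 4-bit address input selects one stage."""

    def __init__(self, init: int = 0x0000):
        # q[0] is Q0 (next to Din), q[15] is Q15
        self.q = [(init >> i) & 1 for i in range(16)]

    def shift(self, din: int) -> None:
        """One config_clock cycle: Din enters at Q0, every bit moves up."""
        self.q = [din & 1] + self.q[:15]

    def lookup(self, addr: int) -> int:
        """LUT read: returns the stored bit selected by the 4-bit address."""
        return self.q[addr & 0xF]


def configure_address(srl: SRL16, addr: int) -> None:
    """Shift in a one-hot table so that only `addr` selects the module."""
    for index in range(15, -1, -1):  # the bit for Q15 is shifted in first
        srl.shift(1 if index == addr else 0)


comparator = SRL16(init=0xFFFF)       # stage 1: initial value from the bitstream
configure_address(comparator, 0xA)    # stage 2: set the module address
print([comparator.lookup(a) for a in range(16)])
```

Since the table is written by shifting, the same module bitstream can be given a different address at each placement, without touching the configuration port again.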
ReCoBus: Dedicated Write Signals
Arrangement of the address comparators
[Figure: the CPU with module 1 and module 2 placed in slot 0 to slot 5; on the bus side every slot has module select logic (shift register Din, Q0 ... Q15, EN) fed by bus_enable (4 bit), config_data, config_clock and bus_read; on the module side it provides module_select, module_reset and module_read (read_en); the chain ends in a dummy]
look-up table fits into one
Allows module relocation
Multiple instances of a module
(individual module addresses)
Automatic reset generation
No interference by the reconfiguration process (Hot-Plug)
Extra register file look-up for alignment multiplexer control
ReCoBus: Dedicated Write Signals
Assuming an 8-bit interface per slot, it takes at least four
consecutive slots to provide the full interface size
[Figure: the master drives the address A15...A0; the module select logic of each slot decodes A15...A12, A11...A8 and A7...A0 into module_select0 to module_selectE (F reserved), and A7...A0 addresses the module register file (00 to FF); module 1 and module 2 occupy slot 0 to slot 5 next to the CPU; the OR chains (≥1) for the IRQ signals end in dummy sinks]
I/O-Bars
I/O-Bars for Point-to-Point Links
Horizontal routing track within the reconfigurable area
Connections are set by modifying switch matrices
One bar per interface requirement (e.g., video, audio)
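[Figure: Slot 0 to Slot 5 between two parts of the static system; a video bar (video in, video out) and an audio bar (audio in, audio out) run horizontally through all slots next to the ReCoBus; slots without a module bypass the signals]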
I/O-Bars for Point-to-Point Links
• Read-modify-write connection
• Ideal for data streaming
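The read-modify-write behaviour of an I/O bar can be pictured as a stream passing through the slots from left to right. The sketch below is a behavioural model under that assumption; the function name and the two example modules are hypothetical.

```python
def io_bar(stream, slots):
    """Pass a stream through the slots of an I/O bar from left to right.

    slots: one entry per slot, either None (bypass) or a function that
    reads a sample, modifies it and writes it back to the bar.
    """
    for module in slots:
        if module is not None:
            stream = [module(sample) for sample in stream]
    return stream


def invert(pixel: int) -> int:
    return 255 - pixel


def brighten(pixel: int) -> int:
    return min(255, pixel + 16)


# Six slots; modules sit in slot 1 and slot 4, all other slots bypass.
video_out = io_bar([0, 100, 250], [None, invert, None, None, brighten, None])
print(video_out)  # [255, 171, 21]
```

Modules can be added or removed along the bar without changing the static system, which is what makes this connection style attractive for data streaming.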
I/O-Bars for Point-to-Point Links
• I/O bar implementation
I/O-Bars for Point-to-Point Links
I/O-Bar implementation for 2D
Vertical routing is accomplished in the static part
Can be used with interleaving for decreasing latency
(requires signal alignment in each module)
Demo System
248 logic slots (192 LUTs/slot) + 16 RAM slots
8-bit slave bus (up to 48 bit via 6 consecutive slots)
Video streaming
Free placement
Connection cost: 14 LUTs/slot
100 MHz (XC2V6000-6)
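The figures of the demo system can be combined into a few derived numbers. The sketch uses only the values listed above; the variable and function names are hypothetical.

```python
import math

LOGIC_SLOTS = 248
LUTS_PER_SLOT = 192
CONNECTION_COST = 14   # LUTs per slot spent on the bus connection
BITS_PER_SLOT = 8      # width of the slave bus interface in one slot

total_luts = LOGIC_SLOTS * LUTS_PER_SLOT
usable_per_slot = LUTS_PER_SLOT - CONNECTION_COST
connection_share = CONNECTION_COST / LUTS_PER_SLOT


def slots_for_width(bits: int) -> int:
    """Consecutive slots a module needs for a bus interface of this width."""
    return math.ceil(bits / BITS_PER_SLOT)


print(total_luts)                    # 47616 LUTs in the logic slots
print(usable_per_slot)               # 178 LUTs per slot left for the module
print(f"{connection_share:.1%}")     # share of a slot spent on the connection
print(slots_for_width(32), slots_for_width(48))
```

With 14 of 192 LUTs per slot the communication share stays below a tenth, even though the slots are small.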
Demo System
Regularly structured ReCoBus macro (a macro contains logic
and routing and is instantiated like any other VHDL module)
Implementation on an XC2V6000
One CLB provides up to 8 data signals (for read and write)
Lower CLB packing can improve routing (congestion around the connecting resources)
Design Flow
ReCoBus & connection bar protocol specification
Design Entry, Static/Dynamic
[Flow chart: floorplanning and communication budgeting [ReCoBus-Builder]; synthesis of the static system and of module 1 [Xilinx XST], starting from the RTL models. Artifacts: static netlist, static constraints, ReCoBus & I/O bars, module 1 constraints, module 1 templates, module 1 netlist]
Design Flow
ReCoBus & connection bar protocol specification
Physical Implementation
[Flow chart, continuing after floorplanning and communication budgeting [ReCoBus-Builder] and synthesis [Xilinx XST] with the netlists, constraints, ReCoBus & I/O bars and templates:]
place & route static [PAR]
place & route partial module 1 [PAR]
build static bitstream [bitgen]
build module 1 bitstream [bitgen]
bitstream assembly [ReCoBus-Builder]
Design Flow
ReCoBus & connection bar protocol specification
Bitstream Assembly
[ReCoBus-Builder]
[Flow chart: functional simulation [Modelsim] with the check OK? (branch n if not); build static bitstream [bitgen] and build module 1 bitstream [bitgen] produce the static system bitfile and the module 1 bitfile, which are assembled into the full bitfile]
Design Flow
Tested Design
Design Flow: Blocking
Design Flow: Xilinx PlanAhead
New advanced GUI for the complete FPGA design flow
Project management
Floorplanning
Implementation viewer
Source: Xilinx
Design Flow: Xilinx PlanAhead
Step 1: Synthesis of all partial and static modules in individual netlists
(Static netlist has black boxes for the modules)
Step 2: Creation of a new PlanAhead project
Step 3: Creation of Reconfigurable Partitions
A reconfigurable partition (RP) consists of several reconfigurable modules (RM)
Assign a partial netlist to each RM
An RM can also be a black box (empty module)
Design Flow: Xilinx PlanAhead
Step 4: Floorplanning of the reconfigurable partitions
Create Area Groups
PlanAhead automatically creates the communication ports
for the reconfigurable partition
PlanAhead automatically creates the user constraints file (UCF) with the
bounding box definitions of the RPs
Source: Xilinx
Design Flow: Xilinx PlanAhead
Step 5: Run design rule check (DRC) to verify the design
Step 6: Create the first reconfigurable configuration
Consisting of the static module and one RM for each RP
Implement this configuration
Promote this configuration
Step 7: Create further configurations for each module in an RP:
Import the static design
Implement the partial module
Design Flow: Xilinx PlanAhead
Differences between the ReCoBus-Builder approach and PlanAhead:
Slot-style or grid-style vs. island style reconfiguration (island style
has no external fragmentation problem → simple placement)
ReCoBus allows module relocation and multi module instantiation
"proxy logic"
In other words:
computing module placement positions and schedules
[Figure: FPGA sorter connected via PCIe 8x (2GB/s) and through a memory controller (mem-contr.) with a MEM prefetcher (max burst size, max latency) to DDR3 memory; comparators (>) for the inputs A, B, C, D turn an unsorted stream into the sorted output; the FPGA is reconfigured between the steps (context switching)]
initial step: fully sorted sequences
[intermediate steps]: merging
final step: merge and emit result
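The three steps of the sorter correspond to a classic run-based merge sort, which can be sketched in software. This is an analogy under assumptions: the run length, the merge fan-in and all function names are illustrative, and each step here stands for one configuration of the FPGA.

```python
import heapq


def initial_step(stream, run_length: int):
    """Initial step: cut the unsorted stream into fully sorted sequences."""
    return [sorted(stream[i:i + run_length]) for i in range(0, len(stream), run_length)]


def merge_step(runs, fan_in: int):
    """Intermediate step: merge groups of `fan_in` sorted sequences."""
    return [list(heapq.merge(*runs[i:i + fan_in])) for i in range(0, len(runs), fan_in)]


def sort_stream(stream, run_length: int = 4, fan_in: int = 2):
    """Run all steps; fan_in must be at least 2 so that merging makes progress."""
    runs = initial_step(stream, run_length)
    while len(runs) > fan_in:
        runs = merge_step(runs, fan_in)   # [intermediate steps]: merging
    return list(heapq.merge(*runs))       # final step: merge and emit result


print(sort_stream([5, 3, 9, 1, 7, 2, 8, 6, 4, 0], run_length=2, fan_in=2))
```

Only one step is active at a time, so the same FPGA area can be reused for the initial sorter, the mergers and the final merger by switching the configuration between steps.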