ECE 6775
High-Level Digital Design Automation
Fall 2024
Field-Programmable Gate Arrays
(FPGAs)
Announcements
▸ TA-led hands-on tutorial on HLS next Tuesday
– Bring your laptop
Exercise: OI Analysis of 2D Convolution
[Figure: input image frame with R rows and C columns]
Estimate the OI for the 2D convolution kernel both
without and with data reuse
OI Analysis of 2D Convolution w/o LineBuffer
for (r = 1; r < R - 1; r++)
  for (c = 1; c < C - 1; c++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[r][c] += img[r+i-1][c+j-1] * f[i][j];
▸ OI without data reuse
– Number of operations = C*R*9*2, ignoring border pixels
(1 multiply + 1 add per filter tap, 9 taps per output pixel)
– External mem accesses (bytes) = C*R*9
(assuming 1 byte per pixel in grayscale)
– Resulting OI = 2 operations per byte
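The counts above can be checked with a small instrumented version of the loop nest. This is a minimal sketch, not the course's reference code: the `oi_stats` type and the function names are hypothetical, the image is stored as a flat row-major array, and only interior pixels are computed, so the totals are slightly below C*R*9*2 while the ratio stays at 2.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping type for the OI estimate. */
typedef struct {
    uint64_t ops;        /* multiplies + adds */
    uint64_t bytes_read; /* bytes fetched from external memory */
} oi_stats;

/* Naive 3x3 convolution over the interior of an R x C image.
 * Every filter tap re-reads its pixel from external memory. */
static oi_stats conv3x3_naive(size_t R, size_t C, const uint8_t *img,
                              const int f[3][3], int *out)
{
    oi_stats s = {0, 0};
    for (size_t r = 1; r + 1 < R; r++) {
        for (size_t c = 1; c + 1 < C; c++) {
            int acc = 0;
            for (size_t i = 0; i < 3; i++) {
                for (size_t j = 0; j < 3; j++) {
                    acc += img[(r + i - 1) * C + (c + j - 1)] * f[i][j];
                    s.ops += 2;        /* one multiply, one add */
                    s.bytes_read += 1; /* one 1-byte pixel per tap */
                }
            }
            out[r * C + c] = acc;
        }
    }
    return s;
}

/* Operational intensity in operations per byte. */
static double operational_intensity(oi_stats s)
{
    return (double)s.ops / (double)s.bytes_read;
}
```

The sketch counts input-pixel reads only, matching the slide's accounting; output writes and filter coefficients are left out.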
OI Analysis of 2D Convolution w/ LineBuffer
for (r = 1; r < R - 1; r++)
  for (c = 1; c < C - 1; c++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[r][c] += img[r+i-1][c+j-1] * f[i][j];
▸ OI with data reuse using line buffer
– Number of operations = C*R*9*2, ignoring border pixels
– External mem accesses (bytes) = C*R
(each pixel is fetched once, then reused from on-chip buffers)
– OI = 18 operations per byte
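One way to obtain this reuse is a streaming design with two line buffers and a 3x3 window. The code below is a software sketch of that idea under stated assumptions: the names (`conv3x3_linebuf`, `oi_stats`) are hypothetical, the image is a flat row-major array of 1-byte pixels, and the buffers that would live in on-chip memory are modeled with ordinary heap arrays.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t ops;        /* multiplies + adds */
    uint64_t bytes_read; /* bytes fetched from external memory */
} oi_stats;

/* Streaming 3x3 convolution. Pixels arrive in raster order and each one
 * is read from external memory exactly once. Two line buffers hold the
 * previous two rows; a 3x3 window holds the current filter taps. */
static oi_stats conv3x3_linebuf(size_t R, size_t C, const uint8_t *img,
                                const int f[3][3], int *out)
{
    oi_stats s = {0, 0};
    uint8_t *lb0 = calloc(C, 1); /* row r-2 */
    uint8_t *lb1 = calloc(C, 1); /* row r-1 */
    uint8_t win[3][3] = {{0}};
    if (!lb0 || !lb1) { free(lb0); free(lb1); return s; }

    for (size_t r = 0; r < R; r++) {
        for (size_t c = 0; c < C; c++) {
            uint8_t px = img[r * C + c]; /* the only external read */
            s.bytes_read += 1;

            /* Shift the window left and load the new right-hand column. */
            for (size_t i = 0; i < 3; i++) {
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
            }
            win[0][2] = lb0[c];
            win[1][2] = lb1[c];
            win[2][2] = px;

            /* Age the line buffers for this column. */
            lb0[c] = lb1[c];
            lb1[c] = px;

            /* The window covers rows r-2..r and columns c-2..c,
             * so it is centered on output pixel (r-1, c-1). */
            if (r >= 2 && c >= 2) {
                int acc = 0;
                for (size_t i = 0; i < 3; i++)
                    for (size_t j = 0; j < 3; j++) {
                        acc += win[i][j] * f[i][j];
                        s.ops += 2;
                    }
                out[(r - 1) * C + (c - 1)] = acc;
            }
        }
    }
    free(lb0);
    free(lb1);
    return s;
}
```

In this sketch the measured ratio is 18*(R-2)*(C-2) / (R*C), which approaches 18 as the image grows; small images fall short because border pixels are read but produce no output.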
Recap: Design Space Exploration with Roofline
[Figure: roofline plot. x-axis: Operational Intensity (OI), 0 to 60; y-axis: Computational Throughput, 0 to 120. Shows the bandwidth roof, the computational roof, and design points A and B.]
Design points A & B achieve same throughput
But which one would you prefer?
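The two roofs in the plot follow the usual roofline bound: attainable throughput is the smaller of the computational roof and bandwidth times OI. A minimal sketch of that formula follows; the function names are hypothetical and the units are whatever the caller uses consistently (for example GOPS and GB/s).

```c
/* Attainable throughput under the roofline model:
 * min(computational roof, memory bandwidth * operational intensity). */
static double roofline_throughput(double peak_ops, double bandwidth, double oi)
{
    double memory_bound = bandwidth * oi;
    return memory_bound < peak_ops ? memory_bound : peak_ops;
}

/* OI at which the bandwidth roof meets the computational roof. */
static double ridge_point(double peak_ops, double bandwidth)
{
    return peak_ops / bandwidth;
}
```

With made-up numbers (peak 100, bandwidth 5), the convolution at OI = 2 would be limited to 10 by memory, while the line-buffered version at OI = 18 could reach 90; neither value comes from the plot itself.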
Agenda
▸ FPGA introduction
– Basic building blocks
– Classical homogeneous FPGA architectures
– Modern heterogeneous FPGA architectures
Tradeoff between Compute Efficiency and Flexibility
[Figure: spectrum of compute platforms from FLEXIBILITY to EFFICIENCY, in the order CPUs, GPUs, FPGAs, ASICs. The CPU is drawn with registers, a control unit (CU), and an arithmetic logic unit (ALU).]
What Are FPGAs
▸ Field-programmable gate array
– Can be configured to act like any circuit after manufacturing
– Can do many things – we focus on computation acceleration
FPGAs Come In Many Forms
[Figure: FPGA deployment forms (PCIe-attached, in-storage, CPU-integrated, in-network)]
Building Blocks of Modern FPGA Architectures
▸ A programmable array of logic blocks (LUT, FF),
interconnects, I/Os, and dedicated blocks (BRAM, DSP)
[Figure: programmable array of logic blocks built from look-up tables (LUTs), with dedicated DSP blocks]
Counting Boolean Functions
▸ How many distinct 2-input 1-output Boolean
functions exist?
▸ What about K inputs?
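One way to approach the count: a K-input function is fully described by its truth table, which has 2^K rows, and each row can independently output 0 or 1, giving 2^(2^K) distinct functions. The helper below is a hypothetical sketch that evaluates this for small K; it returns 0 when the result no longer fits in 64 bits.

```c
#include <stdint.h>

/* Number of distinct K-input, 1-output Boolean functions: 2^(2^K).
 * Returns 0 if the count does not fit in a uint64_t (K > 5). */
static uint64_t count_boolean_functions(unsigned k)
{
    if (k > 5)
        return 0;
    unsigned rows = 1u << k;    /* truth-table rows */
    return (uint64_t)1 << rows; /* one free output bit per row */
}
```

For K = 2 this gives 16, and for K = 3 it gives 256.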
Multiplexer as a Universal Gate
▸ Any function of k variables can be implemented with a
2^k:1 multiplexer
Example: full adder truth table
A  B  Cin | S  Cout
0  0  0   | 0  0
0  0  1   | 1  0
0  1  0   | 1  0
0  1  1   | 0  1
1  0  0   | 1  0
1  0  1   | 0  1
1  1  0   | 0  1
1  1  1   | 1  1
[Figure: 8:1 MUX producing Cout, with data inputs 0 to 7 and select lines S2 S1 S0; the connections are marked "?" for the reader to fill in]
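One possible way to fill in the MUX is to wire the select lines to the inputs and the data inputs to the Cout column of the truth table. The sketch below assumes the mapping S2 = A, S1 = B, S0 = Cin, which is a choice made here rather than something the slide fixes; the function names are hypothetical.

```c
/* Generic 8:1 multiplexer: returns data input number sel (0..7). */
static unsigned mux8(const unsigned char data[8], unsigned sel)
{
    return data[sel & 7u] & 1u;
}

/* Full-adder carry-out built from one 8:1 MUX. */
static unsigned full_adder_cout(unsigned a, unsigned b, unsigned cin)
{
    /* Data inputs 0..7 copied from the Cout column, rows 000 to 111. */
    static const unsigned char cout_column[8] = {0, 0, 0, 1, 0, 1, 1, 1};
    /* Assumed select wiring: S2 = A, S1 = B, S0 = Cin. */
    unsigned sel = ((a & 1u) << 2) | ((b & 1u) << 1) | (cin & 1u);
    return mux8(cout_column, sel);
}
```

Swapping in the S column (0, 1, 1, 0, 1, 0, 0, 1) would produce the sum output with the same MUX and wiring.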
Look-Up Table (LUT)
▸ A k-input LUT (k-LUT) can be configured to implement any
k-input 1-output combinational logic
– 2^k SRAM bits
– Delay is independent of logic function
[Figure: a 3-input LUT. Eight SRAM cells, each holding 0/1, feed a MUX selected by x2 x1 x0 that drives output Y.]
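A software model can make the "SRAM bits plus MUX" view concrete. In this sketch the 2^k configuration bits are packed into one 64-bit word, so it assumes k <= 6; the `lut` type and all function names are hypothetical.

```c
#include <stdint.h>

/* A k-input LUT modeled as 2^k configuration bits (assumes k <= 6). */
typedef struct {
    unsigned k;
    uint64_t sram;
} lut;

/* Program the LUT by enumerating the truth table of fn. */
static lut lut_program(unsigned k, unsigned (*fn)(unsigned inputs))
{
    lut l = {k, 0};
    for (unsigned x = 0; x < (1u << k); x++)
        if (fn(x) & 1u)
            l.sram |= (uint64_t)1 << x;
    return l;
}

/* Evaluation is one indexed read, whatever function was programmed. */
static unsigned lut_eval(lut l, unsigned inputs)
{
    unsigned mask = (1u << l.k) - 1u;
    return (unsigned)((l.sram >> (inputs & mask)) & 1u);
}

/* Two example 3-input functions: parity (XOR) and majority. */
static unsigned xor3(unsigned x) { return ((x >> 2) ^ (x >> 1) ^ x) & 1u; }
static unsigned maj3(unsigned x)
{
    return ((x & 3u) == 3u) || ((x & 5u) == 5u) || ((x & 6u) == 6u);
}
```

Programming the same 3-LUT with `xor3` or `maj3` only changes the stored bits (0x96 versus 0xE8 in this encoding); the read path is identical, which mirrors the point that delay does not depend on the logic function.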
Exercise: Implementing Logic with LUTs
▸ Implement a 2:1 MUX using a network of 2-input
LUTs. Use the minimum number of LUTs
[Figure: 2:1 MUX with data inputs I0 and I1, select S, and output Y; building block: 2-input LUT (2-LUT)]
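One candidate solution uses three 2-LUTs: two to gate each data input with the select, and one to OR the results. The sketch below models each 2-LUT as four configuration bits; the function names and the bit encoding are assumptions made for illustration.

```c
/* A 2-input LUT: 4 configuration bits indexed by (in1, in0). */
static unsigned lut2(unsigned config, unsigned in1, unsigned in0)
{
    unsigned idx = ((in1 & 1u) << 1) | (in0 & 1u);
    return (config >> idx) & 1u;
}

/* 2:1 MUX, Y = S ? I1 : I0, built from three 2-LUTs. */
static unsigned mux2_from_lut2(unsigned i0, unsigned i1, unsigned s)
{
    unsigned t0 = lut2(0x2u, s, i0); /* I0 AND (NOT S) */
    unsigned t1 = lut2(0x8u, s, i1); /* I1 AND S */
    return lut2(0xEu, t1, t0);       /* t0 OR t1 */
}
```

Two 2-LUTs do not appear to be enough: the second LUT could see only one primary input plus a single bit summarizing the other two, and a MUX needs more information than that one bit can carry.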
A Logic Element
▸ A k-input LUT is usually followed by a flip-flop (FF) that
can be bypassed
▸ The LUT and FF combined form a logic element
A Logic Block
▸ A logic block clusters
multiple logic elements
Arithmetic Circuitry in Logic Block
▸ Xilinx (now AMD): LUTs implement carry propagate and generation logic
▸ Intel/Altera: LUTs pass inputs to hardened adders
Routing Architecture
[Figure: hierarchical routing architecture vs. island-style routing architecture]
Traditional Homogeneous FPGA Architecture
[Figure: 2D array of logic blocks connected through routing tracks and switch blocks]
Modern Heterogeneous Field-Programmable
System-on-Chip (SoC)
▸ Island-style configurable mesh routing
▸ Lots of dedicated components
– Memories/multipliers, I/Os, processors
– Specialization leads to higher performance and lower power
[Figure credit: [Link]]
Dedicated DSP Blocks
▸ Built-in components for fast arithmetic operation
optimized for DSP applications
– Essentially a multiply-accumulate core with many
other features
– Fixed logic and connections, functionality may be
configured using control signals at run time
– Much faster than LUT-based implementation (ASIC
vs. LUT)
Example: Xilinx DSP48E Slice
▸ 25x18 signed multiplier
▸ 48-bit add/subtract/accumulate
▸ 48-bit logic operations
▸ SIMD operations (12/24 bit)
▸ Pipeline registers for high speed
[source: AMD Xilinx]
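The multiply-accumulate core can be illustrated with a behavioral model of the bit widths listed above. This is only a sketch of the arithmetic, not the actual slice datapath: it ignores the pipeline registers, SIMD modes, and logic operations, and the names `sext` and `mac48` are hypothetical.

```c
#include <stdint.h>

/* Sign-extend the low `bits` bits of x (1 <= bits <= 63). */
static int64_t sext(uint64_t x, unsigned bits)
{
    uint64_t sign = (uint64_t)1 << (bits - 1);
    uint64_t mask = (sign << 1) - 1;
    x &= mask;
    return (int64_t)(x ^ sign) - (int64_t)sign;
}

/* One multiply-accumulate step: acc + a*b, with a truncated to 25 bits,
 * b truncated to 18 bits, and the result wrapped to 48 bits. */
static int64_t mac48(int64_t acc, int32_t a, int32_t b)
{
    int64_t a25 = sext((uint64_t)(int64_t)a, 25);
    int64_t b18 = sext((uint64_t)(int64_t)b, 18);
    int64_t product = a25 * b18; /* fits in 43 bits */
    return sext((uint64_t)acc + (uint64_t)product, 48);
}
```

The 48-bit accumulator leaves headroom above the 43-bit product, so many products can be summed before the accumulator wraps.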
Dedicated Block RAMs (BRAMs)
▸ Example: Xilinx 18K/36K block RAMs
– 32k x 1 to 512 x 72 in one 36K block
– Simple dual-port and true dual-port configurations
– Built-in FIFO logic
– 64-bit error correction coding per 36K block
[Figure: 18K/36K block RAM with two ports. Port A: DIA, DIPA, ADDRA, WEA, ENA, CLKA, DOA, DOPA. Port B: DIB, DIPB, ADDRB, WEB, ENB, CLKB, DOB, DOPB.]
[source: AMD Xilinx]
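The two aspect ratios quoted above do not hold the same number of bits, which a quick calculation shows. The helper name below is hypothetical; the arithmetic is just depth times width.

```c
/* Total storage bits for a depth x width memory configuration. */
static unsigned long bram_bits(unsigned long depth, unsigned long width)
{
    return depth * width;
}
```

Here `bram_bits(32768, 1)` is 32768 bits (32 Kb) while `bram_bits(512, 72)` is 36864 bits (36 Kb). The difference comes from the parity bits: a 72-bit word is 64 data bits plus 8 parity bits, carried on the DIPA/DOPA and DIPB/DOPB ports, and the narrowest configurations do not use them.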
An Embedded FPGA SoC
▸ Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz to 1 GHz
▸ Up to 350K logic cells, 2MB Block RAM, 900 DSP48s
Xilinx Zynq All Programmable System-on-Chip
[Source: AMD Xilinx]
A Cloud FPGA Instance
▸ ~2 Million logic blocks
▸ ~5000 DSP blocks
▸ ~300Mb block RAM
AWS F1 instance: AMD Xilinx UltraScale+ VU9P
[Figure source: David Pellerin, AWS]
An Even More Heterogeneous (FPGA) Accelerator
AMD Xilinx Versal Architecture Overview
▸ Adaptable Engines
– 2X compute density
▸ Intelligent Engines
– AI Compute
– Diverse DSP workloads
▸ Scalar Engines
– Platform Control
– Edge Compute
▸ Network-on-Chip
– Guaranteed Bandwidth
– Enables SW Programmability
▸ Protocol Engines
– Integrated 600G cores
– 4X encrypted bandwidth
▸ Programmable I/O
– Any sensor, any interface
– Extendable peripheral set
▸ DDR Memory
– 2X bandwidth/pin
– Server-class density
▸ PCIe & CCIX
– 2X PCIe & DMA bandwidth
– Cache-coherent interface to accelerators
▸ Transceivers
– Broad range, 25G → 112G
– 58G in mainstream devices
[source: AMD Xilinx, © Copyright 2018 Xilinx]
Key Advantages of FPGA-Based Computing
▸ Massive amount of fine-
grained parallelism
– Highly parallel and/or deeply
pipelined architecture
▸ Silicon (re)configurable to
fit the application
– Compute at desired numerical
accuracy
– Customized memory hierarchy
⇒ low (and predictable) latency
⇒ higher energy efficiency
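"Compute at desired numerical accuracy" often means picking a fixed-point format no wider than the application needs. The sketch below assumes a signed fixed-point representation with a caller-chosen number of fractional bits; this is one common choice, not something the slide prescribes, and the function names are hypothetical.

```c
#include <stdint.h>

/* Convert a real value to fixed point with `frac` fractional bits,
 * rounding to nearest. */
static int32_t to_fixed(double x, unsigned frac)
{
    double scaled = x * (double)(1u << frac);
    return (int32_t)(scaled + (scaled >= 0 ? 0.5 : -0.5));
}

/* Convert a fixed-point value back to a real value. */
static double to_double(int32_t v, unsigned frac)
{
    return (double)v / (double)(1u << frac);
}

/* Multiply two fixed-point values: widen to 64 bits, then rescale.
 * The division truncates toward zero. */
static int32_t fixed_mul(int32_t a, int32_t b, unsigned frac)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (int32_t)(wide / ((int64_t)1 << frac));
}
```

With 8 fractional bits, 1.5 and 0.25 become 384 and 64, and their product comes back as 0.375. On an FPGA the word width and `frac` can be sized per signal instead of being fixed at 32 bits as in this software model.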
Next Lecture
▸ Analysis of Algorithms
Acknowledgements
▸ These slides contain/adapt materials developed by
– Prof. Jason Cong (UCLA)
– Andrew Boutros and Prof. Vaughn Betz (Univ. of Toronto)
– UCI CS295 by Prof. Sang-Woo Jun