
Lecture04 - High-Level Digital Design Automation

The document outlines the course ECE 6775 on High-Level Digital Design Automation, focusing on Field-Programmable Gate Arrays (FPGAs) and their architectures. It includes exercises on operational intensity analysis for 2D convolution, discussions on FPGA components, and the advantages of FPGA-based computing. The agenda also mentions upcoming topics such as algorithm analysis and acknowledges contributions from various professors.


ECE 6775

High-Level Digital Design Automation


Fall 2024

Field-Programmable Gate Arrays (FPGAs)
Announcements

▸TA-led hands-on tutorial on HLS next Tuesday


– Bring your laptop

1
Exercise: OI Analysis of 2D Convolution
[Figure: input image frame with C columns and R rows]

Estimate the operational intensity (OI) of the 2D convolution kernel both without and with data reuse

2
OI Analysis of 2D Convolution w/o LineBuffer
[Figure: input image frame with C columns and R rows]

/* 3x3 convolution over interior pixels; out[][] is assumed zero-initialized */
for (r = 1; r < R - 1; r++)
  for (c = 1; c < C - 1; c++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[r][c] += img[r+i-1][c+j-1] * f[i][j];

▸ OI without data reuse
– Number of operations ≈ C*R*9*2
(9 filter taps per output pixel, 1 multiply + 1 add per tap)
– External mem accesses (bytes) ≈ C*R*9
(assuming 1 byte per pixel in grayscale)
– Resulting OI = 2
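The estimate above can be reproduced with a few lines of C. This is a minimal sketch: the function names are illustrative, border pixels and output writes are ignored as in the slide's estimate, and the frame size used when calling it is up to the reader.

```c
#include <assert.h>

/* Arithmetic operations for a 3x3 convolution over a rows x cols frame:
   9 filter taps per output pixel, 1 multiply + 1 add per tap.
   Border effects are ignored, as in the estimate above. */
static double conv3x3_ops(long rows, long cols) {
    assert(rows > 0 && cols > 0);
    return (double)rows * (double)cols * 9.0 * 2.0;
}

/* External memory traffic in bytes when every tap re-reads its input
   pixel from external memory (no on-chip reuse). */
static double conv3x3_bytes_no_reuse(long rows, long cols, long bytes_per_pixel) {
    assert(rows > 0 && cols > 0 && bytes_per_pixel > 0);
    return (double)rows * (double)cols * 9.0 * (double)bytes_per_pixel;
}

/* Operational intensity: operations per byte of external memory traffic. */
static double operational_intensity(double ops, double bytes) {
    assert(bytes > 0.0);
    return ops / bytes;
}
```

Calling `operational_intensity(conv3x3_ops(R, C), conv3x3_bytes_no_reuse(R, C, 1))` gives 2 for any frame size, because the C*R*9 factor cancels.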

3
OI Analysis of 2D Convolution w/ LineBuffer
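[Figure: input image frame with C columns and R rows]

/* 3x3 convolution over interior pixels; out[][] is assumed zero-initialized */
for (r = 1; r < R - 1; r++)
  for (c = 1; c < C - 1; c++)
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        out[r][c] += img[r+i-1][c+j-1] * f[i][j];

▸ OI with data reuse using line buffer
– Number of operations ≈ C*R*9*2
– External mem accesses (bytes) = C*R
(each input pixel is read from external memory once)
– OI = 18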
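One way to obtain this reuse is to stream the frame in raster order through two line buffers and a 3x3 window. The sketch below is a software model under stated assumptions, not the course's reference design: `MAX_COLS`, the function name, and the read counter are hypothetical, and only interior output pixels are written.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COLS 64  /* hypothetical upper bound on frame width for this sketch */

/* 3x3 convolution over a rows x cols 8-bit frame stored row-major in img.
   Every input pixel is fetched from img exactly once; the two line buffers
   and the 3x3 window hold everything that is reused. Only interior outputs
   are written. Returns the number of external reads. */
static long conv3x3_linebuffer(const unsigned char *img, int *out,
                               int rows, int cols, const int f[3][3]) {
    unsigned char line[2][MAX_COLS];  /* line[0]: row r-2, line[1]: row r-1 */
    unsigned char win[3][3];
    long reads = 0;

    assert(rows >= 3 && cols >= 3 && cols <= MAX_COLS);
    memset(line, 0, sizeof line);
    memset(win, 0, sizeof win);

    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            unsigned char px = img[(size_t)r * cols + c];  /* the only external read */
            reads++;

            /* Shift the window left by one column. */
            for (int i = 0; i < 3; i++) {
                win[i][0] = win[i][1];
                win[i][1] = win[i][2];
            }
            /* New right column: two buffered rows plus the fresh pixel. */
            win[0][2] = line[0][c];
            win[1][2] = line[1][c];
            win[2][2] = px;

            /* Age the line buffers for column c. */
            line[0][c] = line[1][c];
            line[1][c] = px;

            /* The window is now centered on (r-1, c-1). */
            if (r >= 2 && c >= 2) {
                int acc = 0;
                for (int i = 0; i < 3; i++)
                    for (int j = 0; j < 3; j++)
                        acc += win[i][j] * f[i][j];
                out[(size_t)(r - 1) * cols + (c - 1)] = acc;
            }
        }
    }
    return reads;
}
```

The returned read count equals rows * cols, which is the C*R external traffic assumed in the estimate, while the multiply-add work per output pixel is unchanged.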

4
Recap: Design Space Exploration with Roofline
[Figure: roofline plot. Y-axis: computational throughput (0 to 120); x-axis: operational intensity (OI, 0 to 60). Shows the bandwidth roof, the computational roof, and design points A and B.]

Design points A & B achieve the same throughput. But which one would you prefer?
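The roofline bound itself is a one-line formula: attainable throughput is the smaller of the computational roof and bandwidth times OI. The sketch below assumes that standard formulation; the function names are illustrative and any roof values passed in are the caller's own numbers, not values read off the plot.

```c
#include <assert.h>

/* Attainable throughput under the roofline model:
   min(computational roof, bandwidth * operational intensity). */
static double roofline_attainable(double peak_ops_per_s,
                                  double bandwidth_bytes_per_s,
                                  double oi_ops_per_byte) {
    assert(peak_ops_per_s > 0.0 && bandwidth_bytes_per_s > 0.0);
    assert(oi_ops_per_byte >= 0.0);
    double memory_bound = bandwidth_bytes_per_s * oi_ops_per_byte;
    return memory_bound < peak_ops_per_s ? memory_bound : peak_ops_per_s;
}

/* OI at which the bandwidth roof meets the computational roof (ridge point). */
static double roofline_ridge_oi(double peak_ops_per_s, double bandwidth_bytes_per_s) {
    assert(peak_ops_per_s > 0.0 && bandwidth_bytes_per_s > 0.0);
    return peak_ops_per_s / bandwidth_bytes_per_s;
}
```

With a hypothetical roof of 100 ops/s and 5 bytes/s of bandwidth, a kernel at OI = 2 is memory-bound while one at OI = 40 sits under the computational roof; the ridge point falls at OI = 20.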


5
Agenda

▸FPGA introduction
– Basic building blocks
– Classical homogeneous FPGA architectures
– Modern heterogeneous FPGA architectures

6
Tradeoff between Compute Efficiency and Flexibility

[Figure: spectrum from FLEXIBILITY to EFFICIENCY, ordered CPUs, GPUs, FPGAs, ASICs. The CPU is drawn with registers, a control unit (CU), and an arithmetic logic unit (ALU).]

7
What Are FPGAs

▸ Field-programmable gate array


– Can be configured to act like any circuit after manufacturing
– Can do many things – we focus on computation acceleration

8
FPGAs Come In Many Forms

▸ PCIe-attached
▸ In-storage
▸ CPU-integrated
▸ In-network

9
Building Blocks of Modern FPGA Architectures

▸ A programmable array of logic blocks (LUT, FF), interconnects, I/Os, and dedicated blocks (BRAM, DSP)

[Figure: FPGA fabric with look-up tables (LUTs) and DSP blocks labeled]

10
Counting Boolean Functions

▸ How many distinct 2-input 1-output Boolean functions exist?

▸ What about K inputs?
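A counting argument answers both questions: a K-input truth table has 2^K rows, and each row's output bit can be chosen independently, so there are 2^(2^K) functions. The helper below is a small sketch of that formula; the function name is illustrative and K is capped at 5 so the result fits in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct k-input, 1-output Boolean functions: a truth table has
   2^k rows and each row's output is chosen independently, giving 2^(2^k).
   Restricted to k <= 5 so the result fits in 64 bits. */
static uint64_t count_boolean_functions(unsigned k) {
    assert(k <= 5);
    unsigned rows = 1u << k;        /* 2^k truth-table rows */
    return (uint64_t)1 << rows;     /* 2^(2^k) possible output columns */
}
```

For K = 2 this gives 16, which includes the constants 0 and 1 as well as AND, OR, XOR, and so on.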

11
Multiplexer as a Universal Gate

▸ Any function of k variables can be implemented with a 2^k:1 multiplexer

A  B  Cin | S  Cout
0  0  0   | 0  0
0  0  1   | 1  0
0  1  0   | 1  0
0  1  1   | 0  1
1  0  0   | 1  0
1  0  1   | 0  1
1  1  0   | 0  1
1  1  1   | 1  1

[Figure: 8:1 MUX producing Cout, with data inputs 0 to 7 and select inputs S2 S1 S0 left as "?" to fill in]
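The full-adder table maps onto an 8:1 MUX by tying each data input to the matching truth-table output and driving the selects with the function's inputs. The sketch below assumes the wiring S2 = A, S1 = B, S0 = Cin, which is one natural choice rather than something the slide fixes; all names are illustrative.

```c
#include <assert.h>

/* 8:1 multiplexer: returns data[s], where s = (s2 s1 s0) is the 3-bit select. */
static int mux8(const int data[8], int s2, int s1, int s0) {
    assert((s2 == 0 || s2 == 1) && (s1 == 0 || s1 == 1) && (s0 == 0 || s0 == 1));
    return data[(s2 << 2) | (s1 << 1) | s0];
}

/* Data inputs are the output columns of the truth table, top to bottom. */
static const int SUM_COLUMN[8]  = {0, 1, 1, 0, 1, 0, 0, 1};
static const int COUT_COLUMN[8] = {0, 0, 0, 1, 0, 1, 1, 1};

/* Assumed select wiring: S2 = A, S1 = B, S0 = Cin. */
static int full_adder_sum(int a, int b, int cin) {
    return mux8(SUM_COLUMN, a, b, cin);
}

static int full_adder_cout(int a, int b, int cin) {
    return mux8(COUT_COLUMN, a, b, cin);
}
```

Changing the function only means changing the eight constants on the data inputs; the multiplexer itself stays the same.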

12
Look-Up Table (LUT)

▸ A k-input LUT (k-LUT) can be configured to implement any k-input 1-output combinational logic
– 2^k SRAM bits
– Delay is independent of logic function

[Figure: a 3-input LUT. Eight SRAM bits (each 0/1) feed a MUX selected by x2 x1 x0, producing output Y]
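A k-LUT can be modeled in software as a 2^k-bit configuration word indexed by the inputs. The following is a minimal sketch of that idea; the type and function names are hypothetical, and k is limited to 5 so the configuration fits in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* A k-input LUT modeled as 2^k configuration (SRAM) bits packed into a word.
   Bit i of sram is the output for the input pattern whose binary value is i. */
typedef struct {
    unsigned k;      /* number of inputs, 1..5 in this sketch */
    uint32_t sram;   /* 2^k configuration bits */
} lut_t;

typedef int (*bool_fn)(unsigned inputs);  /* inputs packed as bits x(k-1)..x0 */

/* "Program" the LUT by evaluating fn on every input pattern. */
static lut_t lut_program(unsigned k, bool_fn fn) {
    assert(k >= 1 && k <= 5);
    lut_t lut = { k, 0 };
    for (unsigned x = 0; x < (1u << k); x++)
        if (fn(x)) lut.sram |= (uint32_t)1 << x;
    return lut;
}

/* Evaluation is one table read: the inputs act as MUX select lines. */
static int lut_eval(const lut_t *lut, unsigned inputs) {
    assert(inputs < (1u << lut->k));
    return (int)((lut->sram >> inputs) & 1u);
}

/* Example function: 3-input majority. */
static int majority3(unsigned x) {
    unsigned ones = (x & 1u) + ((x >> 1) & 1u) + ((x >> 2) & 1u);
    return ones >= 2;
}
```

Because `lut_eval` performs the same lookup whatever the configuration word holds, the model also reflects why LUT delay does not depend on the logic function.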
13
Exercise: Implementing Logic with LUTs

▸ Implement a 2:1 MUX using a network of 2-input LUTs. Use the minimum number of LUTs

[Figure: a 2:1 MUX with data inputs I0 and I1, select S, and output Y. Building block: 2-input LUT (2-LUT)]
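One possible answer uses three 2-LUTs: two to gate each data input with the select, and one to OR the results. The sketch below simulates that network; the configuration constants and function names are illustrative, and it is offered as a candidate solution to check against rather than the official one.

```c
#include <assert.h>

/* A 2-input LUT: 4 configuration bits; bit ((a << 1) | b) is the output. */
static int lut2(unsigned config, int a, int b) {
    assert(config < 16u && (a == 0 || a == 1) && (b == 0 || b == 1));
    return (int)((config >> ((a << 1) | b)) & 1u);
}

#define CFG_NOTA_AND_B 0x2u  /* 1 only for (a, b) = (0, 1) */
#define CFG_AND        0x8u  /* 1 only for (a, b) = (1, 1) */
#define CFG_OR         0xEu  /* 1 unless (a, b) = (0, 0) */

/* Y = I0 when S = 0, I1 when S = 1, built from three 2-LUTs. */
static int mux2_from_lut2(int s, int i0, int i1) {
    int t0 = lut2(CFG_NOTA_AND_B, s, i0);  /* !S & I0 */
    int t1 = lut2(CFG_AND, s, i1);         /*  S & I1 */
    return lut2(CFG_OR, t0, t1);           /* t0 | t1 */
}
```

Two LUTs do not appear to suffice: the second LUT would see one intermediate bit plus one raw input, and a single bit computed from any two of S, I0, I1 cannot carry what the MUX needs from both.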

14
A Logic Element

▸ A k-input LUT is usually followed by a flip-flop (FF) that can be bypassed
▸ The LUT and FF combined form a logic element
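The LUT-plus-bypassable-FF structure can be captured in a small behavioral model. This is a sketch with hypothetical field and function names, not a description of any specific device's logic element.

```c
#include <assert.h>
#include <stdint.h>

/* A logic element modeled as a k-LUT followed by a bypassable D flip-flop.
   Field names are illustrative. */
typedef struct {
    unsigned k;       /* LUT inputs, 1..5 in this sketch */
    uint32_t sram;    /* LUT configuration bits */
    int bypass_ff;    /* 1: output is combinational, 0: output is registered */
    int ff_q;         /* current flip-flop state */
} logic_element_t;

static int le_lut_out(const logic_element_t *le, unsigned inputs) {
    assert(le->k >= 1 && le->k <= 5 && inputs < (1u << le->k));
    return (int)((le->sram >> inputs) & 1u);
}

/* Output seen by the routing fabric before the next clock edge. */
static int le_output(const logic_element_t *le, unsigned inputs) {
    return le->bypass_ff ? le_lut_out(le, inputs) : le->ff_q;
}

/* Rising clock edge: the flip-flop captures the LUT output. */
static void le_clock(logic_element_t *le, unsigned inputs) {
    le->ff_q = le_lut_out(le, inputs);
}
```

With `bypass_ff` cleared the output changes only on `le_clock`, which is how pipelined designs register each stage; with it set the element behaves as pure combinational logic.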

15
A Logic Block

▸ A logic block clusters multiple logic elements

16
Arithmetic Circuitry in Logic Block

▸ Xilinx (now AMD): LUTs implement carry propagate and generate logic
▸ Intel/Altera: LUTs pass inputs to hardened adders
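The propagate and generate signals mentioned here are the per-bit quantities p = a XOR b and g = a AND b that feed a carry chain. The following is a behavioral sketch of ripple-carry addition written in those terms; it is not a model of any vendor's slice, and the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* N-bit ripple-carry addition expressed with per-bit propagate/generate
   signals, the quantities a LUT can compute ahead of a carry chain. */
static uint32_t add_with_carry_chain(uint32_t a, uint32_t b, unsigned nbits,
                                     int cin, int *cout) {
    assert(nbits >= 1 && nbits <= 32 && (cin == 0 || cin == 1));
    uint32_t sum = 0;
    unsigned carry = (unsigned)cin;
    for (unsigned i = 0; i < nbits; i++) {
        unsigned ai = (a >> i) & 1u, bi = (b >> i) & 1u;
        unsigned p = ai ^ bi;               /* propagate: carry-in passes through */
        unsigned g = ai & bi;               /* generate: carry-out regardless of carry-in */
        sum |= (uint32_t)(p ^ carry) << i;  /* sum bit */
        carry = g | (p & carry);            /* carry into the next bit */
    }
    if (cout) *cout = (int)carry;
    return sum;
}
```

Only the `carry` update forms the serial dependency from bit to bit, which is why that path is the one worth hardening in dedicated circuitry.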

17
Routing Architecture

Hierarchical routing architecture vs. island-style routing architecture

18
Traditional Homogeneous FPGA Architecture

[Figure: island-style array with labeled logic blocks, routing tracks, and switch blocks]

19
Modern Heterogeneous Field-Programmable System-on-Chip (SoC)

▸ Island-style configurable mesh routing


▸ Lots of dedicated components
– Memories/multipliers, I/Os, processors
– Specialization leads to higher performance and lower power

20
[Figure credit: [Link]]
Dedicated DSP Blocks

▸ Built-in components for fast arithmetic operations, optimized for DSP applications
– Essentially a multiply-accumulate core with many other features
– Fixed logic and connections; functionality may be configured using control signals at run time
– Much faster than a LUT-based implementation (ASIC vs. LUT)

21
Example: Xilinx DSP48E Slice

▸ 25x18 signed multiplier
▸ 48-bit add/subtract/accumulate
▸ 48-bit logic operations
▸ SIMD operations (12/24 bit)
▸ Pipeline registers for high speed
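The multiplier and accumulator widths listed above can be illustrated with a bit-accurate multiply-accumulate step. This is only a sketch of the arithmetic (a 25-bit by 18-bit signed product added into a 48-bit wrapping accumulator); the function names are hypothetical and the slice's other modes are not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_BITS 48

/* Wrap a value to 48 bits and sign-extend it back into an int64_t. */
static int64_t wrap48(int64_t x) {
    uint64_t u = (uint64_t)x & ((UINT64_C(1) << ACC_BITS) - 1);
    if (u & (UINT64_C(1) << (ACC_BITS - 1)))
        return (int64_t)u - (INT64_C(1) << ACC_BITS);
    return (int64_t)u;
}

/* One multiply-accumulate step: acc + a * b, with a limited to 25-bit signed,
   b to 18-bit signed, and the accumulator wrapping at 48 bits. */
static int64_t mac_25x18(int64_t acc, int32_t a, int32_t b) {
    assert(a >= -(1 << 24) && a < (1 << 24));  /* 25-bit signed range */
    assert(b >= -(1 << 17) && b < (1 << 17));  /* 18-bit signed range */
    assert(acc == wrap48(acc));                /* accumulator already 48-bit */
    return wrap48(acc + (int64_t)a * (int64_t)b);
}
```

A dot product is then a loop of `acc = mac_25x18(acc, x[i], w[i]);`, and the 48-bit accumulator leaves headroom above the 43-bit product for long sums before wrapping.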

[source: AMD Xilinx]
22


Dedicated Block RAMs (BRAMs)

▸ Example: Xilinx 18K/36K block RAMs
– 32K x 1 to 512 x 72 in one 36K block
– Simple dual-port and true dual-port configurations
– Built-in FIFO logic
– 64-bit error correction coding per 36K block

[Figure: 18K/36K block RAM with two ports. Port A signals: DIA, DIPA, ADDRA, WEA, ENA, CLKA, DOA, DOPA. Port B signals: DIB, DIPB, ADDRB, WEB, ENB, CLKB, DOB, DOPB]
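The range of aspect ratios matters when sizing on-chip buffers. The sketch below gives a first-order estimate of how many 36K blocks a depth x width memory needs; the intermediate ratios between 32K x 1 and 512 x 72 are filled in here as an assumption, the names are illustrative, and real synthesis tools may map memories differently.

```c
#include <assert.h>

/* Aspect ratios of one 36K block RAM, from 32K x 1 to 512 x 72.
   The intermediate entries are assumed for this sketch; widths 9, 18, 36,
   and 72 are taken to include the parity bits. */
static const struct { long depth; long width; } BRAM36_CONFIGS[] = {
    {32768, 1}, {16384, 2}, {8192, 4}, {4096, 9},
    {2048, 18}, {1024, 36}, {512, 72},
};

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* First-order estimate: try each aspect ratio, tile the requested
   depth x width memory with it, and keep the smallest block count. */
static long bram36_estimate(long depth, long width) {
    assert(depth > 0 && width > 0);
    long best = -1;
    for (unsigned i = 0; i < sizeof BRAM36_CONFIGS / sizeof BRAM36_CONFIGS[0]; i++) {
        long n = ceil_div(depth, BRAM36_CONFIGS[i].depth) *
                 ceil_div(width, BRAM36_CONFIGS[i].width);
        if (best < 0 || n < best) best = n;
    }
    return best;
}
```

For example, one line of a 1920-pixel-wide, 8-bit image fits in a single block under this estimate, which is what makes BRAM-based line buffers inexpensive.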

[source: AMD Xilinx]
23


An Embedded FPGA SoC
Xilinx Zynq All Programmable System-on-Chip
▸ Dual ARM Cortex-A9 + NEON SIMD extension @ 600 MHz to 1 GHz
▸ Up to 350K logic cells
▸ 2 MB block RAM
▸ 900 DSP48s

[Source: AMD Xilinx]
24
A Cloud FPGA Instance

AWS F1 instance: AMD Xilinx UltraScale+ VU9P
▸ ~2 million logic blocks
▸ ~5000 DSP blocks
▸ ~300 Mb block RAM

[Figure source: David Pellerin, AWS]

25
An Even More Heterogeneous (FPGA) Accelerator

AMD Xilinx Versal Architecture Overview

▸ Scalar Engines
– Platform control
– Edge compute
▸ Adaptable Engines
– 2X compute density
▸ Intelligent Engines
– AI compute
– Diverse DSP workloads
▸ Protocol Engines
– Integrated 600G cores
– 4X encrypted bandwidth
▸ Network-on-Chip
– Guaranteed bandwidth
– Enables SW programmability
▸ Programmable I/O
– Any sensor, any interface
– Extendable peripheral set
▸ DDR Memory
– 2X bandwidth/pin
– Server-class density
▸ PCIe & CCIX
– 2X PCIe & DMA bandwidth
– Cache-coherent interface to accelerators
▸ Transceivers
– Broad range, 25G → 112G
– 58G in mainstream devices

[source: AMD Xilinx, © Copyright 2018 Xilinx]
26
Key Advantages of FPGA-Based Computing

▸ Massive amount of fine-grained parallelism
– Highly parallel and/or deeply pipelined architecture
▸ Silicon (re)configurable to fit the application
– Compute at desired numerical accuracy
– Customized memory hierarchy

⇒ Low (and predictable) latency
⇒ Higher energy efficiency

27
Next Lecture

▸Analysis of Algorithms

28
Acknowledgements

▸ These slides contain/adapt materials developed by


– Prof. Jason Cong (UCLA)
– Andrew Boutros and Prof. Vaughn Betz (Univ. of Toronto)
– UCI CS295 by Prof. Sang-Woo Jun

29
