
Using Intel® oneAPI Toolkits with FPGAs*

Prof. Ricardo Menotti ([email protected])
Federal University of São Carlos (UFSCar)

*Special thanks to Susannah Martin for the material and support


Tutorial Objectives

▪ Learn the basics of writing Data Parallel C++ programs
▪ Understand the development flow for FPGAs with the Intel® oneAPI toolkits
▪ Gain an understanding of common optimization methods for FPGAs
▪ …
TUTORIAL AGENDA

The Basics
  The oneAPI Toolset
  Introduction to Data Parallel C++
  Lab: Overview of DPC++

Using FPGAs with the Intel® oneAPI Toolkits
  What are FPGAs and Why Should I Care About Programming Them?
  Development Flow for Using FPGAs with the Intel® oneAPI Toolkits
  Lab: Practice the FPGA Development Flow

Optimizing Your Code for FPGAs
  Introduction to Optimizing FPGAs with the Intel oneAPI Toolkits
  Lab: Optimizing the Hough Transform Kernel
KERNEL Model: parallel_for(num_work_items)

• Execute the kernel in parallel over a 1-, 2-, or 3-dimensional index space
• Each work-item can query its ID and the range of the invocation (num_work_items)

myQueue.submit([&](handler &cgh) {
  stream os(1024, 80, cgh);
  cgh.parallel_for<class myKernel>(range<1>(6),
    [=](id<1> index) {
      os << index << "\n";
    });
});

Output:
id<1>{ 0 }
id<1>{ 1 }
id<1>{ 2 }
id<1>{ 3 }
id<1>{ 4 }
id<1>{ 5 }

Work-items can communicate execution across the ND-range; the sub-group is a DPC++ extension.
KERNEL Model: single_task()

• Similar to CPU code with an outer loop
• Allows many-staged custom hardware to be built in an FPGA

myQueue.submit([&](handler &cgh) {
  stream os(1024, 80, cgh);
  cgh.single_task<class myKernel>([=]() {
    for (int i = 0; i < NUM_ELEMENTS; i++) {
      os << i << "\n";
    }
  });
});

Output: 0 1 2 3 4 5

A custom hardware datapath can be generated in an FPGA for complex single_task kernels.
How It Maps to CPU, GPU, and FPGA

CPU: multi-core, multi-threaded, SIMD, pipelined
GPU: multi-core, multi-threaded, SIMD, pipelined
FPGA: custom pipeline, multi-core (pipeline), built from logic, DSP blocks, and memory blocks
What are FPGAs and Why Should
I Care About Programming Them?
A Brief Introduction
What is an FPGA?
First, let’s define the acronym. It’s a Field-Programmable Gate Array.
“Field-Programmable Gate Array” (FPGA)

▪ “Gates” refers to logic gates, implemented with transistors
  – These are the tiny pieces of hardware on a chip that make up the design
▪ “Array” means there are many of them manufactured on the chip
  – (Many = billions.) They are arranged into larger structures, as we will see
▪ “Field-Programmable” means the connections between the internal components are programmable after deployment

FPGA = Programmable Hardware
Reconfigurable Computing
How an FPGA Becomes What You Want It To Be

The FPGA is made up of small building blocks of logic and other functions.

▪ The building blocks you choose
▪ How you configure them
▪ And how you connect them

determine what function the FPGA performs.
Blocks Used to Build What You’ve Coded

Look-up tables and registers implement, for example, a custom XOR, a custom state machine, or a custom 64-bit bit-shuffle and encode.
Blocks Used to Build What You’ve Coded

On-chip RAM blocks: 20 Kb memory blocks (with addr, data_in, and data_out ports) implement small memories or are combined into larger memories.
Blocks Used to Build What You’ve Coded

DSP blocks implement custom math functions.
Then, It’s All Connected Together

Blocks are connected with custom routing determined by your code.
What About Connecting to the Host?

Accelerated functions run on a PCIe-attached FPGA card.

The host interface is also “baked in” to the FPGA design. This portion of the design is pre-built and not dependent on your source code.
Intel® FPGAs Available
Why should I care about programming for an FPGA?
It all comes down to the advantage of custom hardware.
First, some impressive
examples…

Sample FPGA Workloads

Code to Hardware: An Introduction
Intel® FPGAs

The pre-compiled board support package (BSP) provides the host link, I/O, and memory interface.

Optimized custom compute pipelines (CCPs) are synthesized from your compiled code and connected to on-chip memory.
How Is a Pipeline Built?

Hardware is added for
▪ Computation
▪ Memory loads and stores
▪ Control and scheduling
  – Loops & conditionals

for (int i = 0; i < LIMIT; i++) {
  c[i] = a[i] + b[i];
}

For this loop, the pipeline contains two load units, an adder, a store unit, and loop control, connected by a data path and a control path.
Connecting the Pipeline Together

▪ Handshaking signals are used for variable-latency paths
▪ Operations with a fixed latency are clustered together
▪ Fixed-latency operations improve
  – Area: no handshaking signals required
  – Performance: no potential stalling due to variable latencies
Simultaneous Independent Operations

c = a + b;
f = d * e;

▪ The compiler automatically identifies independent operations
▪ Simultaneous hardware is built to increase performance (here, the adder and the multiplier operate in parallel)
▪ This achieves data parallelism in a manner similar to a superscalar processor
▪ The number of independent operations is only bounded by the amount of hardware
On-Chip Memories Built for Kernel Scope Variables

▪ Custom on-chip memory structures are built for the variables declared within the kernel scope
▪ Or, for memory accessors with a target of local
▪ Load and store units to the on-chip memory structure will be built within the pipeline

//kernel scope
cgh.single_task<>([=]() {
  int arr[1024];
  …
  arr[i] = …;   //store to memory
  …
  … = arr[j];   //load from memory
}); //end kernel scope

For arr, a 1024-deep, 32-bit-wide on-chip memory is built, with store and load units connected to the pipeline.
Pipeline Parallelism for Single Work-Item Kernels

▪ Single work-item kernels almost always contain an outer loop
▪ Work executing in multiple stages of the pipeline is called “pipeline parallelism”
▪ Pipelines from real-world code are normally hundreds of stages long
▪ Your job is to keep the data flowing efficiently

handle.single_task<>([=]() {
  … //accessor setup
  for (int i = 0; i < LIMIT; i++) {
    c[i] += a[i] + b[i];
  }
});
Key Concept: Dependencies Within the Single Work-Item Kernel

Custom built-in dependencies make FPGAs powerful for many algorithms.

When a dependency in a single work-item kernel can be resolved by creating a path within the pipeline, the compiler will build that in.

handle.single_task<>([=]() {
  int b = 0;
  for (int i = 0; i < LIMIT; i++) {
    b += a[i];
  }
});
How Do I Use Tasks and Still Get Data Parallelism?

The most common technique is to unroll your loops.

handle.single_task<>([=]() {
  … //accessor setup
  #pragma unroll
  for (int i = 1; i < 3; i++) {
    c[i] += a[i] + b[i];
  }
});

With the loop fully unrolled, each iteration gets its own copy of the pipeline stages and all iterations start at once.
Unrolled Loops Still Get Pipelined

The compiler will still pipeline an unrolled loop, combining the two techniques.
– A fully unrolled loop will not be pipelined, since all iterations will kick off at once

handle.single_task<>([=]() {
  … //accessor setup
  #pragma unroll 3
  for (int i = 1; i < 9; i++) {
    c[i] += a[i] + b[i];
  }
});

With #pragma unroll 3, iterations enter the pipeline three at a time, and successive groups of three are pipelined behind them.
What About Task Parallelism?

FPGAs can run more than one kernel at a time (for example, the Gzip FPGA sample included with the Intel oneAPI Base Toolkit).
– The limit to how many independent kernels can run is the amount of resources available to build the kernels

Data can be passed between kernels using pipes.
– Another great FPGA feature, explained in the Intel® oneAPI DPC++ FPGA Optimization Guide and sketched below
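A minimal sketch of two kernels communicating through a pipe, using the intel::pipe extension referenced above. The pipe alias, kernel names, buffer names, and the capacity of 8 are illustrative, and the exact header and namespace may differ between toolkit versions; treat it as a sketch rather than the toolkit's reference example.

#include <CL/sycl.hpp>
#include <CL/sycl/intel/fpga_extensions.hpp>
using namespace cl::sycl;

// FIFO of 8 ints connecting the two kernels (name and capacity are illustrative)
using ProducerToConsumer = intel::pipe<class PipeId, int, 8>;

void run(queue &q, buffer<int, 1> &in_buf, buffer<int, 1> &out_buf, int n) {
  q.submit([&](handler &cgh) {                      // producer kernel
    auto in = in_buf.get_access<access::mode::read>(cgh);
    cgh.single_task<class Producer>([=]() {
      for (int i = 0; i < n; i++)
        ProducerToConsumer::write(in[i]);           // blocking pipe write
    });
  });
  q.submit([&](handler &cgh) {                      // consumer kernel, runs concurrently
    auto out = out_buf.get_access<access::mode::write>(cgh);
    cgh.single_task<class Consumer>([=]() {
      for (int i = 0; i < n; i++)
        out[i] = ProducerToConsumer::read() * 2;    // blocking pipe read
    });
  });
}

Because the two single_task kernels only share data through the pipe, the runtime is free to run them at the same time.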
So, Can We Build These? NDRange Kernels

▪ Kernels launched with parallel_for() or parallel_for_work_group() with an NDRange/work-group size of >1

…//application scope
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read>(cgh);
  auto B = B_buf.get_access<access::mode::read>(cgh);
  auto C = C_buf.get_access<access::mode::write>(cgh);
  cgh.parallel_for<class VectorAdd>(num_items, [=](id<1> wiID) {
    c[wiID] = a[wiID] + b[wiID];
  });
});
…//application scope

Yes, no problem, and you will learn to code them! But tasks usually imply more optimal pipeline structures, and the loop optimizations are limited for NDRange kernels.
Development Flow for Using FPGAs with the Intel® oneAPI Toolkits
FPGA Development Flow with oneAPI

▪ FPGA emulator target (emulation)
  – Compiles in seconds
  – Runs completely on the host
▪ Optimization report generation
  – Compiles in seconds to minutes
  – Identify bottlenecks
▪ FPGA bitstream compilation (long compile!)
  – Compiles in hours
  – Enable the profiler to get runtime analysis
Anatomy of a Compiler Command Targeting FPGAs

dpcpp -fintelfpga *.cpp/*.o [device link options] [-Xs arguments]

– dpcpp: the Data Parallel C++ compiler
– -fintelfpga: selects the FPGA target platform
– *.cpp/*.o: input files (source or object)
– [device link options]: device link options
– [-Xs arguments]: FPGA-specific arguments
Emulation

Get it Functional

Does my code give me the correct answers?
Emulation

▪ Quickly generate x86 executables that represent the kernel
▪ Debug support for
  – Standard DPC++ syntax, channels, print statements

dpcpp -fintelfpga <source_file>.cpp -DFPGA_EMULATOR

mycode.cpp → dpcpp compiler → ./mycode.emu (Running …)
Explicit Selection of Emulation Device

dpcpp -fintelfpga <source_file>.cpp -DFPGA_EMULATOR

▪ Include fpga_extensions.hpp
▪ Declare the device_selector as type cl::sycl::intel::fpga_emulator_selector
▪ Include -DFPGA_EMULATOR in your compilation command

#include <CL/sycl/intel/fpga_extensions.hpp>
using namespace cl::sycl;
...
#ifdef FPGA_EMULATOR
  intel::fpga_emulator_selector device_selector;
#else
  intel::fpga_selector device_selector;
#endif

queue deviceQueue(device_selector);
...

A complete minimal example is sketched below.
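Putting the pieces together, the following is a minimal host-program sketch that selects the emulator or FPGA device and runs a trivial single_task kernel; the buffer contents, kernel name, and printed check are illustrative, while the selector types and headers are the ones shown above.

#include <CL/sycl.hpp>
#include <CL/sycl/intel/fpga_extensions.hpp>
#include <iostream>
#include <vector>
using namespace cl::sycl;

int main() {
#ifdef FPGA_EMULATOR
  intel::fpga_emulator_selector device_selector;    // x86 emulation device
#else
  intel::fpga_selector device_selector;             // FPGA hardware device
#endif
  queue deviceQueue(device_selector);

  std::vector<int> data(128, 1);
  {
    buffer<int, 1> data_buf(data.data(), range<1>(data.size()));
    deviceQueue.submit([&](handler &cgh) {
      auto acc = data_buf.get_access<access::mode::read_write>(cgh);
      cgh.single_task<class SimpleAdd>([=]() {
        for (int i = 0; i < 128; i++)
          acc[i] += i;                              // trivial per-element work
      });
    });
  }  // buffer goes out of scope: results are copied back into data

  std::cout << "data[5] = " << data[5] << "\n";     // expect 6
  return 0;
}

Compiled with -DFPGA_EMULATOR, this runs entirely on the host in seconds; without it, the same source targets the FPGA device.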
Using the Static Optimization Report

Get it Optimized

Where are the bottlenecks?

Compiling to Produce an Optimization Report

Two-step method:
dpcpp -fintelfpga <source_file>.cpp -c -o <file_name>.o
dpcpp -fintelfpga <file_name>.o -fsycl-link -Xshardware

One-step method:
dpcpp -fintelfpga <source_file>.cpp -fsycl-link -Xshardware

The default value for -fsycl-link is -fsycl-link=early, which produces an early image object file and the report.

A report showing optimization, area, and architectural information will be produced in <file_name>.prj/reports/
– We will discuss more about the report later
FPGA Bitstream Compilation

Check Runtime Behavior

Check what you can’t check during static analysis.
Compile to FPGA Executable with Profiler Enabled

Two-step method:
dpcpp -fintelfpga <source_file>.cpp -c -o <file_name>.o
dpcpp -fintelfpga <file_name>.o -Xshardware -Xsprofile

One-step method:
dpcpp -fintelfpga <source_file>.cpp -Xshardware -Xsprofile

The profiler will be instrumented within the image, and you will be able to run the executable to collect information to import into the Intel® VTune™ Amplifier.
To compile to an FPGA executable without the profiler, leave off -Xsprofile.
Compiling FPGA Device Separately and Linking

▪ In the default case, the DPC++ compiler handles generating the host executable, the device image, and the final executable
▪ It is sometimes desirable to compile the host and device separately so changes in the host code do not trigger a long compile

Partition the code into has_kernel.cpp and host_only.cpp.

Then run this command to compile the FPGA image (this is the long compile):
dpcpp -fintelfpga has_kernel.cpp -fsycl-link=image -o has_kernel.o -Xshardware

This command produces an object file out of the host-only code:
dpcpp -fintelfpga host_only.cpp -c -o host_only.o

This command puts the object files together into an executable:
dpcpp -fintelfpga has_kernel.o host_only.o -o a.out -Xshardware
References and Resources

▪ Website hub for using FPGAs with oneAPI
  – https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fpga.html
▪ Intel® oneAPI Programming Guide
  – https://software.intel.com/content/www/us/en/develop/download/intel-oneapi-programming-guide.html
▪ Intel® oneAPI DPC++ FPGA Optimization Guide
  – https://software.intel.com/content/www/us/en/develop/download/oneapi-fpga-optimization-guide.html
▪ FPGA Tutorials GitHub
  – https://github.com/intel/BaseKit-code-samples/tree/master/FPGATutorials
Lab: Practice the FPGA Development Flow

Introduction to Optimizing FPGAs with the Intel oneAPI Toolkits
Agenda

▪ Reports
▪ Loop Optimization
▪ Memory Optimization
▪ Other Optimization Techniques
▪ Lab: Optimizing the Hough Transform Kernel

Reports

HTML Report

▪ Static report showing optimization, area, and architectural information
▪ Automatically generated with the object file
  – Located in <file_name>.prj/reports/report.html
▪ Dynamic reference information to the original source code
Optimization Report – Throughput Analysis

▪ Loops Analysis and Fmax II sections
▪ Actionable feedback on the pipeline status of loops
▪ Shows the estimated Fmax of each loop
Optimization Report – Area Analysis

▪ Generates a detailed estimated area utilization report of kernel-scope code
▪ Detailed breakdown of resources by system blocks
▪ Provides architectural details of the hardware
  – Suggestions to resolve inefficiencies
Optimization Report – Graph Viewer

▪ The system view of the Graph Viewer shows the following types of connections
  – Control
  – Memory, if your design has global or local memory
  – Pipes, if your design uses pipes
Optimization Report – Schedule Viewer

▪ Schedule, in clock cycles, for the different blocks in your code
HTML Kernel Memory Viewer

Helps you identify data movement bottlenecks in your kernel design. Illustrates:
▪ Memory replication
▪ Banking
▪ Implemented arbitration
▪ Read/write capabilities of each memory port
Profiler

▪ Inserts counters and profiling logic into the hardware design
▪ Dynamically reports the performance of kernels
▪ Enable using the -Xsprofile option with dpcpp

(Diagram: the load, compute, and store units of a custom compute pipeline are instrumented, and the counters are exposed to the host through memory-mapped registers.)
Collecting Profiling Data

▪ Run the executable that integrates the kernel with the profiler using
  aocl profile -s <path/to/source>.source /path/to/host-executable
▪ A profile.json file will be produced
▪ Import the profile.json file into the Intel® VTune™ Profiler
Importing Profile Data into Intel® VTune™ Profiler

▪ Place the collected profile.json file in a folder by itself
▪ Open the Intel VTune Profiler using the command vtune-gui
▪ Press the Import button at the top of the GUI
▪ Select Import raw trace data
▪ Navigate to the folder in the file browser (do not click into the folder), and Open
▪ Click the blue Import button in the GUI
Loop Optimization

Types of Kernels (Review)

▪ There are two types of kernels in Data Parallel C++
  – Single work-item
  – Parallel
▪ For FPGAs, the compiler will automatically detect the kind of kernel from the input
▪ Loop analysis is only done for single work-item kernels
▪ Most loop optimizations only apply to single work-item kernels
▪ Most optimized FPGA kernels are single work-item kernels
Single Work-Item Kernels

▪ Single work-item kernels are kernels that contain no reference to the work-item ID
▪ Usually launched with the command group handler member function single_task()
▪ Or, launched with other functions and given a work-group/NDRange size of 1
▪ Almost always contain an outer loop

…//application scope
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read>(cgh);
  auto B = B_buf.get_access<access::mode::read>(cgh);
  auto C = C_buf.get_access<access::mode::write>(cgh);
  cgh.single_task<class swi_add>([=]() {
    for (unsigned i = 0; i < 128; i++) {
      c[i] = a[i] + b[i];
    }
  });
});
…//application scope
NDRange Kernels

▪ Kernels launched with the command group handler member function parallel_for() or parallel_for_work_group() with an NDRange/work-group size of >1
▪ Much of this section will not apply to NDRange kernels

…//application scope
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read>(cgh);
  auto B = B_buf.get_access<access::mode::read>(cgh);
  auto C = C_buf.get_access<access::mode::write>(cgh);
  cgh.parallel_for<class VectorAdd>(num_items, [=](id<1> wiID) {
    c[wiID] = a[wiID] + b[wiID];
  });
});
…//application scope
Understanding Initiation Interval

▪ dpcpp will infer pipelined parallel execution across loop iterations
  – Different stages of the pipeline will ideally contain different loop iterations
▪ Best case is that a new piece of data enters the pipeline each clock cycle

cgh.single_task<class swi_add>([=]() {
  for (unsigned i = 0; i < 128; i++) {
    c[i] = a[i] + b[i];
  }
});

(Diagram: on successive clock cycles, iteration n occupies the “load a / load b” stage while iteration n-1 occupies the “c = a + b” stage and iteration n-2 the “store c” stage.)
Loop Pipelining vs Serial Execution

Serial execution is the worst case: one iteration needs to complete fully before a new piece of data enters the pipeline. In the best case, every stage of the pipeline (Op 1, Op 2, Op 3) holds a different iteration on every clock cycle.
In-Between Scenario

▪ Sometimes you must wait more than one clock cycle to input more data
  – Because dependencies can’t resolve fast enough
▪ How long you have to wait is called the Initiation Interval, or II
▪ The total number of cycles to run the kernel is about (loop iterations) × II
  – (neglects the initial latency)
  – For example, with II = 6 the next iteration enters the loop body 6 cycles later, so a 1,000-iteration loop takes roughly 6,000 cycles instead of roughly 1,000 at II = 1
▪ Minimizing II is key to performance
Why Could This Happen?

▪ Memory dependency
  – The kernel cannot retrieve data fast enough from memory

_accumulators[(THETAS*(rho+RHOS))+theta] += increment;

The value must be retrieved from global memory and incremented.
What Can You Do? Use Local Memory

Transfer global memory contents to local memory before operating on the data.

Non-optimized:
…
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class unoptimized>([=]() {
    for (unsigned i = 0; i < N; i++)
      A[N-i] = A[i];
  });
});
…

Optimized:
…
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class optimized>([=]() {
    int B[N];

    for (unsigned i = 0; i < N; i++)
      B[i] = A[i];

    for (unsigned i = 0; i < N; i++)
      B[N-i] = B[i];

    for (unsigned i = 0; i < N; i++)
      A[i] = B[i];
  });
});
…
What Can You Do? Tell the Compiler About Independence

▪ [[intelfpga::ivdep]]
  – Dependencies ignored for all accesses to memory arrays

[[intelfpga::ivdep]]
for (unsigned i = 1; i < N; i++) {
  A[i] = A[i - X[i]];   // dependency ignored for the A and B arrays
  B[i] = B[i - Y[i]];
}

▪ [[intelfpga::ivdep(array_name)]]
  – Dependency ignored for only array_name accesses

[[intelfpga::ivdep(A)]]
for (unsigned i = 1; i < N; i++) {
  A[i] = A[i - X[i]];   // dependency ignored for the A array
  B[i] = B[i - Y[i]];   // dependency for B still enforced
}
Why Else Could This Happen?

▪ Data dependency
  – The kernel cannot complete a calculation fast enough

r_int[k] = ((a_int[k] / b_int[k]) / a_int[1]) / r_int[k-1];

A difficult double-precision floating-point operation must be completed before the next iteration can start.
What Can You Do?

▪ Do a simpler calculation
▪ Pre-calculate some of the operations on the host
▪ Use a simpler type
▪ Use floating-point optimizations (discussed later)
▪ Advanced technique: increase the time (pipeline stages) between the start of the calculation and when you use the answer, as sketched below
  – See the “Relax Loop-Carried Dependency” section in the Optimization Guide for more information
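As a concrete illustration of the advanced technique above, here is a sketch of the shift-register form of a relaxed floating-point accumulation; kLatency, the kernel name, and the a/result accessors are illustrative, and the Optimization Guide's "Relax Loop-Carried Dependency" section remains the authoritative reference.

// Sketch only: relax the loop-carried dependency on a float accumulator by
// accumulating into a shift register of partial sums.
constexpr int kLatency = 8;                 // illustrative; should cover the add latency
cgh.single_task<class relaxed_sum>([=]() {
  float shift_reg[kLatency + 1] = {0.0f};

  for (int i = 0; i < LIMIT; i++) {
    // The value read (shift_reg[0]) was produced kLatency iterations earlier,
    // so the addition has kLatency cycles to complete instead of one.
    shift_reg[kLatency] = shift_reg[0] + a[i];
    #pragma unroll
    for (int j = 0; j < kLatency; j++)
      shift_reg[j] = shift_reg[j + 1];
  }

  float sum = 0.0f;
  #pragma unroll
  for (int j = 0; j < kLatency; j++)        // combine the partial sums once, after the loop
    sum += shift_reg[j];

  result[0] = sum;
});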
How Else to Optimize a Loop? Loop Unrolling

The compiler will still pipeline an unrolled loop, combining the two techniques.
– A fully unrolled loop will not be pipelined, since all iterations will kick off at once

handle.single_task<>([=]() {
  … //accessor setup
  #pragma unroll 3
  for (int i = 1; i < 9; i++) {
    c[i] += a[i] + b[i];
  }
});

With #pragma unroll 3, iterations enter the pipeline three at a time, and successive groups of three are pipelined behind them.
Fmax

▪ The clock frequency the FPGA will be clocked at depends on what hardware your kernel compiles into
▪ More complicated hardware cannot run as fast
▪ The whole kernel will have one clock
▪ The compiler’s heuristic is to sacrifice clock frequency in order to achieve a better (lower) II

A slow operation can slow down your entire kernel by lowering the clock frequency.
How Can You Tell This Is a Problem?

The Fmax II report tells you the target frequency for each loop in your code.

cgh.single_task<example>([=]() {
  int res = N;
  #pragma unroll 8
  for (int i = 0; i < N; i++) {
    res += 1;
    res ^= i;
  }
  acc_data[0] = res;
});
What Can You Do?

▪ Make the calculation simpler
▪ Tell the compiler you’d like to change the trade-off between II and Fmax
  – Attribute placed on the line before the loop
  – Set to a higher II than what the loop currently has
  [[intelfpga::ii(n)]]
Area

The compiler sacrifices area in order to improve loop performance. What if you would like to save on area in some parts of your design?

▪ Give up II for less area
  – Set the II higher than the compiler’s result
  [[intelfpga::ii(n)]]
▪ Give up loop throughput for area
  – The compiler increases loop concurrency to achieve greater throughput
  – Set the max_concurrency value lower than the compiler’s result
  [[intelfpga::max_concurrency(n)]]

An illustrative use of both attributes is sketched below.
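A minimal sketch of where these loop attributes are placed; the kernel name, accessor names, and the values 2 and 1 are illustrative and only demonstrate the syntax of trading II and loop concurrency for area.

// Sketch only: loop attributes that trade performance for area.
cgh.single_task<class area_tuned>([=]() {
  // Accept a higher II than the compiler would otherwise choose, saving area.
  [[intelfpga::ii(2)]]
  for (int i = 0; i < N; i++)
    acc_c[i] = acc_a[i] + acc_b[i];

  // Limit how many invocations of this loop can be in flight at once.
  [[intelfpga::max_concurrency(1)]]
  for (int i = 0; i < N; i++)
    acc_c[i] = acc_c[i] * acc_c[i];
});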
Memory Optimization

Memory Model

▪ Private memory
  – On-chip memory, unique to a work-item
▪ Local memory
  – On-chip memory, shared within a work-group
▪ Global memory
  – Visible to all work-groups

These are the same for single_task kernels.

(Diagram: a kernel sees global memory; each work-group has its own local memory; each work-item has its own private memory. A local-memory accessor example is sketched below.)
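For NDRange kernels, local memory is requested through an accessor with access::target::local, as in this sketch; the buffer name, work sizes, and what the kernel does with the scratchpad are illustrative.

// Sketch only: a local-memory accessor shared by the work-items of a work-group.
deviceQueue.submit([&](handler &cgh) {
  auto out = out_buf.get_access<access::mode::write>(cgh);

  // One 64-element scratchpad per work-group, built from on-chip RAM.
  accessor<int, 1, access::mode::read_write, access::target::local>
      scratch(range<1>(64), cgh);

  cgh.parallel_for<class UsesLocal>(
      nd_range<1>(range<1>(1024), range<1>(64)), [=](nd_item<1> item) {
        size_t lid = item.get_local_id(0);
        scratch[lid] = static_cast<int>(item.get_global_id(0));
        item.barrier(access::fence_space::local_space);   // sync within the work-group
        out[item.get_global_id(0)] = scratch[63 - lid];   // read a neighbour's entry
      });
});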
Understanding Board Memory Resources

Memory Type   Physical Implementation   Latency for random access (clock cycles)   Throughput (GB/s)   Capacity (MB)
Global        DDR                       240                                         34.133              8000
Local         On-chip RAM               2                                           ~8000               66
Private       On-chip RAM / registers   2 / 1                                       ~240                0.2

Key takeaway: many times, the solution for a bottleneck caused by slow memory access will be to use local memory instead of global.
Global Memory Access is Slow – What to Do?

We’ve seen this before: slow global memory access will appear as a memory dependency problem. Transfer global memory contents to local memory before operating on the data.

Non-optimized:
…
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class unoptimized>([=]() {
    for (unsigned i = 0; i < N; i++)
      A[N-i] = A[i];
  });
});
…

Optimized:
…
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class optimized>([=]() {
    int B[N];

    for (unsigned i = 0; i < N; i++)
      B[i] = A[i];

    for (unsigned i = 0; i < N; i++)
      B[N-i] = B[i];

    for (unsigned i = 0; i < N; i++)
      A[i] = B[i];
  });
});
…
Local Memory Bottlenecks

If more load and store points want to access the local memory than there are ports available, arbiters will be added. These can stall, so they are a potential bottleneck. They show up in red in the Memory Viewer section of the optimization report.
Local Memory Bottlenecks

Natively, the memory architecture has 2 ports: the kernel pipeline connects through a local memory interconnect (port 0 and port 1) to the M20K blocks that implement the memory.

The compiler optimizes memory accesses to map to these without arbitration. Your job is to write code the compiler can optimize.
Double-Pumped Memory Example

The compiler can automatically implement double-pumped memory, increasing the memory clock rate to 2x and turning 2 ports into 4.

//kernel scope
...
int array[1024];

array[ind1] = val;
array[ind1+1] = val;

calc = array[ind2] + array[ind2+1];
Local Memory Replication Example

//kernel scope
int array[1024];
int res = 0;

array[ind1] = val;            // store

#pragma unroll
for (int i = 0; i < 9; i++)
  res += array[ind2+i];       // loads

calc = res;

Replication turns the 4 ports of double-pumped memory into effectively unlimited ports.
Drawbacks: logic resources, and stores must go to each replicated copy.
Coalescing

//kernel scope
int array[1024];
int res = 0;

#pragma unroll
for (int i = 0; i < 4; i++)
  array[ind1*4 + i] = val;

#pragma unroll
for (int i = 0; i < 4; i++)
  res += array[ind2*4 + i];

calc = res;

Contiguous addresses can be coalesced into wider accesses.
Banking

Divide the memory into independent fractional pieces (banks).

//kernel scope
int array[1024][2];

array[ind1][0] = val1;
array[ind2][1] = val2;

calc = (array[ind2][0] +
        array[ind1][1]);

The compiler looks at the lower indices by default.
Indices used for banking must be a power-of-2 size.
Attributes for Local Memory Optimization

Note: Let the compiler try on its own first. It’s very good at inferring an optimal structure!

Attribute            Usage
numbanks             [[intelfpga::numbanks(N)]]
bankwidth            [[intelfpga::bankwidth(N)]]
singlepump           [[intelfpga::singlepump]]
doublepump           [[intelfpga::doublepump]]
max_replicates       [[intelfpga::max_replicates(N)]]
simple_dual_port     [[intelfpga::simple_dual_port]]

Note: This is not a comprehensive list. Consult the Optimization Guide for more. An illustrative use is sketched below.
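A minimal sketch of how these attributes are attached to a kernel-scope array; the geometry (8 banks of 4-byte words, single-pumped) and the idx/in/out names are illustrative rather than recommendations.

// Sketch only: explicitly shaping the on-chip memory built for a kernel-scope array.
cgh.single_task<class banked_memory>([=]() {
  // Ask for 8 banks of 32-bit words, single-pumped, so the 8 unrolled accesses
  // below can each hit their own bank without arbitration.
  [[intelfpga::numbanks(8), intelfpga::bankwidth(4), intelfpga::singlepump]]
  int table[64][8];

  #pragma unroll
  for (int j = 0; j < 8; j++)
    table[idx][j] = in[idx * 8 + j];   // 8 stores, one per bank

  int sum = 0;
  #pragma unroll
  for (int j = 0; j < 8; j++)
    sum += table[idx][j];              // 8 loads, one per bank

  out[idx] = sum;
});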
Pipes – Eliminate the Need for Some Memory

Create custom direct point-to-point communication between CCPs with pipes.

(Diagram: instead of the CCPs reading and writing global memory, CCP 1 feeds CCP 2 through a pipe, and CCP 2 feeds CCP 3 through a pipe.)
Task Parallelism By Using Pipes

Launch separate kernels simultaneously


Achieve synchronization and data sharing using pipes
Make better use of your hardware

Lab: Optimizing the Hough Transform Kernel

Other Optimization Techniques
Avoid Expensive Functions

▪ Expensive functions take a lot of hardware and run slowly
▪ Examples
  – Integer division and modulo (remainder) operators
  – Most floating-point operations except addition, multiplication, absolute value, and comparison
  – Atomic functions
Inexpensive Functions

▪ Use instead of expensive functions whenever possible
  – Minimal effect on kernel performance
  – Consume minimal hardware
▪ Examples (see the sketch below)
  – Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
  – Logical operations with one constant argument
  – Shift by a constant
  – Integer multiplication and division by a constant that is a power of 2
  – Bit swapping (endian adjustment)
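A small sketch of swapping an expensive operation for inexpensive ones, assuming the constant is a power of two and the operand is unsigned; the function names are made up.

// Sketch only: replace division/modulo by a power-of-2 constant with shift/mask.
unsigned expensive(unsigned x) {
  return (x / 16) + (x % 16);   // integer divide and modulo: costly FPGA hardware
}

unsigned inexpensive(unsigned x) {
  return (x >> 4) + (x & 0xF);  // shift and mask by constants: nearly free
}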
Use the Least-“Expensive” Data Type

▪ Understand the cost of each data type in latency and logic usage
  – Logic usage may be > 4x for double vs. float operations
  – Latency may be much larger for float and double operations compared to fixed-point types
▪ Measure or restrict the range and precision (if possible)
  – Be familiar with the width, range, and precision of data types
  – Use half or single precision instead of double (the default; see the sketch below)
  – Use fixed point instead of floating point
  – Don’t use float if short is sufficient
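One easy way to pay for double precision by accident is an unsuffixed literal; this sketch, with made-up function names, shows the single-precision version.

// Sketch only: unsuffixed floating-point literals are doubles and can pull a
// whole expression into double-precision hardware.
float scale_promoted(float x) {
  return x * 3.14159;           // 3.14159 is a double: x is promoted, a double multiplier is built
}

float scale_single(float x) {
  return x * 3.14159f;          // the 'f' suffix keeps the whole operation in single precision
}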
Floating-Point Optimizations

▪ Apply to half, float, and double data types
▪ Optimizations will cause small differences in floating-point results
  – Not compliant with the IEEE Standard for Floating-Point Arithmetic (IEEE 754-2008)
▪ Floating-point optimizations:
  – Tree balancing
  – Reducing rounding operations
Tree-Balancing

▪ Floating-point operations are not associative
  – Rounding after each operation affects the outcome
  – i.e., ((a+b) + c) != (a+(b+c))
▪ By default the compiler doesn’t reorder floating-point operations
  – This may create an imbalance in a pipeline, costing latency and possibly area
▪ Manually enable the compiler to balance operations
  – For example, create a tree of floating-point additions in SGEMM, rather than a chain (see the sketch below)
  – Use the -Xsfp-relaxed=true flag when calling dpcpp
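To make the chain-versus-tree distinction concrete, here is a hand-written illustration of the reordering that -Xsfp-relaxed=true permits the compiler to do; the function names are made up.

// Sketch only: the same 4-way sum as a chain (depth 3) and as a balanced tree (depth 2).
float sum_chain(float a, float b, float c, float d) {
  return ((a + b) + c) + d;     // each add waits for the previous one: longer latency
}

float sum_tree(float a, float b, float c, float d) {
  return (a + b) + (c + d);     // the two inner adds can run in parallel: a shorter, balanced pipeline
}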
Rounding Operations

▪ For a series of floating-point operations, IEEE 754 requires multiple rounding operations
▪ Rounding can require a significant amount of hardware resources
▪ Fused floating-point operation
  – Perform only one rounding at the end of the tree of floating-point operations
  – Other processor architectures support certain fused instructions, such as fused multiply and accumulate (FMAC)
  – Here, any combination of floating-point operators can be fused
▪ Use the dpcpp compiler switch -Xsfpc
References and Resources

▪ Website hub for using FPGAs with oneAPI
  – https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fpga.html
▪ Intel® oneAPI Programming Guide
  – https://software.intel.com/content/www/us/en/develop/download/intel-oneapi-programming-guide.html
▪ Intel® oneAPI DPC++ FPGA Optimization Guide
  – https://software.intel.com/content/www/us/en/develop/download/oneapi-fpga-optimization-guide.html
▪ FPGA Tutorials GitHub
  – https://github.com/intel/BaseKit-code-samples/tree/master/FPGATutorials
Upcoming Training

These online trainings are being developed throughout 2020


▪ Converting OpenCL Code to DPC++
▪ Loop Optimization for FPGAs with Intel oneAPI Toolkits
▪ Memory Optimization for FPGAs with Intel oneAPI Toolkits

…and others!

Legal Disclaimers/Acknowledgements

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.

Intel, the Intel logo, Intel Inside, the Intel Inside logo, MAX, Stratix, Cyclone, Arria, Quartus, HyperFlex, Intel Atom, Intel Xeon and Enpirion are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

OpenCL is the trademark of Apple Inc. used by permission by Khronos.

*Other names and brands may be claimed as the property of others.

© Intel Corporation
Notices & Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your
Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata
are available on request. No product or component can be absolutely secure. Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product
when combined with other products. For more complete information visit www.intel.com/benchmarks.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED
WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Copyright ©, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and OpenVINO are trademarks of Intel Corporation or its subsidiaries in the U.S.
and other countries.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not
specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
