Using oneAPI with FPGAs (IXPUG Tutorial)
2
TUTORIAL AGENDA
The Basics
The oneAPI Toolset
Introduction to Data Parallel C++
Lab: Overview of DPC++
3
KERNEL Model
parallel_for( num_work_items )
• Executes the kernel in parallel over a 1-, 2-, or 3-dimensional index space
• Each work-item can query its ID and the range of the invocation (num_work_items)

myQueue.submit([&](handler &cgh) {
  stream os(1024, 80, cgh);
  cgh.parallel_for<class k>(range<1>(num_work_items), [=](id<1> i) {
    … // kernel body elided in the original slide
  });
});
Optimization Notice
Copyright © 2019, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
4
KERNEL Model
single_task( )
• Similar to CPU code with an outer loop
• Allows many-staged custom hardware to be built in an FPGA

myQueue.submit([&](handler &cgh) {
  stream os(1024, 80, cgh);
  cgh.single_task<class k>([=]() {
    … // kernel body elided in the original slide
  });
});
5
How it maps to CPU, GPU, FPGA

CPU:  MULTI-CORE, MULTI-THREADED, SIMD, PIPELINED
GPU:  MULTI-CORE, MULTI-THREADED, SIMD, PIPELINED
FPGA: Custom Pipeline, MULTI-CORE (replicated pipelines)

(FPGA fabric in the diagram: DSP Blocks, Memory Blocks)
6
What are FPGAs and Why Should
I Care About Programming Them?
A Brief Introduction
What is an FPGA?
First, let’s define the acronym. It’s
a Field-Programmable Gate Array.
8
“Field-Programmable Gate Array” (FPGA)
10
How an FPGA Becomes What You Want It To Be
11
Blocks Used to Build What You’ve Coded

Look-up Tables and Registers build, for example:
• A custom XOR
• A custom state machine
• A custom 64-bit bit-shuffle and encode
14
Blocks Used to Build What You’ve Coded

On-chip RAM Blocks (20 Kb each, with addr, data_in, and data_out ports)
combine into small memories or larger memories.
15
Blocks Used to Build What You’ve Coded

DSP Blocks build custom math functions.
16
Then, It’s All Connected
Together
17
What About Connecting to the Host?
18
Intel® FPGAs Available
19
Why should I care about
programming for an FPGA?
It all comes down to the
advantage of custom hardware.
20
First, some impressive
examples…
21
Sample FPGA Workloads
22
Code to Hardware: An
Introduction
23
Intel® FPGAs: Host Link, I/O, Memory Interface (Pre-Compiled BSP)

Custom Compute Pipelines (CCPs), each with its own on-chip memory, are
synthesized from compiled code:
▪ Computation
▪ Memory Loads and Stores + Loop Control
▪ Control and scheduling
  – Loops & Conditionals

for (int i = 0; i < LIMIT; i++) {
  c[i] = a[i] + b[i];   // becomes Load, +, and Store units in the data path
}                       // the loop becomes the control path
25
Connecting the Pipeline Together
26
Simultaneous Independent Operations
27
On-Chip Memories Built for Kernel Scope Variables

//kernel scope
cgh.single_task<>([=]() {
  int arr[1024];   // 1024 x 32-bit
  …
});

▪ For kernel-scope variables, or for memory accessors with a target of
  local, an on-chip memory structure will be built for the array
▪ Load and store units to the on-chip memory will be built within the
  pipeline
28
Pipeline Parallelism for Single Work-Item Kernels

handle.single_task<>([=]() {
  … //accessor setup
  for (int i = 0; i < LIMIT; i++) {
    c[i] += a[i] + b[i];   // two Loads, a +, and a Store in the pipeline
  }
});

▪ Single work-item kernels almost always contain an outer loop
▪ Work executing in multiple stages of the pipeline is called “pipeline
  parallelism”
▪ Pipelines from real-world code are normally hundreds of stages long
▪ Your job is to keep the data flowing efficiently
29
Key Concept

Dependencies Within the Single Work-Item Kernel: custom, built-in handling
of dependencies makes FPGAs powerful for many algorithms.
30
How Do I Use Tasks and Still Get Data Parallelism?

handle.single_task<>([=]() {
  … //accessor setup
  #pragma unroll
  for (int i = 1; i < 3; i++) {
    …
  }
});

Each unrolled iteration gets its own copy of the hardware (Iteration 1:
Stage 1, Stage 2, Stage 3; Iteration 2: Stage 1, Stage 2, Stage 3), so the
iterations execute in parallel.
31
Unrolled Loops Still Get Pipelined

Each unrolled copy of the loop body is itself pipelined (Iteration 1:
Stage 1, Stage 2, Stage 3), so unrolling and pipelining combine.
32
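Unrolling is a transformation the compiler performs in hardware, but the transformation itself can be sketched in plain host-side C++ (this is an illustration, not DPC++ FPGA code; the data values are made up): processing four elements per trip mimics a 4-way unrolled loop, whose four additions are independent operations the FPGA can schedule in the same cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a vector the straightforward way: one element per iteration.
long long sum_simple(const std::vector<int>& v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); i++) s += v[i];
    return s;
}

// The same sum with the body manually unrolled 4x, mimicking what
// "#pragma unroll 4" asks the FPGA compiler to do in hardware.
long long sum_unrolled4(const std::vector<int>& v) {
    long long s = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4)
        s += v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < v.size(); i++)  // remainder loop for leftover elements
        s += v[i];
    return s;
}
```

Both functions produce identical results; the unrolled form simply exposes more work per iteration, which is what buys parallelism on the FPGA.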
What About Task Parallelism?

FPGAs can run more than one kernel at a time
– The limit to how many independent kernels can run is the amount of
  resources available to build the kernels

(Representation of the Gzip FPGA example included with the Intel oneAPI
Base Toolkit)
33
So, Can We Build These? NDRange Kernels
36
Anatomy of a Compiler Command Targeting FPGAs

dpcpp <input files> <FPGA-specific link options> <arguments>

The labeled parts: the language/driver (DPCPP = Data Parallel C++), the
input files (source or object), the target platform, the FPGA-specific
link options, and the remaining arguments.
37
Emulation
Get it Functional
38
Emulation

mycode.cpp → Compiler (dpcpp …) → ./mycode.emu → Running …
39
Explicit Selection of Emulation Device
40
Using the Static Optimization Report
Get it Optimized
41
Compiling to Produce an Optimization Report
Two Step Method:
dpcpp -fintelfpga <source_file>.cpp -c -o <file_name>.o
dpcpp -fintelfpga <file_name>.o -fsycl-link -Xshardware
42
FPGA Bitstream Compilation
43
Compile to FPGA Executable with Profiler Enabled
Two Step Method:
dpcpp -fintelfpga <source_file>.cpp -c -o <file_name>.o
dpcpp -fintelfpga <file_name>.o -Xshardware -Xsprofile
The profiler is instrumented into the image; running the executable
collects information you can import into Intel® VTune™ Profiler.
To compile to an FPGA executable without the profiler, leave off -Xsprofile.
44
Compiling FPGA Device Separately and Linking
▪ In the default case, the DPC++ Compiler handles generating the host
executable, device image, and final executable
▪ It is sometimes desirable to compile the host and device separately so
changes in the host code do not trigger a long compile
45
References and Resources
▪ Website hub for using FPGAs with oneAPI
– https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fpga.html
▪ Reports
▪ Loop Optimization
▪ Memory Optimization
▪ Other Optimization Techniques
▪ Lab: Optimizing the Hough Transform Kernel
49
Reports
50
HTML Report
51
Optimization Report – Throughput Analysis
52
Optimization Report – Area Analysis
53
Optimization Report – Graph Viewer
54
Optimization Report – Schedule Viewer
▪ Shows the schedule, in clock cycles, for the different blocks in your code
55
HTML Kernel Memory Viewer
56
Profiler
57
Collecting Profiling Data
▪ Run the executable that has the profiler instrumented into the kernel image
58
Importing Profile Data into Intel® Vtune™ Profiler
59
Loop Optimization
60
Types of Kernels (Review)
▪ For FPGAs, the compiler automatically detects the kind of kernel from the input
▪ Loop analysis will only be done for single work-item kernels
▪ Most loop optimizations will only apply to single work-item kernels
▪ Most optimized FPGA kernels are single work-item kernels
61
Single Work-Item Kernels
62
NDRange Kernels
▪ Kernels launched with the command group handler member function parallel_for() or
  parallel_for_work_group() with an NDRange/work-group size of >1.
▪ Much of this section will not apply to NDRange kernels

queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read>(cgh);
  auto B = B_buf.get_access<access::mode::read>(cgh);
  auto C = C_buf.get_access<access::mode::write>(cgh);
  cgh.parallel_for<class vadd>(range<1>(N), [=](id<1> i) {
    C[i] = A[i] + B[i];   // kernel launch reconstructed for illustration
  });
});
…//application scope
63
Understanding Initiation Interval
64
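The initiation interval (II) is the number of clock cycles between successive loop iterations entering the pipeline. Its effect on total run time can be sketched with a small host-side C++ model (the latency and iteration counts below are made-up illustration values, not from any real compile):

```cpp
#include <cassert>

// Total cycles for a pipelined loop: the first iteration takes the full
// pipeline latency; each later iteration enters ii cycles after the
// previous one.
long long loop_cycles(long long latency, long long ii, long long iters) {
    return latency + ii * (iters - 1);
}
```

With a latency of 100 and 1,000,000 iterations, II=1 gives 1,000,099 cycles while II=4 gives 4,000,096: roughly 4x slower, which is why lowering II dominates loop optimization.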
Loop Pipelining vs Serial Execution

Serial execution is the worst case: one iteration (i0) must complete every
stage (Op 2, Op 3, For End) before a new piece of data (i1) enters the
pipeline.
67
In-Between Scenario

▪ Memory Dependency
  – Kernel cannot retrieve data fast enough from memory

_accumulators[(THETAS*(rho+RHOS))+theta] += increment;

The value must be retrieved from global memory and incremented.
69
What Can You Do? Use Local Memory

Transfer global memory contents to local memory before operating on the data.

Unoptimized:
…
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class unoptimized>([=]() {
    for (unsigned i = 0; i < N; i++)
      A[N-1-i] = A[i];   // every iteration reads and writes global memory
  });
});

Optimized:
…
constexpr int N = 128;
queue.submit([&](handler &cgh) {
  auto A = A_buf.get_access<access::mode::read_write>(cgh);
  cgh.single_task<class optimized>([=]() {
    int B[N];   // on-chip copy
    for (unsigned i = 0; i < N; i++)
      B[i] = A[i];
    for (unsigned i = 0; i < N; i++)
      B[N-1-i] = B[i];
    for (unsigned i = 0; i < N; i++)
      A[i] = B[i];
  });
});
70
What Can You Do? Tell the Compiler About Independence

▪ [[intelfpga::ivdep]]
  – Dependencies ignored for all accesses to memory arrays

[[intelfpga::ivdep]]
for (unsigned i = 1; i < N; i++) {
  A[i] = A[i - X[i]];   // dependency ignored for the A and B arrays
  B[i] = B[i - Y[i]];
}

▪ [[intelfpga::ivdep(array_name)]]
  – Dependency ignored for only array_name accesses

[[intelfpga::ivdep(A)]]
for (unsigned i = 1; i < N; i++) {
  A[i] = A[i - X[i]];   // dependency ignored for the A array
  B[i] = B[i - Y[i]];   // dependency for B still enforced
}
71
Why Else Could This Happen?
▪ Data Dependency
  – Kernel cannot complete a calculation fast enough
72
What Can You Do?
▪ Do a simpler calculation
▪ Pre-calculate some of the operations on the host
▪ Use a simpler type
▪ Use floating point optimizations (discussed later)
▪ Advanced technique: Increase the time (pipeline stages) between the
  start of a calculation and when you use the answer
  – See “Relax Loop-Carried Dependency” in the Optimization Guide for
    more information
73
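The “relax the dependency” idea can be modeled in plain host-side C++ (not DPC++; the data values are made up for illustration): instead of one running accumulator that every iteration depends on, use several partial accumulators and combine them at the end, so each dependence chain only carries every fourth iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One accumulator: each iteration must wait for the previous result.
double sum_chained(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); i++) s += v[i];
    return s;
}

// Four partial accumulators: consecutive iterations feed different
// accumulators, shortening the per-iteration dependence chain and giving
// the pipeline more cycles between dependent operations.
double sum_relaxed(const std::vector<double>& v) {
    double p[4] = {0.0, 0.0, 0.0, 0.0};
    for (std::size_t i = 0; i < v.size(); i++) p[i % 4] += v[i];
    return p[0] + p[1] + p[2] + p[3];
}
```

Note that for floating point the two groupings can differ slightly in the last bits for general data, which is why this is an explicit optimization rather than something the compiler always does for you.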
How Else to Optimize a Loop? Loop Unrolling
74
Fmax
▪ The clock frequency the FPGA will be clocked at depends on what hardware
  your kernel compiles into
▪ More complicated hardware cannot run as fast
▪ The whole kernel will have one clock
▪ The compiler’s heuristic is to sacrifice clock frequency to achieve a lower II
75
How Can You Tell This Is a Problem?
cgh.single_task<example>([=]() {
int res = N;
#pragma unroll 8
for (int i = 0; i < N; i++) {
res += 1;
res ^= i;
}
acc_data[0] = res;
});
76
What Can You Do?
77
Area
The compiler sacrifices area in order to improve loop performance. What if
you would like to save area in some parts of your design?
▪ Give up II for less area
  – Set the II higher than the compiler’s result
    [[intelfpga::ii(n)]]
▪ Give up loop throughput for area
  – The compiler increases loop concurrency to achieve greater throughput
  – Set the max_concurrency value lower than the compiler’s result
    [[intelfpga::max_concurrency(n)]]
78
Memory Optimization
79
Memory Model

▪ Private Memory
  – On-chip memory, unique to a work-item
▪ Local Memory
  – On-chip memory, shared within a workgroup
▪ Global Memory
  – Visible to all workgroups

(In the diagram, global memory contains the workgroups; each workgroup has
a local memory, and each work-item has a private memory. Private and local
memory are the same for single_task kernels.)
80
Understanding Board Memory Resources
Key takeaway: many times the solution for a bottleneck caused by slow
memory access will be to use local memory instead of global
81
Global Memory Access is Slow – What to Do?

We’ve seen this before: slow global access will appear as a memory
dependency problem. Transfer global memory contents to local memory before
operating on the data, as in the unoptimized/optimized example earlier.
82
Local Memory Bottlenecks
83
Local Memory Bottlenecks

Local memory is built from M20K blocks. The kernel pipeline’s access sites
(port 0, port 1, …) reach those blocks through the local memory
interconnect.
84
Double-Pumped Memory Example
array[ind1] = val;
array[ind1+1] = val;
85
Local Memory Replication Example
//kernel scope
…
int array[1024];
int res = 0;
array[ind1] = val;        // ST (store)
#pragma unroll
for (int i = 0; i < 9; i++)
  res += array[ind2+i];   // LD (loads)
calc = res;
…
86
Coalescing
//kernel scope
…
int array[1024];
int res = 0;
#pragma unroll
for (int i = 0; i < 4; i++)
  array[ind1*4 + i] = val;
#pragma unroll
for (int i = 0; i < 4; i++)
  res += array[ind2*4 + i];
calc = res;
…
87
Banking
array[ind1][0] = val1;
array[ind2][1] = val2;
calc = (array[ind2][0] +
        array[ind1][1]);
…
▪ The compiler banks on the lower-order indices by default
▪ Indices used for banking must be a power-of-2 size
88
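Bank selection by low-order index bits can be sketched in plain host-side C++ (a model with made-up parameters, not FPGA code): with a power-of-2 bank count, the low bits of an index pick the bank and the high bits pick the word within it, so accesses whose low bits differ can proceed in parallel.

```cpp
#include <cassert>

// Split an index into (bank, word) for a power-of-2 number of banks,
// mirroring how the compiler banks on low-order index bits by default.
struct BankAddr { unsigned bank; unsigned word; };

BankAddr map_to_bank(unsigned index, unsigned num_banks /* power of 2 */) {
    return BankAddr{ index & (num_banks - 1),   // low bits select the bank
                     index / num_banks };       // high bits select the word
}
```

For example, with 4 banks, indices 5 and 8 land in banks 1 and 0 respectively, so simultaneous accesses to them do not contend.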
Attributes for Local Memory Optimization
Note: Let the compiler try on its own first. It’s very good at inferring an optimal structure!

Attribute          Usage
numbanks           [[intelfpga::numbanks(N)]]
bankwidth          [[intelfpga::bankwidth(N)]]
singlepump         [[intelfpga::singlepump]]
doublepump         [[intelfpga::doublepump]]
max_replicates     [[intelfpga::max_replicates(N)]]
simple_dual_port   [[intelfpga::simple_dual_port]]

Note: This is not a comprehensive list. Consult the Optimization Guide for more.
89
Pipes – Eliminate the Need for Some Memory

Instead of CCPs exchanging data through global memory (read/write), they
connect directly: CCP 1 → Pipe → CCP 2 → Pipe → CCP 3.
90
Task Parallelism By Using Pipes
91
Lab: Optimizing the Hough
Transform Kernel
Other Optimization Techniques
93
Avoid Expensive Functions
94
Inexpensive Functions
▪ Examples
– Binary logic operations such as AND, NAND, OR, NOR, XOR, and XNOR
– Logical operations with one constant argument
– Shift by constant
– Integer multiplication and division by a constant that is a power of 2
– Bit swapping (Endian adjustment)
95
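These operations are cheap because they map to wiring or a handful of logic elements rather than arithmetic blocks. A plain host-side C++ sketch of a few of them (for illustration only; on the FPGA these become hardware, not instructions):

```cpp
#include <cassert>
#include <cstdint>

// Multiplication/division by a power-of-2 constant is just a shift.
uint32_t times8(uint32_t x) { return x << 3; }   // x * 8
uint32_t div16(uint32_t x)  { return x >> 4; }   // x / 16 (unsigned)

// Bit swapping (endian adjustment) is pure rewiring in hardware.
uint32_t bswap32(uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}
```

In the fabric, a shift by a constant consumes no logic at all: the output wires are simply connected to different input wires.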
Use Least-“Expensive” Data Type
96
Floating-Point Optimizations
▪ Floating-point optimizations:
– Tree Balancing
– Reducing Rounding Operations
97
Tree-Balancing
99
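Tree balancing regroups a chain of floating-point additions into a balanced tree, shortening the dependence chain from n-1 sequential adds to about log2(n) levels. A host-side C++ sketch (the values are made up; with general floating-point data the two groupings can differ slightly in the last bits, which is why this is treated as an opt-in optimization):

```cpp
#include <cassert>

// Linear chain: ((a + b) + c) + d -- each add waits on the previous one,
// so the hardware needs 3 sequential adder latencies.
double chain_add(double a, double b, double c, double d) {
    return ((a + b) + c) + d;
}

// Balanced tree: (a + b) + (c + d) -- the two inner adds are independent
// and can run in parallel hardware; the tree is only 2 levels deep.
double tree_add(double a, double b, double c, double d) {
    return (a + b) + (c + d);
}
```

For these small integer-valued inputs both groupings are exact and equal; the latency difference is what the compiler is after.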
References and Resources
▪ Website hub for using FPGAs with oneAPI
– https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/fpga.html
…and others!
102
Legal Disclaimers/Acknowledgements
103
Notices & Disclaimers
This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your
Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata
are available on request. No product or component can be absolutely secure. Intel technologies’ features and benefits depend on system configuration and may require
enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and
MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to
vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product
when combined with other products. For more complete information visit www.intel.com/benchmarks.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL
PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED
WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Copyright ©, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and OpenVINO are trademarks of Intel Corporation or its subsidiaries in the U.S.
and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on
microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not
specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the
specific instruction sets covered by this notice.
Notice revision #20110804
104