IEEE TRANSACTIONS ON COMPUTERS, VOL. 71, NO. 10, OCTOBER 2022

DML: Dynamic Partial Reconfiguration With Scalable Task Scheduling for Multi-Applications on FPGAs

Ashutosh Dhar, Member, IEEE, Edward Richter, Student Member, IEEE, Mang Yu, Wei Zuo, Xiaohao Wang, Nam Sung Kim, Fellow, IEEE, and Deming Chen, Fellow, IEEE

Abstract—For several new applications, FPGA-based computation has shown better latency and energy efficiency than CPU- or GPU-based solutions. We note two clear trends in FPGA-based computing. On the edge, the complexity of applications is increasing, requiring more resources than today's edge FPGAs can provide. In contrast, in the data center, FPGA sizes have increased to the point where multiple applications must be mapped to fully utilize the programmable fabric. While these limitations affect two separate domains, both can be addressed with dynamic partial reconfiguration (DPR). Thus, there is renewed interest in deploying DPR for FPGA-based hardware. In this work, we present Doing More with Less (DML) – a methodology for scheduling heterogeneous tasks across an FPGA's resources in a resource-efficient manner while effectively hiding the latency of DPR. With the help of an integer linear programming (ILP) based scheduler, we demonstrate the mapping of diverse computational workloads in both cloud- and edge-like scenarios. Our novel contributions include: enabling IP-level pipelining and parallelization in our scheduler to exploit the parallelism available within batches of work, and strategies to map and run multiple applications simultaneously. We consider the application of our methodology to real-world benchmarks on both small (a Zedboard) and large (a ZCU106) FPGAs, across different workload batching and multiple-application scenarios. Our evaluation demonstrates the real-world efficacy of our solution: we achieve an average speedup of 5X and up to 7.65X on a ZCU106 over a bulk-batching baseline via our scheduling strategies. We also demonstrate the scalability of our scheduler by simultaneously mapping multiple applications to a single FPGA, and explore different approaches to sharing FPGA resources between applications.

Index Terms—Partial reconfiguration, integer linear programming, scheduling, FPGA, dynamic reconfiguration

The authors are with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA. E-mail: {adhar2, edwardr2, mangyu2, weizuo, xwang165, nskim, dchen}@illinois.edu. (Corresponding author: Edward Richter.)
Manuscript received 31 May 2021; revised 7 Dec. 2021; accepted 14 Dec. 2021. Date of publication 23 Dec. 2021; date of current version 8 Sept. 2022.
This work was supported in part by the IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR) - a research collaboration as part of the IBM AI Horizons Network, in part by the Xilinx Center of Excellence and the Xilinx Adaptive Compute Clusters (XACC) program at the University of Illinois Urbana-Champaign, and in part by the NSF CNS under Grant 17-05047.
Recommended for acceptance by J. Hormigo.
Digital Object Identifier no. 10.1109/TC.2021.3137785

1 INTRODUCTION

FPGAs have proven to be capable of delivering energy-efficient and high-performance accelerators from edge devices to data centers. However, we have observed several challenges toward widespread adoption of FPGAs. First, the complexity of workloads continues to grow rapidly, creating a demand for more FPGA resources. Second, diverse applications may need to be mapped to the same FPGA. Third, systems may require sharing the FPGA's resources between multiple applications or users. Hence, we need a flexible, high-performance, and portable solution to enable simultaneous mapping of large and complex workloads on FPGAs. This requirement spans edge and cloud systems, each of which has unique constraints.

Dynamic partial reconfiguration (DPR) presents itself as a strong solution by exploiting the ability to dynamically reconfigure portions of the FPGA to map workloads in fractions at a time. DPR provides the ability to configure portions of the programmable fabric while other sections continue running. Effectively leveraging DPR, however, is fraught with challenges. First, the latency of DPR can be large, on the order of several milliseconds, adding significant latency. Second, the design of DPR-based FPGA designs is time consuming and difficult, with the designer having to perform manual placement of reconfiguration regions. This makes scaling designs across different FPGAs difficult. Third, DPR designs have statically designed and placed partitions. This makes mapping several workloads to a single large FPGA with DPR incredibly laborious. Finally, accelerators often work on batches of data in cloud and edge systems to increase efficiency. Modern DPR solutions must be cognizant of this to exploit all available parallelism and minimize any DPR overheads. Thus, leveraging DPR requires finding a solution to two problems – scheduling and mapping. Scheduling must decide (1) when to trigger partial reconfiguration and deploy a portion of the workload to the FPGA, and (2) in what order each fraction of the workload should be deployed and executed. The mapping solution decides in what region of the FPGA each fraction of the workload should be placed.

In this work, we present an end-to-end solution that considers all sizes of FPGAs, and can perform simultaneous mapping of multiple applications, and parallel pipelines, to the same device. We call our solution Doing More with Less (DML). Our work tackles the scalability challenge of traditional DPR solutions and leverages integer linear programming (ILP) based schedulers for optimal simultaneous scheduling and mapping. We combine our offline static ILP-based scheduler and mapper with an online dynamic runtime in hardware that executes the global DPR order and mapping solution, combining it with additional dynamic information available at runtime. The scheduler embraces workload batching by enabling task pipelining, and unrolls batches to run parallel pipelines. We also present an architecture that partitions the FPGA resources into uniform pieces (called Slots), which can be dynamically reconfigured as bespoke accelerator partitions (called IPs). This abstracts accelerator designers away from the limitations of DPR and physical design, enables simplified mapping of multiple workloads to a single FPGA simultaneously, and allows designs to be portable across cloud and edge FPGAs. We further enhance our work by extending our ILP solution to tackle the scheduling and mapping of multiple applications and very large applications across FPGAs of all sizes. We also explore different mapping strategies for sharing FPGA resources between multiple applications via our scheduler. DML uses high-quality schedules and mappings from the static ILP scheduler, minimizes design-time effort, and leverages dynamic runtime information, thus providing a balance between usability, scalability, and performance.

We summarize our contributions as follows:

1) We present DML, an end-to-end DPR methodology. Our solution is comprised of a scalable architecture, based on the constraints of partial reconfiguration, and a novel static ILP-based scheduler and mapper. Our scheduler is capable of pipelining and parallelizing across batch elements, demonstrates significant performance gains over naive bulk scheduling, and works in concert with a dynamic runtime execution engine.
2) We provide in-depth design space exploration, and explore multiple batching and partitioning schemes. We also evaluate different strategies for simultaneously mapping and scheduling multiple applications to an FPGA.
3) Our solution scales from small to large FPGAs, and we demonstrate the simultaneous mapping of ten applications to a single FPGA via our scheduler.
4) We validate our framework by evaluating real-world benchmarks in hardware, across large and small FPGAs. We demonstrate an average speedup of 5X and up to 7.65X on a ZCU106 via our novel scheduling strategies over naive bulk scheduling.

The rest of this paper is organized as follows. In Section 2 we discuss related and prior work on DPR, followed by an overview of our methodology in Section 3. We then discuss our scheduler, ILP formulation, and mapping strategies for large graphs and multiple applications in Section 4, provide a detailed evaluation in Section 5, and finally conclude in Section 6.
2 RELATED WORK

Using DPR to map large workloads has been explored previously; however, these attempts have focused on a particular application or domain of applications [1], [2]. The use of DPR and task-based scheduling has been explored as well [3], [4], [5], [6], [7], [8], including ILP-based solutions [5], [6], [7], [8]. However, these approaches either focus on (1) a specific optimization or application, or (2) improving the speed and performance of ILP-based scheduling via heuristics. In contrast to our work, they do not consider the constraints and requirements of real-world applications, such as the need for pipelining and parallelizing batched workloads, sharing FPGAs between multiple applications, and portability from one FPGA to another.

Prior ILP-based approaches [6], [7], [8] have attempted to mask the latency of reconfiguration by prefetching configurations, similar to our work. However, in these works, the authors consider a 2D reconfiguration problem, wherein the FPGA is divided into uniform rows and columns, and tasks can require different amounts of resources. Thus, each workload results in a new floorplan, with each task of the workload occupying a different amount of resources. In contrast to our problem formulation, this approach is based on older Xilinx architectures (Virtex 4 and 5), and is not scalable, as it does not consider devices of different sizes or portability, and does not simplify or speed up DPR-based designs for modern computational workloads. Our solution considers and improves ease of use, portability, performance, and flexibility.

In Deiana et al. [5], the authors attempt to perform scheduling and mapping, but the complexity of their ILP limits its scalability, and they demonstrate the use of their iterative solution to schedule only up to five tasks at a time, limiting the efficiency of the solution. A two-stage process was proposed by Purgato et al. [4], wherein an additional step is needed to find a feasible floorplan solution. This non-ILP approach cannot guarantee optimality. In contrast, our approach standardizes the search space by using uniform-sized DPR partitions (slots), and simultaneously provides a schedule and mapping solution. Our formulation simplifies the ILP problem, helping find optimal solutions for larger graphs, and scheduling and mapping more tasks at a time.

Multiple recent works have presented frameworks centered around DPR [9], [10]. ReconOS [9] is a framework which extends multithreaded programming abstractions to reconfigurable hardware utilizing DPR. ARTICo [10] is a DPR framework which provides an automatic toolflow to generate partial bitstreams and a runtime to execute on a number of fixed-sized slots. DML differentiates itself from these works, as neither work uses static information to generate a high-performance schedule and mapping to combine with their dynamic runtime. Also, neither framework automatically enables optimizations such as pipelining and parallelism.

Finally, many tools have been proposed to automatically perform stages of the DPR pipeline, such as floorplanning [11], [12] and application partitioning [11]. DARTS [11] is a framework which utilizes a mixed ILP formulation to automatically generate a floorplan, partition applications, and generate schedules for real-time applications. In contrast to DML, whose objective is to minimize the end-to-end latency of a given application or group of applications, DARTS' goal is timing predictability, and it works towards a solution for a user-provided time constraint. Furthermore, DML splits up hardware applications into IPs, which are independently mapped and scheduled, with the precedence constraints of the task graph represented by the constraints of the ILP. DML can then leverage automatic optimizations such as pipelining and parallelism across IPs.

In comparison to prior attempts, our scheduler distinguishes itself with four novel features: (1) the ability to pipeline, (2) unrolling and parallelizing across batch elements, (3) a graph partitioning optimization to enable the mapping of very large graphs, and (4) simultaneous scheduling and mapping. In addition, in this work we do not rely on synthetic graphs. Rather, we consider multiple real-world applications from the Rosetta [13] benchmark suite and validate our framework by testing on real hardware. Finally, we demonstrate the simultaneous scheduling and mapping of multiple applications on a single FPGA and present insights into what strategies work best for such scenarios.
3 DOING MORE WITH LESS

We now present our Doing More with Less (DML) framework – an end-to-end and generic methodology that enables any workload, or multiple workloads, to be efficiently mapped to FPGAs of all sizes, with DPR. Our solution is comprised of two key parts. First, we partition the FPGA into uniform pieces that we call slots, and provide a scalable architecture, as shown in Fig. 1. Second, we propose an ILP-based optimizer that schedules and maps work into slots, while amortizing the latency of reconfiguration by overlapping computation with reconfiguration. Our scheduler is capable of pipelining and parallelizing applications by leveraging the data parallelism available across elements in batches of work, and uses a graph partitioning strategy to map very large task graphs. Finally, our flexible architecture and scheduler enable us to simultaneously schedule and map multiple applications on an FPGA.

Leveraging dynamic partial reconfiguration (DPR) requires manual floorplanning to carve out and designate specific regions as static or dynamically reconfigurable. Thus, the designer must decide where to physically place the accelerator. In addition, there are several architectural constraints and design rules that must be considered. A key limiting factor is the speed of DPR, which is determined by the bandwidth available in the Configuration Access Port (CAP) and the size of the partial bitstream. The CAP bandwidth is architecture specific and may not be changed; however, the size of the partial bitstream is determined by the size of the dynamically configurable region (DPR region), and not by how many resources within the DPR region are in use. Note that while we focus on Xilinx Zynq and Xilinx Zynq Ultrascale+ [14], [15] series FPGA-SoCs in this work, our solution is not limited to Xilinx devices.

Fig. 1. Proposed system architecture.

Fig. 1 presents the system architecture we have designed in this work to help overcome the challenges in leveraging DPR. We designate each slot as a resource partition that includes a reconfigurable partition and a fixed interface. All slots are uniform in their resources, and since DPR requires slot interfaces to be uniform, we use AXI-based buses to create their interfaces. The static region of the FPGA hosts the global AXI interconnect, to which the slots connect.

To map an application to our architecture, we partition it at a task level and represent it as a task graph. The task graph is a directed acyclic graph (DAG), G(V, E), such that each vertex, v_i ∈ V, is a task, and each edge, e_ij ∈ E, represents a dependency between tasks such that v_i must complete before v_j can begin execution. We then create IPs for each task, and assign the latency of the IP as the weight of the vertex in the task graph. This task graph model is illustrated in Fig. 3a, where each vertex represents a task of the application that has its own IP. Edge weights may be used to represent communication latency. IPs may be designed in any fashion and allow users to deliver fine-grained customization on a per-task level. Alternatively, users may choose to group several tasks or kernels into a single large task and IP. Once the application has been represented as a task graph, it must be scheduled and mapped to slots on the FPGA, which we discuss in the next section. A key advantage of our approach is that by grouping together the task graphs of multiple applications, we can create a single task graph. Thus, we can simultaneously schedule multiple applications on a single FPGA, without changes to the applications, floorplan, or scheduler. Note that while DAGs by definition cannot have cycles, DML can address statically resolvable cyclical patterns in the task graph by either unrolling the cycles or absorbing the cycles into a single node.
The size, shape, and location of the slots are determined based on the DPR constraints of the FPGA, and their height spans the entire clock region. This eliminates several of the constraints involved in the 2D reconfiguration described in earlier works [6], [7], [8]. Slot sizes can be set to ensure that the application's IPs are able to fit into them, if the IP library already exists. Note that the size of a slot determines the latency of DPR and limits the performance of an IP that can be mapped into it. Should an application's task require an IP that is too large for a slot, then we must either split the IP into smaller partitions that map to more than one slot, or we must scale back the IP's performance and reduce its resource requirements. Finally, all slots communicate via AXI in the global address space, i.e., DRAM. Hence, we set the number of slots per FPGA such that the total required DRAM bandwidth does not cause bus contention.

The use of uniform slots is a compromise we make to speed up design time, scale across all devices, and map multiple applications to the same device simultaneously. In cloud-like scenarios, where multiple applications may need to be mapped, rapid deployment and reduced physical design effort are very valuable. Thus, our architecture provides two key advantages. First, by employing fixed reconfiguration partitions, the designer does not need to perform floorplanning for each application, can design accelerators with defined IO constraints, and can be assured that they will scale across devices. Second, the fixed slot sizes simplify the scheduling and mapping constraints, helping to deliver a scalable and deterministic design with uniform DPR latency. The use of fixed slot sizes may not guarantee the best utilization of the fabric, and requires some thought and planning by the IP designer, as is the case in any IP design effort. However, we believe that the aforementioned benefits far outweigh the utilization benefits of using variable slot sizes. In addition, a different slot size can be selected to better suit the application(s).
Fig. 2. Doing More with Less methodology and flow.

Fig. 3. (a) Example task graph. Each vertex in this graph represents a single task, and the directed edges between vertices represent task dependencies. An IP is created for each vertex, and the weight of each vertex is the latency of its corresponding IP. (b) Sample cut solutions for a large task graph. Each cut has a max size of seven vertices.

Fig. 2 presents the overall flow and framework of our solution. For a given FPGA, we have an overlay architecture that determines the number of slots and the DPR latency, and for a given application that needs to be mapped to the FPGA, we have a task graph comprised of kernels. Note that this is not a computational overlay, such as a CGRA or a systolic array. Our architecture is flexible and allows us to deliver application- and task-specific specialization with high performance. We begin with kernels in the task graph and generate IPs for them via high-level synthesis (HLS). We then use the reported latency of the IPs, the architectural parameters, the chosen level of scheduler optimization (pipelining and parallelism factor), and the task graph as inputs to our static ILP-based scheduler. Note that DML is not suitable for applications with IPs whose latency cannot be estimated prior to runtime. The scheduler then delivers a mapping solution and a DPR and IP execution schedule which can be executed on the hardware by the dynamic runtime. The mapping solution and the DPR and IP execution schedules are represented by three components: (1) the global DPR order, which is a list of IPs in the order in which they are to be reconfigured onto the slots; (2) the IP slot mappings, which map each IP to the physical slot on the device it will run on; and (3) the dependencies, where each IP has a list of dependencies extracted from the task graph and used by the runtime. We discuss the operation of this runtime in the next section.
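To illustrate, a schedule for the toy graph from our earlier sketch could be rendered as the following three structures. In the actual flow these are emitted as global array definitions in a C header consumed by the on-board runtime (see Section 5.1); the Python rendering and all values here are hypothetical.

# Illustrative shapes of the three schedule components described above.
global_dpr_order = ["load", "filter", "fft", "store"]   # order of partial reconfigurations

ip_slot_map = {                                          # physical slot each IP targets
    "load": 0, "filter": 1, "fft": 2, "store": 0,
}

dependencies = {                                         # per-IP dependency lists from the DAG
    "load": [], "filter": ["load"], "fft": ["load"], "store": ["filter", "fft"],
}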
In parallel, we use an automated version of the process described in [16] to generate partial bitstreams, which we call the Bitstream Generator. We use synthesized design checkpoints of the IPs and our custom overlay floorplan to generate partial bitstreams. We feed the partial bitstreams to a runtime that executes the applications based on the provided mapping and schedule. This runtime is implemented in software and runs on the processing system (PS) of the SoC-FPGA. Finally, note that without a slot mapping solution from our scheduler, users would need to design and generate bitstreams for every possible IP-slot pair to enable a dynamic scheduler to map any IP to any slot at runtime. This adds significant design effort and time overheads.

4 ILP BASED SCHEDULING

At the heart of the scheduler is our ILP formulation. Our goal is to find a high-quality solution while minimizing the traditional time costs of ILP-based solutions. Our ILP solution performs simultaneous scheduling and mapping, can provide an optimal solution on reasonable graph sizes, and takes into account our scalable architecture, which helps loosen the ILP constraints. Crucially, we consider real-world deployment constraints, and include the ability to pipeline and parallelize tasks across batches. While our slot-based architecture helps simplify the ILP formulation, finding a solution for the ILP can be slow and does not scale well to large graphs. Hence, we use heuristic schedulers to help tune the ILP solver's search space, and explore different partitioning strategies to find scheduling solutions when trying to map large graphs. We also extend our framework to support two different solutions for mapping multiple applications to a single FPGA.

4.1 ILP Formulation

The input to the ILP is an application task graph, IP latencies, the DPR latency, and resource constraints. Our ILP simultaneously looks for a schedule, which provides the global DPR order, and a mapping solution. We formally describe the ILP formulation as follows:

Given: (1) A task graph G(V, E) as described in Section 3; (2) A set of scheduling constraints, C_s; and (3) A set of resource constraints, C_r. The scheduling constraints include dependencies inferred from the graph, the latency of nodes in the graph, and the DPR latency. In addition, we must ensure only one partial reconfiguration is done at a time. The resource constraints are the number of available slots, as provided by the user. Additionally, the user may select optimizations, such as pipelining or parallelization, to be included. We will discuss these later in this section.

Goal: Minimize the latency of the entire task graph, such that each task's IP(s) are allocated a slot and experience the latency of reconfiguration prior to executing in the slot, and such that all constraints in C_s and C_r are satisfied.
ILP Variables: We will now define the variables that we will solve for to find our solution. We define the set V as all the vertices in the given graph. The sets L and L_pr contain the execution latency of each node in V and the latency of reconfiguration, respectively. Then, we define the variables S and S_pr as timestamps, where S ∈ ℤ are the start times of all IPs in the set V, and S_pr ∈ ℤ are the start times of the corresponding partial reconfigurations of each node in V. Next, we describe our resource mapping variables. Let the number of available slots be R_s. Then, we define a binary variable M_ik such that M_ik = 1 if the i-th IP, v_i, maps to the k-th slot, where v_i ∈ V and k ∈ R_s. Next, since any IP can be mapped to any slot, provided the slot is not occupied, we must express the resource sharing between IPs. We define the binary variable Y_ijk such that Y_ijk = 1 if the i-th and j-th IPs map to the k-th slot, where v_i ∈ V, v_j ∈ V, and k ∈ R_s. Finally, the variables B1_ijk, B2_ijk, and B3_ijk are Boolean decision variables that we use to help encode our overlap constraints. Their solution is determined by the ILP solver. We also add C_1, C_2, and C_3 as large enough constants, and discuss how to set them later in this section. Next, we describe our system of equations that formulate the constraints of our problem.

Legality Constraints: We encode the fundamental constraints of the system by defining the solution space. We must enforce bounds on start times, ensure that an IP maps to only one slot, and only allow IPs to share a slot one at a time. We begin by enforcing that the start times of all operations must be positive:

$S_i > 0, \quad \forall i \in V$  (1)
$S_{pr_i} > 0, \quad \forall i \in V$  (2)

Then, we enforce that an IP can only map to one slot, by using the resource mapping binary variable M_ik and ensuring only one slot mapping is set per IP. Thus we have

$\sum_k M_{ik} = 1, \quad \forall i \in V$  (3)

Finally, we need to define the resource mapping binary variable Y_ijk, which tracks if two IPs use the same slot. Y_ijk is key to our ability to constrain and allow IPs and DPR to overlap. Hence, we define Y_ijk such that Y_ijk = 1 if and only if M_ik = 1 and M_jk = 1, ∀k ∈ R_s and ∀i, j ∈ V. We can express this as

$Y_{ijk} \geq M_{ik} + M_{jk} - 1$  (4)
$Y_{ijk} \leq M_{ik}$  (5)
$Y_{ijk} \leq M_{jk}$  (6)

Latency and Dependency Constraints: Next, we define our latency and edge dependency constraints. If there is an edge e_ij ∈ E from the i-th IP to the j-th IP, then the start time of the j-th IP must be greater than the sum of the start time of the i-th IP and its latency. Also, an IP can only start once its reconfiguration is complete. Thus, the start time of the i-th IP must be greater than the sum of the start time of the i-th IP's DPR and the latency of reconfiguration. Hence, we add

$S_j \geq S_i + L_i, \quad \forall e_{i,j} \in E$  (7)
$S_i \geq S_{pr_i} + L_{pr_i}, \quad \forall i \in V$  (8)

Overlap Constraints: Next, we define our overlap constraints, which ensure that DPR doesn't begin before the previous IP in the slot has completed, that only one DPR can be performed at a time, and that IPs mapped to the same slot do not try to overlap their execution. Note that in (9) to (14), we use the variables B1_ijk, B2_ijk, and B3_ijk as a tool to help express an either/or inequality in a way that is amenable to ILP, and C_1, C_2, and C_3 are large enough constants that we set. We further explain and provide insight into these variables later in the section.

Our first constraint is added to enable DPR to overlap with computation. If the i-th and j-th IPs map to the same slot, and DPR of the i-th IP takes place before the j-th IP, then the start time of the j-th IP must be greater than the end time of the i-th IP. Thus, for all pairs of i and j, i ∈ V, j ∈ V:

$S_{pr_i} - S_j + C_1 \cdot B1_{ijk} \geq L_j \cdot Y_{ijk}, \quad \forall k \in R_s$  (9)
$S_{pr_j} - S_i + C_1 \cdot (1 - B1_{ijk}) \geq L_i \cdot Y_{ijk}, \quad \forall k \in R_s$  (10)

We must also ensure that if two IPs are mapped to the same slot, they cannot overlap their execution. Thus, if the i-th and j-th IPs map to the same slot, the start time of the i-th IP must be greater than the end time of the j-th IP, or vice versa. Hence we have

$S_i - S_j + C_2 \cdot B2_{ijk} \geq L_j \cdot Y_{ijk}, \quad \forall k \in R_s$  (11)
$S_j - S_i + C_2 \cdot (1 - B2_{ijk}) \geq L_i \cdot Y_{ijk}, \quad \forall k \in R_s$  (12)

Finally, we must ensure that DPRs cannot overlap, since only one DPR can occur at a time. Thus, for all pairs of i-th and j-th IPs, the DPR start time of the i-th IP must be greater than the DPR end time of the j-th IP, or vice versa. We add the constraints

$S_{pr_i} - S_{pr_j} + C_3 \cdot B3_{ijk} \geq L_{pr_j}, \quad \forall k \in R_s, \ \forall i, j \in V, \ i \neq j$  (13)
$S_{pr_j} - S_{pr_i} + C_3 \cdot (1 - B3_{ijk}) \geq L_{pr_i}, \quad \forall k \in R_s, \ \forall i, j \in V, \ i \neq j$  (14)

As we mentioned earlier, in (9) to (14) we use the variables B1_ijk, B2_ijk, and B3_ijk as a tool to help express an either/or inequality in a way that is amenable to ILP. These binary variables encode precedence relationships between the configuration and execution of IPs i and j. For example, in (9), B1_ijk = 1 encodes that IP i is configured before IP j is run, while B1_ijk = 0 encodes that IP j is configured before IP i is run. Similarly, B2_ijk encodes the precedence relationship between the execution times of IPs i and j, while B3_ijk encodes the precedence relationship between the configuration times of IPs i and j. The ILP solver finds a solution for these variables in conjunction with all the other latency, legality, and overlap constraints.
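A condensed sketch of how this formulation maps onto CVXPY (the ILP frontend Section 5.1 says the scheduler uses) follows. For brevity it encodes only the legality, dependency, and execution-overlap constraints, i.e., Eqs. (1)-(8) and (11)-(12); the DPR-overlap pairs (9)-(10) and (13)-(14) follow the same big-M pattern. The toy instance, the constant choice, and the direct sink constraint (equivalent to the sink node introduced below) are our own simplifications, and solving requires a Gurobi installation.

# Sketch of the scheduling/mapping MILP in CVXPY. Not the paper's code;
# a minimal model of Eqs. (1)-(8) and (11)-(12) on a made-up 3-IP graph.
import cvxpy as cp

lat = {0: 40, 1: 55, 2: 30}            # L_i: IP execution latencies
lpr = 10                               # L_pr: uniform DPR latency
edges = [(0, 2), (1, 2)]               # task-graph dependencies e_ij
V, K = list(lat), 2                    # vertices and number of slots R_s
C = 10_000                             # big-M constant (bounded via a list scheduler)

S   = {i: cp.Variable(integer=True) for i in V}   # IP start times
Spr = {i: cp.Variable(integer=True) for i in V}   # DPR start times
M   = {(i, k): cp.Variable(boolean=True) for i in V for k in range(K)}

cons = []
for i in V:
    cons += [S[i] >= 1, Spr[i] >= 1]                     # (1)-(2)
    cons += [sum(M[i, k] for k in range(K)) == 1]        # (3): one slot per IP
    cons += [S[i] >= Spr[i] + lpr]                       # (8): run after own DPR
for i, j in edges:
    cons += [S[j] >= S[i] + lat[i]]                      # (7): dependencies

# Pairwise no-overlap of executions sharing a slot, via Y and B2 ((4)-(6), (11)-(12)).
for i in V:
    for j in V:
        if i >= j:
            continue
        for k in range(K):
            Y  = cp.Variable(boolean=True)   # 1 iff both i and j map to slot k
            B2 = cp.Variable(boolean=True)   # which of the two executes first
            cons += [Y >= M[i, k] + M[j, k] - 1, Y <= M[i, k], Y <= M[j, k]]
            cons += [S[i] - S[j] + C * B2       >= lat[j] * Y,
                     S[j] - S[i] + C * (1 - B2) >= lat[i] * Y]

sink = cp.Variable(integer=True)                 # sink after every IP finishes
cons += [sink >= S[i] + lat[i] for i in V]
prob = cp.Problem(cp.Minimize(sink), cons)       # (15)
prob.solve(solver=cp.GUROBI)                     # MILP backend per Section 5.1
print({i: int(S[i].value) for i in V})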
Fig. 4. Graphical depictions of three different schedules using the task graph in Fig. 3a with varying IP and DPR latencies.

Objective Function: In order to create our objective function, we create a sink node, V_sink, such that ∀v ∈ V there exists an edge between V_sink and v, and the start time of the sink node is S_sink. Thus, our objective function is to minimize the start time of the sink:

$\min(S_{sink})$  (15)

Leveraging Heuristics: To help express the conditional constraints in Eqs. (9) to (14), we introduced C_1, C_2, and C_3 as large enough constants, and introduced B1_ijk, B2_ijk, and B3_ijk as Boolean decision variables whose values are determined by the ILP solver. The constants help define the bounds of the solution space, and the value of each constant must be very close to the upper bound of its variable. Choosing too small a value might result in an infeasible problem. To help find the bound, we use a heuristic list scheduler to provide a fast solution. Note that the list scheduler does not consider mapping solutions, is not optimal, and does not consider all overlap constraints. So, we apply a safety margin to the result of the list scheduler to determine the bound.
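A minimal sketch of such a bound-finding list scheduler is shown below, under our own simplifying assumptions: it serializes DPR through a single configuration port, greedily packs ready tasks onto the earliest-free slot, and inflates the resulting makespan by a safety margin. Since the heuristic schedule is feasible, its makespan upper-bounds every start time in the optimal schedule, making it a safe value for C_1, C_2, and C_3. Helper names and the priority rule are illustrative, not the paper's.

# Greedy list scheduler used only to bound the big-M constants (a sketch).
def list_schedule_bound(lat, lpr, deps, num_slots, margin=1.25):
    done, finish = set(), {}
    slot_free = [0] * num_slots
    dpr_free = 0                      # single configuration port
    remaining = set(lat)
    while remaining:
        ready = [v for v in remaining if all(d in done for d in deps.get(v, []))]
        v = min(ready, key=lambda x: lat[x])        # simple priority: shortest first
        k = min(range(num_slots), key=lambda s: slot_free[s])
        dep_done = max([finish[d] for d in deps.get(v, [])], default=0)
        spr = max(dpr_free, slot_free[k])           # DPR waits for port and slot
        start = max(spr + lpr, dep_done)            # run after own DPR and deps
        finish[v] = start + lat[v]
        dpr_free, slot_free[k] = spr + lpr, finish[v]
        done.add(v); remaining.discard(v)
    return int(max(finish.values()) * margin)       # safety margin on the bound

deps = {"filter": ["load"], "fft": ["load"], "store": ["filter", "fft"]}
lat = {"load": 10, "filter": 40, "fft": 55, "store": 8}
print(list_schedule_bound(lat, lpr=12, deps=deps, num_slots=2))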
Illustrative Example: To help illustrate the operation of the scheduler, we consider the task graph shown in Fig. 3a and present the resulting mappings and schedules from our scheduler in three different scenarios with different IP and DPR latencies in Fig. 4. For brevity, we restrict our example to four slots, and consider scenarios where: (1) DPR latency dominates execution time (Fig. 4a), (2) DPR latency is easily masked by IP execution latencies (Fig. 4b), and (3) DPR and IP latencies have varied ratios (Fig. 4c). For consistency, we use the same notation as presented in Section 4.1.

Fig. 4 presents the solutions generated by our ILP formulation. The generated solutions illustrate that our static ILP-based scheduler will always try to minimize the total execution time. Note that the ILP-based scheduler can find optimal solutions for reasonable problem sizes. However, given that our target objective is to find the best-performing order and mapping, it is possible that multiple order-mapping solutions provide the best performance. Thus, the delivered solution may not be intuitive. For example, in Fig. 4a, we can see that the scheduler has opted not to use all four available slots, as it can achieve the minimum latency with just three. This is because IP m is dependent on IPs j, k, and l, and thus i and k form the critical path such that their execution latencies are easily able to mask DPR latency. Moving IP l or IP m into slot 0 would not have improved performance, but it would be another solution. Meanwhile, in Fig. 4b, we consider a situation where IP j is on the critical path, but the latency of DPR and the remaining IPs remain the same as in Fig. 4a. Here we can see Eqs. (11)-(12) in action: they allow IPs i, j, and l to overlap their execution, while Eqs. (9)-(10) help allow i, k, l, and m to overlap their DPR with IP executions. Note that all four slots are used in this example. Finally, in Fig. 4c, we consider a situation where DPR latencies dominate, and thus Eqs. (13)-(14) are key in enforcing that DPRs do not overlap, while Eqs. (9)-(10) improve performance by allowing DPR and IP execution to overlap.

4.2 Additional Support for Computational Workloads

Batching is commonly used in computational workloads, since each kernel needs to be run multiple times for a variety of inputs. It can also help amortize the cost of reconfiguration in some cases. We initially consider batching from three approaches: (1) Bulk batching, wherein a single instance of the IP is re-used for each entry in the batch. For a batch of size N, we scale the latency of each IP by a factor of N, thereby serializing the batch but increasing the time each IP must remain configured on the device before it can be swapped for another. (2) Parallel batching, wherein multiple instances can run batch entries in parallel. We replicate the graph N times, thereby creating N parallel instances, which can potentially lead to better slot utilization but blows up the size of the ILP problem, making it harder to find a solution. Note that both of these approaches are possible with our described ILP formulation. (3) Pipelining across batches, wherein, like bulk batching, each IP's latency is scaled by a factor of N; however, we allow dependent IPs to overlap their execution, since the dependencies exist within a batch entry only. Thus, each pipeline stage is an IP, operating on a separate entry in the batch. In order to do so, we extend our ILP formulation. If there is an edge e_ij ∈ E from the i-th IP to the j-th IP, then the start time of the j-th IP must be greater than the sum of the start time of the i-th IP and the latency of one entry in the batch. Also, the last batch entry of the j-th IP must begin only after the entire batch of the i-th IP has completed. Thus we have

$S_j \geq S_i + L_i / N, \quad \forall i, j \in V$  (16)
$S_j + (N - 1) \cdot L_j / N \geq S_i + L_i, \quad \forall i, j \in V$  (17)

Note that we assume that the latency of the i-th IP, L_i, has already been scaled by a factor of N. However, should the latency of the j-th IP be much smaller than that of the i-th IP, we must ensure that it still completes after the i-th IP and respects the dependency:

$S_j + L_j \geq S_i + L_i, \quad \forall i, j \in V$  (18)
Fig. 5. DML ILP scheduler and graph partitioning flow.

Fig. 6. Illustration of dependent versus independent scheduling of multiple applications.

4.3 Large Graphs and Multiple Applications

ILP-based solvers can be slow and do not scale to large graphs. This can be challenging when we attempt to map very large applications, leverage parallelism (as it duplicates the graph), or map multiple applications onto a single FPGA. In this section we discuss our approach to dealing with very large task graphs and multiple applications.

Handling Large Graphs: While ILP solvers are slow, it is possible to find a non-optimal solution for the entire task graph, or to find optimal solutions for subgraphs, within reasonable time. Thus, DML uses graph partitioning to schedule large graphs. This is applicable for applications with very large task graphs, when leveraging parallel batching for a single application, or when trying to map multiple applications to a single FPGA. We illustrate our overall ILP-based scheduler flow in Fig. 5. We begin by setting upper bounds on the size of the task graph and the ILP solving time. If the task graph is too big, or a solution cannot be found within the time period, we partition the graph.

DML partitions the task graph into smaller subgraphs, called cuts, performs ILP-based scheduling of each cut, and then sequentially concatenates the schedules and mappings of each cut to create the final global schedule and mapping. As shown in Fig. 5, the graph partitioning begins by performing a topological sort of the graph, which sorts vertices into levels, such that vertices in a level are tasks/IPs that have no dependence on each other but are dependent on vertices in higher levels. We then group vertices, level by level, into cuts of fixed sizes. When selecting vertices to be placed into a cut, DML uses two different strategies: (1) sorted cuts, and (2) fair cuts. The sorted cut strategy considers scenarios where we are trying to schedule and map multiple applications to a single FPGA at the same time, where the task graph might be very large and have tasks/IPs with very different execution latencies. Grouping vertices into a cut without considering the different execution times may result in a final schedule with large bubbles in the pipeline. Thus, in the sorted cut strategy, DML will first sort the vertices in a level by their latencies. Then, when cuts are formed, we are less likely to have vertices with vastly different execution times, thereby improving FPGA utilization. In contrast, the fair cut strategy considers the case of parallel batching, where we could have a very large graph such that the latencies of nodes in each level are identical to each other. In this case, we can schedule all nodes without bias. Having formed the cuts, we then schedule and execute each cut sequentially. This scheduling algorithm is illustrated in the flow chart in Fig. 5, while Fig. 3b illustrates cuts and levels in a large task graph.
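The sketch below illustrates this partitioning step, taking the latency map and edge set from our earlier task-graph sketch: vertices are first sorted into topological levels and then packed into fixed-size cuts with either the sorted or the fair strategy. The function names and details are ours, not the paper's.

# Level-based partitioning into fixed-size cuts (sorted vs. fair strategy).
from collections import defaultdict

def levelize(lat, edges):
    """Topologically sort vertices into levels of mutually independent tasks."""
    preds = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
    level, placed = {}, set()
    while len(placed) < len(lat):
        frontier = [v for v in lat if v not in placed and preds[v] <= placed]
        for v in frontier:
            level[v] = max([level[p] + 1 for p in preds[v]], default=0)
        placed.update(frontier)
    groups = defaultdict(list)
    for v, l in level.items():
        groups[l].append(v)
    return [groups[l] for l in sorted(groups)]

def make_cuts(lat, edges, cut_size, strategy="fair"):
    cuts, current = [], []
    for lvl in levelize(lat, edges):
        if strategy == "sorted":           # group similar latencies together
            lvl = sorted(lvl, key=lambda v: lat[v])
        for v in lvl:                      # "fair": take vertices as they come
            current.append(v)
            if len(current) == cut_size:
                cuts.append(current); current = []
    if current:
        cuts.append(current)
    return cuts   # each cut is ILP-scheduled, then the schedules are concatenated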
Strategies for Multiple Applications: Without DPR, running multiple applications on an FPGA would require time multiplexing the applications over the entire FPGA, which would be slow and would limit the ability to exploit the available fine-grained customization and parallelism. We call this approach Coarse-Grained Scheduling. As we discussed in Section 3, our flexible architecture and scheduler enable us to leverage DPR to share an FPGA across multiple applications simultaneously. We combine the task graphs of all applications into a single monolithic graph, and our scheduler finds a solution for all applications simultaneously, as we discussed in Section 4.3. We refer to this method as dependent scheduling, and illustrate it in Fig. 6. Note that any IP of any application is free to be mapped to any slot of the FPGA. This method can efficiently use all FPGA resources and infrastructure, and can attempt to provide the optimal end-to-end latency for the monolithic DAG. However, there are a few potential disadvantages to this approach: (1) it only attempts to optimize the total latency, not the per-application latency; (2) the monolithic DAG can be very large, which requires graph cutting to generate a schedule in a timely fashion, which in turn creates a less performant schedule; and (3) as the graph is presented as a monolithic DAG, schedule generation can be slow, since only one ILP solver is run.

Thus, in this work, we consider an alternative approach to multiple applications – independent scheduling. We statically designate a number of slots to each application, and then independently run the scheduler on each application for its designated number of slots. Thus, each application runs with its best performance and does not interfere with the scheduling and ordering of other applications. This, however, does come at the cost of a potentially longer end-to-end latency for the entire group of applications. We illustrate this scheme on the right side of Fig. 6. Note, however, that the FPGA still has a single CAP interface that must be shared across all applications. Our runtime allocates it in a round-robin fashion to each application.

Our DML framework is flexible and can implement either multiple-application mapping strategy based on the optimization goals. We explore the difference between dependent and independent multi-app scheduling in Section 5, and demonstrate scenarios where each is beneficial. Finally, we note that another advantage of independent scheduling is the ability to tune the number of slots based on application characteristics. While intuitively one may think that the number of slots required is proportional to the number of IPs in an application's task graph, our findings show that the number of slots needed depends on the topology of the task graph, the latency of the IPs, and the batch size used. Section 5.7 provides a quantitative analysis and demonstrates the benefit of providing a bespoke number of slots to each application.
4.4 Runtime Implementation

The DML runtime executes the schedule generated by the scheduler. However, the static ILP scheduler is fed simulated latencies of the IPs, which can differ from the actual hardware latency. Thus, instead of using the exact IP start times predicted by the scheduler, DML's dynamic runtime uses the mapping and global DPR order generated by the ILP-based scheduler, along with the dependencies of the task graph, to execute the application. During execution, the runtime iterates over the global DPR order. When the slot of the next IP in the global DPR order is available, and no previous DPR is running, the runtime will start DPR in the slot denoted by the mapping. While waiting for the next DPR, the runtime iterates over each IP that has already been configured and checks if its dependencies are complete. If the dependencies are complete, the runtime will start the execution of the IP. This approach is beneficial as it combines the static high-order information from the ILP-based scheduler, such as the mapping and global DPR order, with the dynamic information the runtime has, such as the exact time DPR becomes available or an application's dependencies complete. While the operation of the dynamic runtime makes it possible that the IP start time in hardware differs from that predicted by the static scheduler, this does not impact performance. We are simply adjusting DPR or IP start times by using available slack in the schedule, between IP execution ending and DPR beginning in a slot, while still following the global DPR order, IP slot mapping, and dependencies. We quantitatively show in Section 5.3.1 that the speedup achieved in hardware matches or outperforms that estimated by the scheduler, which is further explained in our evaluation.
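The following event-loop sketch captures this behavior. The hardware interface calls (dpr_busy, slot_is_idle, start_dpr, partial_bitstream, dpr_complete, start_ip, ip_done) are hypothetical stand-ins for the baremetal BSP calls the real runtime uses; the real runtime is C software on the PS, and this Python rendering is only illustrative.

# Polling event-loop sketch of the dynamic runtime described above.
def run(global_dpr_order, ip_slot_map, dependencies):
    configured, running, finished = set(), set(), set()
    next_dpr = 0
    while len(finished) < len(global_dpr_order):
        # Issue the next reconfiguration when its slot is free and the
        # single configuration port is idle (one DPR at a time).
        if next_dpr < len(global_dpr_order) and not dpr_busy():
            ip = global_dpr_order[next_dpr]
            if slot_is_idle(ip_slot_map[ip]):
                start_dpr(ip_slot_map[ip], partial_bitstream(ip))
                configured.add(ip)
                next_dpr += 1
        # While waiting, launch any configured IP whose dependencies are met.
        for ip in configured - running - finished:
            if dpr_complete(ip) and all(d in finished for d in dependencies[ip]):
                start_ip(ip); running.add(ip)
        for ip in set(running):
            if ip_done(ip):
                running.discard(ip); finished.add(ip)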
5 EVALUATION

We now present an evaluation and exploration of our DML framework. As we discussed in Section 3, our architecture uses slots of uniform resources and interfaces. In this work, our slots include 10000 LUTs, 40 DSPs, and 40 BRAM18 units. We target two different sizes and architectures of FPGAs: (1) a Xilinx Zynq-7000 based Zedboard, and (2) a Xilinx Zynq Ultrascale+ ZCU106 board. Slots on the Zedboard result in partial bitstreams of 1.2 MiB that take 9.5 ms to reconfigure, while the ZCU106 slots are 0.98 MiB in size and take 2.9 ms to reconfigure. This amounts to an average reconfiguration bandwidth of 125 MiB/s and 340 MiB/s for the Zedboard and ZCU106, respectively. We refer to these as 1X sized slots, and also consider slots with twice as many resources, which we call 2X slots.

We consider real-world benchmarks, as provided by the Rosetta [13] benchmark suite – 3D Rendering (3DR), Digit Recognition (DR), and Optical Flow (OF) – and in-house developed accelerators for Alexnet (AL4) and LeNet (LN) neural networks, and Image Compression (IMGC). All benchmarks were developed with Xilinx HLS tools, and we modified the benchmarks by splitting them up into smaller task modules and generating uniform AXI interfaces (via HLS pragmas) for the IPs. We attempt to get the best possible performance from the available slot resources, and where possible we split the task IPs across multiple slots, especially in data-parallel applications like the DNNs (AL4 has four parallel branches). We also consider four synthetic graphs, similar to those in Fig. 3b, to add diverse patterns to our multi-application studies.

5.1 Methodology

Our scheduler is written in Python and uses CVXPY [17] and Gurobi 8.1 [18] for the ILP backend. Our experiments are run on a cluster with Intel Xeon E5-2680 v4 CPUs, and we restrict Gurobi to use four threads only.¹ The ILP latencies of the applications' task IPs have been generated from Xilinx Vivado HLS synthesis and co-simulation. In this section, we present data generated by our scheduler to perform a detailed sweep and design space explorations, and to prove the scalability of the methodology. We present data for both the Zedboard and the ZCU106, as they have different architectures and DPR latencies. We also provide hardware validation of our methodology by running the generated schedules on the Zedboard and the ZCU106. We performed manual floorplanning on both devices to carve out equal-sized programmable regions. We were able to fit four 1X sized slots on the Zedboard, and ten 1X sized slots on the ZCU106 board. Fig. 7 shows these two floorplans, with labeled red rectangles denoting the slots.

¹ To find a solution in reasonable time, we limit solver time to between 600 and 720 seconds. We empirically determined task graphs of 25 vertices to be the upper bound that the ILP could attempt to solve before we require partitioning, as described in Section 4.3. The partitioner uses cuts of size 10 to 15, and defaults to the fair-cut method.

Fig. 7. Floorplans used to evaluate DML on (a) Zedboard and (b) Zynq Ultrascale+ ZCU106, for 1X size slots.

Fig. 7b shows the 10-slot floorplan used on the ZCU106. We can see that the static region, which hosts the AXI interconnects, is placed in the middle of the board, near the PS-PL interface, with the programmable slots on the outer edges of the device. We also note the different slot aspect ratios on the ZCU106 floorplan (i.e., Slot 6 is taller and thinner than Slot 0). This is necessary as Xilinx FPGAs are not uniform in their placement of DSP and BRAM columns. Hence, some slots must be taller and thinner to consume the correct number of BRAMs and DSPs. We did not find that this difference in aspect ratio impacted the timing of the IPs within the slots. Note that we attempted to fit 12 1X sized slots on the ZCU106, but the additional slots created routing congestion and we could not meet timing.
The runtime is run on the baremetal platform provided by Xilinx. We utilize APIs available in the Board Support Package (BSP) to configure slots via the Processor Configuration Access Port (PCAP), which is the CAP connected to the PS on the SoC. The global DPR order, IP slot mappings, and dependencies are generated by the ILP static scheduler as global array definitions in a header file, which is read by the runtime when executing the application(s).

5.2 Exploring the Impact of DPR on Performance

We begin by exploring the impact of batch size and number of slots. We first present the application's end-to-end latency, as reported by the scheduler, normalized against the end-to-end latency of the application without DPR overheads (the baseline). For now, we consider bulk batching. Figs. 8 and 9 present the normalized latency of applications, when mapped to 1X sized slots, on a Zynq-7000 (Zedboard) and a Zynq Ultrascale+ (ZCU106) device. Applications that demonstrate a normalized latency of 1.0 are operating completely unperturbed by DPR overheads. In this study, we sweep the size of the batch as well as the number of available slots. These results use HLS-estimated latencies and measured DPR time. The HLS-estimated latencies for all six applications range from 7.18 ms to 1.91 minutes. The DPR time of a 1X slot on the Zedboard and ZCU106 is 9.5 ms and 2.9 ms, respectively. This is why Figs. 8 and 9 show such diversity in the performance impact of DPR.

Fig. 8. Normalized end-to-end latency of applications across batch sizes and number of available slots, for 1X sized slots on a Zynq-7000 device. For ease of presentation, IMGC batch size-1 has been cut off, and extends to 6.2X.

Fig. 9. Normalized end-to-end latency of applications across batch sizes and number of available slots, for 1X sized slots on a Zynq Ultrascale+.

As we can see, batch size can have a significant impact on the effective latency, as it helps amortize the cost of DPR. For a batch size of 1, many applications are unable to effectively mask the DPR latency. This is very clear in LN and IMGC, where the latency of DPR can be greater than that of the IPs or the application itself. In the case of AL4, the size of the application requires multiple PRs to be done, thus incurring more overhead. However, as the batch size increases, we see that almost all applications are able to mask the latency of DPR. This matches our expectation and confirms that the scheduler is performing as expected. Alexnet (AL4) performs poorly even at large batch sizes when the number of slots available is just two. This is because its task graph is large and has many parallel branches. However, due to the large amount of task parallelism available in AL4, it is able to effectively utilize the additional slots. The remaining apps do not benefit much beyond two slots when we apply bulk batching, due to limited task parallelism. Finally, Figs. 8 and 9 show that the reduced DPR latency on the Zynq Ultrascale+ provides a significant reduction in overhead. Overall, with the exception of AlexNet (AL4), we were able to match the baseline performance, despite DPR overheads, given a large enough batch or a sufficient number of slots with bulk batching alone.

5.3 Exploring Batching Strategies

Next, we consider the performance of our generated schedules across different batching strategies. For the sake of brevity, we will not sweep across the number of slots, and will consider the performance of the Zedboard with its four slots, and the ZCU106 with its ten slots. We consider batch sizes of 4 and 32, and explore five scenarios – bulk batching, pipelining only, pipelining and four-way parallel batching (Parallel 4X), pipelining and eight-way parallel batching (Parallel 8X), and No DPR. No DPR is a hypothetical scenario where we do not use DPR. Instead, we statically fill 80% of the FPGA with as many copies as possible of the entire task graph to mimic a data-parallel approach to computing. We assume 20% of the FPGA's resources are needed for units like AXI crossbars, memory controllers, etc. Figs. 10 and 11 present the end-to-end application speedup over the baseline, which assumes bulk batching as well but no DPR.

Enabling pipelining allows the scheduler to effectively overlap multiple IP executions. The amount of overlapping is determined by the DPR latency, the DAG topology, the number of slots available, and the batch size. Thus, we see that pipelining alone is able to speed up execution by up to 2.2X and 2.73X on the Zedboard and ZCU106, respectively, for a batch size of 32. On average, we see speedups of 1.5X and 1.8X on the Zedboard and ZCU106, respectively. Note that for small batch sizes, pipelining is unable to provide speedups for applications like IMGC, where the latency of DPR dominates execution.

By enabling parallelization, we unlock even more opportunities for the scheduler. Note, however, that the lack of available slots limits its ability to fully exploit the parallelism, and the increased task graph size forces us to use graph partitioning, which further limits the performance of the schedules. As we can see on the Zedboard, parallelism provides up to 3.9X speedups, with an average of 2.67X. Note that the performance of eight-way parallel batching is poor, due to the lack of available slots and graph partitioning. In contrast, we see up to 6.8X speedup on the ZCU106 with the help of eight-way parallel batching, and 4.15X on average. Thus, our scheduler is able to effectively utilize the available FPGA resources and the parallelism in the graphs and batches.
Fig. 10. Impact of batching strategies. Relative speedup shown for batch Fig. 12. Impact of different batching strategies. Slot utilization shown for
size 4 and 32 on a Zedboard, as predicted by the scheduler. No DPR sol- batch size 4 and 32 on a Zedboard, as predicted by the scheduler.
utions cannot fit a single instance of IMGC, OF, and AL4 in a Zedboard.
Note: 8-way parallel batching cannot be done on batch of 4.

Fig. 13. Impact of different batching strategies. Slot utilization shown for
batch size 4 and 32 on a ZCU106, as predicted by the scheduler.
Fig. 11. Impact of batching strategies. Relative speedup shown for batch our utilization at 50% on average, while enabling paralleli-
size 4 and 32 on a ZCU106, as predicted by the scheduler. DR on batch zation brings it up to 84% on average, and up to 99%. The
of 32 in No DPR scenario is clipped for presentation, and it extends to
31.8X. Note: 8-way parallel batching cannot be done on batch of 4. ZCU106 has more resources, and can be harder to keep
busy for small applications and batch sizes. However, with
the lack of available slots and graph partitioning. In con- the help of parallelization, we are able to effectively utilize
trast, we see up to 6.8X speedup on the ZCU106 with the 53% on average, and up to 88%. Note that parallelization
help of eight-way parallel batching, and 4.15X on average. approach forces the scheduler to partition and perform
Thus, our scheduler is able to effectively utilize the available localized scheduling which limits the efficiency.
FPGA resources and parallelism in the graphs and batches.
We also note that the relatively limited resources of the Zedboard do not allow many of the applications to map to it. Thus, we see that the No DPR solution was unable to provide a solution for OF, IMGC, and AL4. This further highlights the need for our DML strategy, which allows compute to be mapped efficiently to any device. In cases where the applications do fit, our pipelining and parallelization approaches are able to perform better through better utilization of the FPGA resources. On the ZCU106, which has significantly more resources, we see that for small batch sizes it might be advantageous to simply instantiate copies of the application (No DPR) if the application is very small, such as DR. However, given enough parallelism, we see that our scheduler is still able to better utilize the FPGA resources, even on the ZCU106.
Next, we examine the effective utilization of resources by our methodology by considering the slot utilization. Effective slot utilization measures what percentage of the available slots were used, on average, over the run of the application. A utilization of 100% would imply all slots were used across the entire run. Figs. 12 and 13 present our analysis.

Once again we see the effectiveness of our scheduling solutions. On the Zedboard, pipelining alone is able to keep our utilization at 50% on average, while enabling parallelization brings it up to 84% on average, and up to 99%. The ZCU106 has more resources, and can be harder to keep busy for small applications and batch sizes. However, with the help of parallelization, we are able to effectively utilize 53% of the slots on average, and up to 88%. Note that the parallelization approach forces the scheduler to partition and perform localized scheduling, which limits the efficiency.
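The metric itself is straightforward to compute from a schedule. The sketch below assumes a hypothetical schedule representation of (start, end, slot) tuples; it is not the scheduler's internal format.

```python
# Effective slot utilization: busy slot-time divided by (slots x makespan).

def slot_utilization(intervals, num_slots):
    if not intervals:
        return 0.0
    span_begin = min(start for start, _, _ in intervals)
    span_end = max(end for _, end, _ in intervals)
    busy = sum(end - start for start, end, _ in intervals)
    return busy / (num_slots * (span_end - span_begin))

# One slot busy for the whole run, a second joining halfway -> 75%.
print(f"{slot_utilization([(0.0, 10.0, 0), (5.0, 10.0, 1)], 2):.0%}")
```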
5.3.1 HW Evaluation
Having demonstrated the efficacy of our scheduler and its different batching strategies, we will now evaluate them on real systems. Figs. 14 and 15 show the speedup, with the same baseline, for bulk-batching, pipelining, four-way parallel batching, and eight-way parallel batching on the Zedboard and ZCU106, respectively. Once again, we observe that pipelining and parallelism can greatly increase the performance of partial reconfiguration applications on our hardware implementation. On the Zedboard, for batch 32, we observe an average speedup of 2.87X across all applications and a max speedup of 3.98X. On the ZCU106, for batch 32, we observe an average speedup of 4.99X and a max speedup of 7.65X. Here we observe that the speedup seen in the hardware implementation is higher than that predicted by the scheduler. For a batch size of 32, the average speedup across all applications as measured on hardware is 1.04X and 1.12X higher than that predicted by the scheduler for the Zedboard and ZCU106, respectively. This can be attributed to the IPs running slower in hardware than predicted by Vivado HLS.
Fig. 14. Impact of different batching strategies. Relative speedup shown for batch size 4 and 32 on a Zedboard, as measured on hardware. Note: 8-way parallel batching cannot be done on batch of 4.

Fig. 15. Impact of different batching strategies. Relative speedup shown for batch size 4 and 32 on a ZCU106, as measured on hardware. Note: 8-way parallel batching cannot be done on batch of 4.

Fig. 16. Impact of slot size on performance. Relative speedup shown for batch size 4 and 32 on a ZCU106 with 4-way parallel batching, for slot size 1X and 2X.

Fig. 17. Impact of slot size on performance. Relative speedup shown for batch size 4 and 32 on a ZCU106 with pipelining, for slot size 1X and 2X.
Thus, DPR latency is shorter, relative to the total runtime of the IP, in hardware. This reduces the relative overhead of DPR, making it easier to hide. Hence, the latency predicted by the scheduler is longer, which results in the predicted speedup being more pessimistic. In the case of our parallelization strategies, which provide the best performance, frequent PRs must be performed, and thus the average speedup predicted by the scheduler is slightly lower than what we observe in hardware.
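The arithmetic behind this effect is simple, as the short example below shows with made-up numbers: a slightly slower IP on silicon shrinks the fraction of its runtime that the fixed DPR cost represents.

```python
# Illustrative numbers only: the scheduler prices DPR against the
# (faster) HLS latency estimate, so its view of the overhead is
# pessimistic relative to what the hardware actually experiences.
hls_ms, hw_ms, dpr_ms = 10.0, 11.5, 2.9   # hypothetical IP latency, measured DPR
print("scheduler's view:", dpr_ms / hls_ms)   # 0.29 of the IP runtime
print("on hardware:     ", dpr_ms / hw_ms)    # 0.25 -> easier to hide
```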
Comparing Figs. 14 and 15 with Figs. 10 and 11 shows that the performance trends are the same in both the scheduler and the hardware. In all but two instances, the best performing optimization, as predicted by the scheduler, is the best optimization in hardware, on all boards and batch sizes. This means that the scheduler's performance model is able to effectively model the performance of the hardware implementation. The two discrepancies are LeNet with a batch of 4 on the Zedboard, and LeNet with a batch of 32 on the ZCU106. This is because LeNet contains small IPs which have a very low latency, making it more difficult to hide the latency of DPR. As mentioned previously, the cost of DPR in the scheduler's performance model is higher than that in the hardware. As LeNet already has difficulty masking the latency of DPR, in both discrepancies the scheduler predicts that a solution with less DPR (pipelining with batch 4 and four-way parallel batching with batch 32) would be more performant. However, we see in the hardware that LeNet is actually able to hide the DPR latency, and is most performant with the maximum amount of parallelism available for either batch.

5.4 Choosing Slot Size
So far, in this section we have only considered 1X sized slots. We will now explore the impact of choosing a larger slot size. Figs. 16 and 17 show the speedups for both 1X and 2X slot sizes on the ZCU106 when performing pipelining and four-way parallel batching, respectively. The 1X slot design uses ten slots, while the 2X design uses four slots. As one can see, using 1X slots always achieves a higher or the same speedup when compared to the 2X slots. This is for several reasons. First, not all IPs are able to fill the 2X slot, wasting precious resources. Second, 1X gives more flexibility for how IPs can be mapped across slots in time and space, as there are more IPs and more slots. This gives the ILP scheduler a larger space in which to find a high-performance schedule. Third, 2X slots take almost twice as long to reconfigure, thereby increasing the impact of DPR latency. Finally, 1X slots give more fine-grained specialization than 2X, allowing each IP to be more specialized to the specific computation it is performing.
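To see why the reconfiguration cost alone can favor 1X slots, consider the rough chain model below. The 2X DPR time is an assumption derived from the "almost twice as long" observation above, and the IP latencies are illustrative; the model also ignores the resource-fit and flexibility effects, which further favor 1X.

```python
# Rough chain model isolating the third effect: with prefetch
# (>= 2 slots), the exposed DPR per IP is max(0, dpr - previous IP's
# compute time), so a larger DPR time is harder to hide.

DPR_1X, DPR_2X = 2.9, 5.8          # ms; 1X measured, 2X assumed ~2x
ips = [1.5, 3.0, 2.2, 4.1]         # hypothetical per-IP latencies (ms)

def chain_latency(ip_ms, dpr_ms, slots):
    if slots < 2:
        return sum(dpr_ms + t for t in ip_ms)
    total = dpr_ms + ip_ms[0]
    for prev, cur in zip(ip_ms, ip_ms[1:]):
        total += max(0.0, dpr_ms - prev) + cur
    return total

print("ten 1X slots :", chain_latency(ips, DPR_1X, slots=10))  # 15.8 ms
print("four 2X slots:", chain_latency(ips, DPR_2X, slots=4))   # 27.3 ms
```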
5.5 Scalability and Multiple Applications
We now demonstrate the scalability of our solution, and use our scheduler to simultaneously map ten applications (the six applications previously mentioned plus four synthetic benchmarks) to a single FPGA, across varying batch, resource, and slot sizes, and evaluate our fair-cut and sorted-cut methods. Table 1 presents the scale of the problem for a batch size of 32, with eight available slots, with pipelining enabled.
TABLE 1
Problem Size for Mapping Multiple Applications

Slot Size   Total Time (s)   ILP Time (s)   Nodes   Max Vars   Min Vars   Avg Vars
1X          768.3            23.4           178     3735       2808       3657.5
2X          318.9            22.1           83      3735       1068       3290.5

TABLE 3
Speedups on Hardware and Scheduler Over Coarse-Grained Multi-App Scheduling for Four Apps

             Fair                       Sorted
Batch    Sched    HW               Sched    HW
4        1.30x    1.34x            1.44x    1.28x
16       1.40x    1.30x            1.64x    1.48x
32       1.42x    Out-of-Memory    1.68x    Out-of-Memory

Since our scheduler runs ILP on several cuts of the graph, we present the average number of ILP variables that are solved, along with the max and min. We also list the total time taken to find the final solution, as well as the total time spent solving the ILP alone.
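As a flavor of what one such per-cut ILP looks like, the toy CVXPY [17] formulation below assigns the tasks of a single cut to slots so that the most loaded slot finishes as early as possible. The real formulation also encodes precedence edges, reconfiguration costs, and pipelining; the task durations here are hypothetical.

```python
# Toy per-cut ILP in CVXPY: task-to-slot assignment minimizing the
# most loaded slot. Durations are illustrative, not from our benchmarks.
import cvxpy as cp
import numpy as np

dur = np.array([3.0, 1.0, 4.0, 2.0, 5.0])    # task latencies in one cut
n_tasks, n_slots = len(dur), 3

assign = cp.Variable((n_tasks, n_slots), boolean=True)
constraints = [cp.sum(assign, axis=1) == 1]  # each task gets exactly one slot
slot_load = assign.T @ dur                   # total work queued per slot
problem = cp.Problem(cp.Minimize(cp.max(slot_load)), constraints)
problem.solve(solver=cp.GUROBI)              # Gurobi backend, as in [18];
                                             # any MIP-capable solver works
print(problem.value, np.argmax(assign.value, axis=1))
```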
Table 2 presents the normalized speedup of mapping multiple applications onto a single FPGA, as predicted by the scheduler. We consider FPGAs ranging from a small edge-scale device with four slots to a large cloud-scale device with sixteen slots, and consider slot sizes of 1X and 2X, and batch sizes of 4 to 32. Here we consider coarse-grained multi-app scheduling, as discussed in Section 4.3, to be our baseline, wherein each application executes serially, incurring a one-time reconfiguration cost for each application, and we assume that there are enough resources for the entire application to fit. Using this baseline, we present the speedup, as reported by our scheduler, for the sorted-cut and fair-cut schemes.

We observe that our schedule is faster than the baseline in almost every case. This is in part due to our scheduler's ability to pipeline and overlap the execution of IPs, even within smaller graph cuts. Here we note that the sorted-cut and fair-cut strategies perform similarly. In this group of applications, a few select applications dominate the end-to-end runtime. Thus, no matter how we perform the graph cuts, the critical path is determined by the same set of vertices. Also, we can see that for larger batch sizes and more slots, the effective slot utilization is poor. This is due to the large disparity in task graphs. Larger graphs, with long latencies and limited task-parallelism, consume the tail end of the schedule, and only require one or two slots.

TABLE 2
Mapping Multiple Applications

Speedup
                      1X                   2X
Batch   Slots    Fair     Sorted      Fair     Sorted
4       4        0.93x    0.93x       1.51x    1.49x
4       8        1.16x    1.15x       2.14x    2.36x
4       10       1.21x    1.15x       2.48x    2.44x
4       16       1.22x    1.19x       2.57x    2.49x
16      4        1.15x    1.15x       1.51x    1.50x
16      8        1.47x    1.43x       2.13x    2.42x
16      10       1.52x    1.44x       2.48x    2.52x
16      16       1.52x    1.49x       2.55x    2.53x
32      4        1.16x    1.16x       1.51x    1.50x
32      8        1.48x    1.43x       2.13x    2.43x
32      10       1.53x    1.45x       2.48x    2.53x
32      16       1.53x    1.50x       2.55x    2.54x

Slot Utilization
                      1X                   2X
Batch   Slots    Fair     Sorted      Fair     Sorted
4       4        0.64     0.63        0.89     0.86
4       8        0.41     0.41        0.61     0.67
4       10       0.34     0.33        0.59     0.56
4       16       0.21     0.21        0.38     0.35
16      4        0.66     0.64        0.88     0.86
16      8        0.43     0.43        0.60     0.68
16      10       0.37     0.35        0.59     0.57
16      16       0.23     0.23        0.38     0.36
32      4        0.66     0.64        0.88     0.86
32      8        0.44     0.43        0.60     0.68
32      10       0.37     0.35        0.59     0.58
32      16       0.23     0.23        0.39     0.36
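The precise fair-cut and sorted-cut policies are defined in Section 4.3; as a rough sketch of the difference, fair-cut style partitioning draws tasks evenly across applications, while sorted-cut style partitioning orders tasks (here by an assumed latency key) before slicing. The helper below is illustrative only, and it deliberately ignores precedence constraints, which the real scheduler preserves when forming cuts.

```python
# Illustrative cut builders: both bound the per-ILP problem size.
from itertools import zip_longest

def fair_cuts(apps, cut_size):
    # apps: one task list per application, in topological order
    mixed = [t for level in zip_longest(*apps) for t in level if t is not None]
    return [mixed[i:i + cut_size] for i in range(0, len(mixed), cut_size)]

def sorted_cuts(apps, cut_size, latency):
    ordered = sorted((t for tasks in apps for t in tasks), key=latency, reverse=True)
    return [ordered[i:i + cut_size] for i in range(0, len(ordered), cut_size)]
```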
Once again, we will validate our scheduler by testing it in hardware. We consider four applications, OF, LN, AL4, and 3DR, running simultaneously on the ZCU106, with ten 1X slots. We present the end-to-end speedup as measured on the board versus the scheduler prediction in Table 3. As we can see, the hardware performance once again matches the predicted scheduler performance. Note that the ZCU106 did not have sufficient memory (off-chip DRAM) to host all four applications with a batch size of 32. In addition, unlike our ten-application experiment, the sorted-cut strategy outperforms the fair-cut strategy, as the end-to-end latency is not dominated by a single application in the tail-end.

5.6 Dependent versus Independent Scheduling for Multiple Applications
We will now explore the impact of different scheduling approaches for multiple applications. As discussed in Section 4.3, the DML framework allows for two multi-application scheduling schemes: independent-scheduling and dependent-scheduling. We use both scheduling schemes to generate multi-app schedules for four concurrently running applications: LeNet, AlexNet, Optical Flow, and 3D Rendering, and consider the total end-to-end latency of all four applications, as well as the end-to-end latency of each application. We run our experiments on the ZCU106 board with ten 1X slots, and a batch size of 16. For independent scheduling, we consider three different allocations of slots per application. As discussed in Section 4.3, in dependent scheduling, the scheduler will allocate slots to IPs from the global pool and try to optimize the total end-to-end latency.

Fig. 18 shows the end-to-end speedup of the independent schedule over the dependent schedule for the total runtime of all applications and the per-application runtime. The per-application slot allocations are shown in the legend. Independent scheduling always has worse end-to-end latency, and on average increases the end-to-end latency by 1.74X. This is because dependent scheduling optimizes over a monolithic graph across all available slots, so it can prioritize the applications with the longest latency. However, dependent scheduling only optimizes the end-to-end latency, which can severely hurt single-application performance by unfairly providing slots to the slowest application.
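The structural difference between the two schemes can be summarized in a few lines. In the sketch below, schedule() and merge() are stand-ins for our per-cut ILP solver and DAG merging step, and the slot split shown in the comment is one hypothetical allocation.

```python
# Structural contrast between the two multi-app scheduling schemes.

def independent_schedule(apps, slot_split, schedule):
    # e.g. slot_split = {"AL4": 5, "OF": 3, "LN": 1, "3DR": 1}
    spans = [schedule(dag, slots=slot_split[name]) for name, dag in apps]
    return max(spans)    # apps run side by side on disjoint slot pools

def dependent_schedule(apps, total_slots, schedule, merge):
    merged = merge([dag for _, dag in apps])   # one monolithic graph
    return schedule(merged, slots=total_slots)
```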
ever, dependent scheduling only optimizes the end-to-end
Fig. 18. Speedup of independent over dependent scheduling for total end-to-end latency and on a per-application basis. We consider a mix of four applications, with a batch size of 16, and measure performance on a ZCU106 in hardware. Legend denotes the number of slots provided to each application as follows: NAME: SLOTS.

Fig. 20. Effect of number of slots on the performance of the AlexNet (AL4) and Optical Flow (OF) benchmarks.
Fig. 18 shows that when running four applications, providing two dedicated slots to LeNet gives a speedup of 14.4X, and providing a single dedicated slot to LeNet gives a speedup of 9.8X, over the dependently scheduled case. This can also be seen in 3DR, where independent scheduling provides a speedup of 15.7X and 10.7X for two and one dedicated slots, respectively. These speedups are so large because dependent scheduling will prioritize the applications with the longest latency, which here is AlexNet. AlexNet has 38 IPs, which is much more than Optical Flow's nine IPs, LeNet's three IPs, or 3DR's three IPs. AlexNet also has parallel branches in its DAG, allowing it to consume many slots at the same time to further increase performance. Thus, the dependent scheduler will give a large majority of the slots to AlexNet to reduce its end-to-end latency as much as possible. While this is good for the total end-to-end latency, as well as the latency of AlexNet, one can see that it unfairly hurts the performance of the other applications. In this case, it is beneficial to use independent scheduling to share the slots across the four applications.

In contrast, we will now consider cases where dependent scheduling is better. Fig. 19 shows the speedup that independent scheduling has over dependent scheduling when scheduling three applications: LeNet, Optical Flow, and 3DR. In this scenario, we see that independent scheduling actually hinders the performance. This is due to two reasons: First, these three applications consist of 15 IPs, which is small enough for the dependent scheduler to find an optimal solution, without partitioning, for the entire DAG. Second, these applications have very limited branching, and do not require a large number of slots for maximum performance. In this scenario, there is less contention for slots, which results in improved per-application performance. Thus, it is better to use dependent scheduling.

5.7 Impact of Number of Slots on Scheduling Multiple Applications
We have just shown that the best multi-app scheduling scheme is highly dependent on the topology of the DAGs in the applications. To further quantify this phenomenon, Fig. 20 shows the performance impact of increasing the number of slots, for two different batch sizes, for our two applications with the most IPs – AlexNet and Optical Flow – both run with pipelining and no parallel batching. AlexNet has 38 IPs with many parallel branches, while Optical Flow has 9 IPs and no parallel branches. Fig. 20 shows that the performance of AlexNet continues to increase as we increase the number of slots. This is due to the fact that AlexNet has many parallel branches, so it can utilize all of the slots provided, achieving a maximum speedup of 6.60X when using twelve slots. On the other hand, Optical Flow sees no increase in performance after four and six slots when using a batch of four and sixteen, respectively. This is because there are no branches in Optical Flow, and the scheduler can reuse the slots and obtain the same performance. Increasing the batch size increases the number of IPs which can run concurrently; however, even with a batch of 16, the speedup when using twelve slots is only 2.60X, despite there being nine IPs in the application. This trend can be used to help determine whether dependent or independent scheduling is better for a given set of applications.
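A rough proxy for this trend is the width of the application DAG: the largest number of simultaneously ready IPs bounds how many slots a single batch element can occupy. The level-by-level sweep below is a simplification (batching raises the effective width), but it captures why Optical Flow plateaus while AlexNet keeps scaling.

```python
# Level-by-level DAG width: an upper bound on how many IPs of one
# batch element can be ready at once, and hence on useful slots.
from collections import defaultdict, deque

def max_width(edges, num_tasks):
    indeg = defaultdict(int)
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    frontier = deque(t for t in range(num_tasks) if indeg[t] == 0)
    widest = 0
    while frontier:
        widest = max(widest, len(frontier))
        for _ in range(len(frontier)):       # retire one whole level
            u = frontier.popleft()
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    frontier.append(v)
    return widest

# A 9-IP straight chain (Optical Flow-like) can never fill more than 1 slot.
print(max_width([(i, i + 1) for i in range(8)], 9))   # -> 1
```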

Fig. 19. Speedup of independent over dependent scheduling for total end-to-end latency and on a per-application basis. We consider a mix of three applications, with a batch size of 16, and measure performance on a ZCU106 in hardware. Legend denotes the number of slots provided to each application as follows: NAME: SLOTS.

5.8 Comparison With Previous Work
We compare our work with an ILP-based approach and a heuristic-optimized approach [4], [5] – Iterative Scheduling (IS), which limits the number of tasks per iteration to 1 (IS-1) and 5 (IS-5). The results of this comparison can be seen in Table 4. Since we have used a novel architecture in this work, and real-world benchmarks, it is difficult to perform a fair one-to-one comparison with [4], [5], where they use synthetic benchmarks.
TABLE 4
Comparative Performance of Our ILP Solution (Runtimes in Seconds)

        This Work              [5], [4]
Nodes   Time (s)     Num Nodes   IS-1     IS-5    PAR/IS-5
7       3.6          10          0.95     34.70   4.730
28      18.06        30          26.02    764     90.30
56      39.8         50          115.6    393     135.36

Thus, we use our four synthetic test benchmarks as well, and target the same system: a Zedboard with four 1X slots. An ILP-based solution will provide an optimal (or near-optimal for the iterative schedulers in [5] and [4]) solution. Thus, we restrict our comparison to ILP runtimes only, for a similar number of nodes in the graphs. Since [5] and [4] do not have pipelining support, we set our batch size to one to disable any pipelining optimization. As we can see, the simplification of our ILP constraints allows us to tackle large problems faster and more efficiently, hence proving the efficacy of our approach. Note that IS-1 is faster than us; however, that approach schedules one task in the queue at a time, which limits the quality of the solution. In contrast, we attempt to schedule up to 25 task nodes first, before partitioning the graph into cuts of 15 tasks.
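For reference, the windowing policy just described amounts to the following sketch, where ilp_solve() stands in for the real per-cut solver, the 25/15 thresholds come from the text, and a real implementation would cut along precedence levels rather than blindly by index.

```python
# Sketch of the scheduling window policy: small graphs are solved as a
# single ILP; larger graphs are split into fixed-size cuts.

def schedule_in_cuts(tasks, ilp_solve, whole_limit=25, cut=15):
    if len(tasks) <= whole_limit:
        return [ilp_solve(tasks)]
    return [ilp_solve(tasks[i:i + cut]) for i in range(0, len(tasks), cut)]
```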
6 CONCLUSION
In this work we presented DML, an end-to-end DPR scheduling and mapping solution. DML is generic, considers FPGAs of all sizes, provides a scalable and portable architecture that reduces design effort, and includes a novel ILP-based scheduler that provides the lowest-latency schedule, performs pipelining and parallelization across batch elements, and is capable of simultaneously scheduling and mapping multiple applications at once.

We demonstrated the efficacy of our solution via an extensive design space exploration with our scheduler, and validated our methodology on edge and cloud scale FPGAs – a Zedboard and a ZCU106. Our evaluation demonstrated our scheduler's ability to pipeline and parallelize the solution, with an average speedup of 5X and up to 7.65X on a ZCU106. Finally, we explored the trade-offs between simultaneously mapping multiple applications to a single FPGA versus partitioning and allocating resources to each application individually.

ACKNOWLEDGMENTS
Edward Richter and Mang Yu have contributed equally to this work.

REFERENCES
[1] B. Hutchings and M. Wirthlin, "Rapid implementation of a partially reconfigurable video system with PYNQ," in Proc. Int. Conf. Field Program. Logic Appl., 2017, pp. 1–8.
[2] D. Koch et al., "Partial reconfiguration on FPGAs in practice - Tools and applications," in Proc. IEEE ARCS, 2012, pp. 1–12.
[3] G. Charitopoulos, I. Koidis, K. Papadimitriou, and D. Pnevmatikatos, "Hardware task scheduling for partially reconfigurable FPGAs," in Proc. Appl. Reconfigurable Comput., 2015, pp. 487–498.
[4] A. Purgato, D. Tantillo, M. Rabozzi, D. Sciuto, and M. D. Santambrogio, "Resource-efficient scheduling for partially-reconfigurable FPGA-based systems," in Proc. IEEE Int. Parallel Distrib. Process. Symp. Workshops, 2016, pp. 189–197.
[5] E. A. Deiana, M. Rabozzi, R. Cattaneo, and M. D. Santambrogio, "A multiobjective reconfiguration-aware scheduler for FPGA-based heterogeneous architectures," in Proc. Int. Conf. ReConFigurable Comput. FPGAs, 2015, pp. 1–6.
[6] R. Cordone et al., "Partitioning and scheduling of task graphs on partially dynamically reconfigurable FPGAs," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 28, no. 5, pp. 662–675, May 2009.
[7] F. Redaelli, M. D. Santambrogio, and D. Sciuto, "Task scheduling with configuration prefetching and anti-fragmentation techniques on dynamically reconfigurable systems," in Proc. Des. Automat. Test Eur., 2008, pp. 519–522.
[8] F. Redaelli et al., "An ILP formulation for the task graph scheduling problem tailored to bi-dimensional reconfigurable architectures," in Proc. Int. Conf. Reconfigurable Comput. FPGAs, 2008, pp. 97–102.
[9] A. Agne et al., "ReconOS: An operating system approach for reconfigurable computing," IEEE Micro, vol. 34, no. 1, pp. 60–71, Jan./Feb. 2014.
[10] A. Rodríguez et al., "FPGA-based high-performance embedded systems for adaptive edge computing in cyber-physical systems: The ARTICo3 framework," Sensors, vol. 18, no. 6, 2018, Art. no. 1877.
[11] B. Seyoum et al., "Automating the design flow under dynamic partial reconfiguration for hardware-software co-design in FPGA SoC," in Proc. Annu. ACM Symp. Appl. Comput., 2021, pp. 481–490.
[12] M. Rabozzi et al., "Floorplanning automation for partial-reconfigurable FPGAs via feasible placements generation," IEEE Trans. Very Large Scale Integr. Syst., vol. 25, no. 1, pp. 151–164, Jan. 2017.
[13] Y. Zhou et al., "Rosetta: A realistic high-level synthesis benchmark suite for software programmable FPGAs," in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, 2018, pp. 269–278.
[14] Xilinx, Zynq-7000 SoC Data Sheet: Overview. Accessed: Jul. 2, 2018. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf
[15] Xilinx, Zynq UltraScale+ MPSoC Data Sheet: Overview. Accessed: May 26, 2021. [Online]. Available: https://www.xilinx.com/support/documentation/data_sheets/ds891-zynq-ultrascale-plus-overview.pdf
[16] Xilinx, Partial Reconfiguration Flow on Zynq using Vivado. [Online]. Available: https://www.xilinx.com/support/university/vivado/vivado-workshops/Vivado-partial-reconfiguration-flow-zynq.html
[17] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," J. Mach. Learn. Res., vol. 17, no. 83, pp. 1–5, 2016.
[18] Gurobi Optimizer, Gurobi, v8.1. [Online]. Available: https://www.gurobi.com/products/gurobi-optimizer/
Ashutosh Dhar (Member, IEEE) received the MS and PhD degrees from the University of Illinois, Urbana-Champaign. His research interests include computer architecture, reconfigurable and heterogeneous architectures, and the application of reconfiguration in conventional architectures.

Edward Richter (Student Member, IEEE) received the BS and MS degrees in electrical and computer engineering from the University of Arizona. He is currently working toward the PhD degree with the Electrical and Computer Engineering Department, University of Illinois, Urbana-Champaign. His research interests include utilizing reconfigurable computing platforms for acceleration and enabling architectural and system-wide research.

Mang Yu received the BS and MS degrees from the University of Illinois, Urbana-Champaign. His research focuses on utilizing machine learning methods to improve accelerator designs for reconfigurable computing platforms.

Wei Zuo received the MS degree from the Electrical and Computer Engineering Department, University of Illinois, Urbana-Champaign, where she is currently working toward the PhD degree. Her research interests include hardware-software co-design, developing automated frameworks for accurate system-level modeling, and efficient hardware–software partitioning for applications mapped to SoC platforms.

Xiaohao Wang received the BS and MS degrees from the University of Illinois, Urbana-Champaign. His research focuses on hardware systems.

Nam Sung Kim (Fellow, IEEE) is currently a professor with the University of Illinois, Urbana-Champaign. He has authored or coauthored more than 200 refereed articles in highly-selective conferences and journals in the field of digital circuits, processor architecture, and computer-aided design. The top three most frequently cited papers have more than 4000 citations, and the total number of citations of all his papers exceeds 11,000. He was the recipient of the IEEE International Symposium on Microarchitecture (MICRO) Best Paper Award in 2003, the NSF CAREER Award in 2010, and the ACM/IEEE Most Influential International Symposium on Computer Architecture (ISCA) Paper Award in 2017. He is a hall of fame member of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), MICRO, and ISCA. He is a fellow of ACM.

Deming Chen (Fellow, IEEE) received the BS degree in computer science from the University of Pittsburgh in 1995, and the MS and PhD degrees in computer science from the University of California, Los Angeles, in 2001 and 2005, respectively. He is currently the Abel Bliss Professor of Engineering with the Electrical and Computer Engineering Department, University of Illinois, Urbana-Champaign (UIUC). His research interests include system-level and high-level synthesis, machine learning, computational genomics, reconfigurable computing, and hardware security. He was the recipient of UIUC's Arnold O. Beckman Research Award, the NSF CAREER Award, nine best paper awards, the ACM SIGDA Outstanding New Faculty Award, and the IBM Faculty Award. He is an ACM distinguished speaker and the editor-in-chief of ACM Transactions on Reconfigurable Technology and Systems.
