DML: Dynamic Partial Reconfiguration With Scalable Task Scheduling For Multi-Applications On FPGAs
Abstract—For several emerging applications, FPGA-based computation has shown better latency and energy efficiency than CPU- or GPU-based solutions. We note two clear trends in FPGA-based computing. On the edge, the complexity of applications is increasing, requiring more resources than today's edge FPGAs can provide. In contrast, in the data center, FPGA sizes have increased to the point where multiple applications must be mapped to fully utilize the programmable fabric. While these limitations affect two separate domains, both can be addressed with dynamic partial reconfiguration (DPR); thus, there is renewed interest in deploying DPR for FPGA-based hardware. In this work, we present Doing More with Less (DML) – a methodology for scheduling heterogeneous tasks across an FPGA's resources in a resource-efficient manner while effectively hiding the latency of DPR. With the help of an integer linear programming (ILP) based scheduler, we demonstrate the mapping of diverse computational workloads in both cloud and edge-like scenarios. Our novel contributions include: enabling IP-level pipelining and parallelization in our scheduler to exploit the parallelism available within batches of work, and strategies to map and run multiple applications simultaneously. We apply our methodology to real-world benchmarks on both small (a Zedboard) and large (a ZCU106) FPGAs, across different workload-batching and multiple-application scenarios. Our evaluation demonstrates the real-world efficacy of our solution: our scheduling strategies deliver an average speedup of 5X, and up to 7.65X, on a ZCU106 over a bulk-batching baseline. We also demonstrate the scalability of our scheduler by simultaneously mapping multiple applications to a single FPGA, and explore different approaches to sharing FPGA resources between applications.
Index Terms—Partial reconfiguration, integer linear programming, scheduling, FPGA, dynamic reconfiguration
latency of a given application or group of applications, DARTS' goal is timing predictability, and it works towards a solution for a user-provided time constraint. Furthermore, DML splits hardware applications into IPs, which are independently mapped and scheduled, with the precedence constraints of the task graph represented by the constraints of the ILP. DML can then leverage automatic optimizations such as pipelining and parallelism across IPs.

In comparison to prior attempts, our scheduler distinguishes itself with four novel features: (1) the ability to pipeline, (2) unrolling and parallelizing across batch elements, (3) a graph partitioning optimization to enable the mapping of very large graphs, and (4) simultaneous scheduling and mapping. In addition, in this work we do not rely on synthetic graphs. Rather, we consider multiple real-world applications from the Rosetta [13] benchmark suite and validate our framework by testing on real hardware. Finally, we demonstrate the simultaneous scheduling and mapping of multiple applications on a single FPGA and present insights into which strategies work best for such scenarios.

3 DOING MORE WITH LESS

We now present our Doing More with Less (DML) framework – an end-to-end, generic methodology that enables any workload, or multiple workloads, to be efficiently mapped to FPGAs of all sizes with DPR. Our solution is comprised of two key parts. First, we partition the FPGA into uniform pieces that we call slots, and provide a scalable architecture, as shown in Fig. 1. Second, we propose an ILP-based optimizer that schedules and maps work into slots while amortizing the latency of reconfiguration by overlapping computation with reconfiguration. Our scheduler is capable of pipelining and parallelizing applications by leveraging the data parallelism available across elements in batches of work, and it uses a graph partitioning strategy to map very large task graphs. Finally, our flexible architecture and scheduler enable us to simultaneously schedule and map multiple applications on an FPGA.

Leveraging dynamic partial reconfiguration (DPR) requires manual floorplanning to carve out and designate specific regions as static or dynamically reconfigurable. Thus, the designer must decide where to physically place the accelerator. In addition, there are several architectural constraints and design rules that must be considered. A key limiting factor is the speed of DPR, which is determined by the bandwidth available in the Configuration Access Port (CAP) and the size of the partial bitstream. The CAP bandwidth is architecture specific and may not be changed; however, the size of the partial bitstream is determined by the size of the dynamically reconfigurable region (DPR region) and not by how many resources within the DPR region are in use. Note that while we focus on Xilinx Zynq and Xilinx Zynq Ultrascale+ [14], [15] series FPGA-SoCs in this work, our solution is not limited to Xilinx devices.
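For intuition, the following back-of-the-envelope sketch models the reconfiguration latency of one slot as its bitstream size divided by the configuration port bandwidth; the 400 MB/s bandwidth and 4 MB bitstream are illustrative assumptions, not figures measured in this work.

# DPR latency model: the partial bitstream for a slot is streamed through
# the configuration access port, so reconfiguration time tracks the size of
# the DPR region, not how much of the region the IP actually uses.

CAP_BANDWIDTH_MB_S = 400.0  # assumed configuration port bandwidth (MB/s)
SLOT_BITSTREAM_MB = 4.0     # assumed partial bitstream size for one slot (MB)

def dpr_latency_ms(bitstream_mb: float, bandwidth_mb_s: float) -> float:
    """Time to reconfigure one slot, in milliseconds."""
    return bitstream_mb / bandwidth_mb_s * 1000.0

print(dpr_latency_ms(SLOT_BITSTREAM_MB, CAP_BANDWIDTH_MB_S))  # 10.0 (ms)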
In comparison to prior attempts, our scheduler distin- assign the latency of the IP as the weight of the vertex in the
guishes itself with four novel features: (1) Ability to pipe- task graph. This task graph model is illustrated in Fig. 3a
line, (2) Unrolling and parallelizing across batch elements, where each vertex represents a task of the application that
(3) Graph partitioning optimization to enable the mapping has its own IP. Edge weights may be used to represent the
of very large graphs, and (4) Simultaneous scheduling and communication latency. IPs may be designed in any fashion
mapping. In addition, in this work, we do not rely on syn- and allow users to deliver fine-grained customization on a
thetic graphs. Rather, we consider multiple real-world per-task level. Alternatively, users may choose to group sev-
applications from the Rosetta [13] benchmark suite and vali- eral tasks or kernels into a single large task and IP. Once the
date our framework by testing on real hardware. Finally, we application has been represented as a task graph, it must be
demonstrate the simultaneous scheduling and mapping of scheduled and mapped to slots on the FPGA, which we dis-
multiple applications on a single FPGA and present insights cuss in the next section. A key advantage of our approach is
into what strategies work best for such scenarios. that by grouping together task graphs of multiple applica-
tions, we can create a single task graph. Thus, we can simul-
taneously schedule multiple applications on a single FPGA,
3 DOING MORE WITH LESS without changes to the applications, floorplan, or scheduler.
We now present our Doing More with Less (DML) framework Note that while DAGs by definition cannot have cycles,
– an end-to-end and generic methodology that enables any DML can address statically resolvable cyclical patterns in the
workload, or multiple workloads, to be efficiently mapped task graph by either unrolling the cycles or absorbing the
to FPGAs of all sizes, with DPR. Our solution is comprised cycles into a single node.
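To make the task-graph model concrete, the following minimal sketch represents an application as a weighted DAG; the task names, latencies, and edges are hypothetical, loosely following the example of Fig. 3a.

from dataclasses import dataclass, field

@dataclass
class TaskGraph:
    # vertex -> latency of the IP generated for that task (vertex weight)
    latency: dict[str, float] = field(default_factory=dict)
    # (pred, succ) pairs: pred must complete before succ starts
    edges: set[tuple[str, str]] = field(default_factory=set)

    def add_task(self, name: str, ip_latency: float) -> None:
        self.latency[name] = ip_latency

    def add_dep(self, pred: str, succ: str) -> None:
        self.edges.add((pred, succ))

# Hypothetical five-task application with one IP per task.
g = TaskGraph()
for name, lat in [("i", 4.0), ("j", 9.0), ("k", 6.0), ("l", 3.0), ("m", 5.0)]:
    g.add_task(name, lat)
for dep in [("i", "j"), ("i", "k"), ("j", "m"), ("k", "m"), ("l", "m")]:
    g.add_dep(*dep)

Grouping multiple applications is then simply a union of their vertices and edges into one such graph, which is what allows the scheduler to treat many applications as a single scheduling problem.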
The size, shape, and location of the slots are determined based on the DPR constraints of the FPGA, and their height spans the entire clock region. This eliminates several of the constraints involved in 2D reconfiguration described in earlier works [6], [7], [8]. Slot sizes can be set to ensure that the application's IPs are able to fit into them, if the IP library already exists. Note that the size of a slot determines the latency of DPR and limits the performance of an IP that can be mapped into it. Should an application's task require an IP that is too large for a slot, then we must either split the IP into smaller partitions that map to more than one slot, or we must scale back the IP's performance and reduce its resource requirements. Finally, all slots communicate via AXI in the global address space, i.e., DRAM. Hence, we set the number of slots per FPGA such that the total required DRAM bandwidth does not cause bus contention.
Fig. 2. Doing More with Less methodology and flow.

Fig. 3. (a) Example task graph. Each vertex in this graph represents a single task, and the directed edges between vertices represent task dependencies. An IP is created for each vertex, and the weight of each vertex is the latency of its corresponding IP. (b) Sample cut solutions for a large task graph. Each cut has a max size of seven vertices.
The use of uniform slots is a compromise we make to speed up design time, scale across all devices, and map multiple applications to the same device simultaneously. In cloud-like scenarios, where multiple applications may need to be mapped, rapid deployment and reduced physical design effort are very valuable. Thus, our architecture provides two key advantages. First, by employing fixed reconfiguration partitions, the designer does not need to perform floorplanning for each application, and can design accelerators with defined IO constraints and be assured that they will scale across devices. Second, the fixed slot sizes simplify the scheduling and mapping constraints, helping to deliver a scalable and deterministic design with uniform DPR latency. The use of fixed slot sizes may not guarantee the best utilization of the fabric, and it requires some thought and planning by the IP designer, as is the case in any IP design effort. However, we believe that the aforementioned benefits far outweigh the utilization benefits of using variable slot sizes. In addition, a different slot size can be selected to better suit the application(s).

Fig. 2 presents the overall flow and framework of our solution. For a given FPGA, we have an overlay architecture that determines the number of slots and the DPR latency, and for a given application that needs to be mapped to the FPGA, we have a task graph comprised of kernels. Note that this is not a computational overlay, such as a CGRA or a systolic array. Our architecture is flexible and allows us to deliver application- and task-specific specialization with high performance. We begin with the kernels in the task graph and generate IPs for them via high-level synthesis (HLS). We then use the reported latency of the IPs, the architectural parameters, the chosen level of scheduler optimization (pipelining and parallelism factor), and the task graph as inputs to our static ILP-based scheduler. Note that DML is not suitable for applications with IPs whose latency cannot be estimated prior to runtime. The scheduler then delivers a mapping solution and a DPR and IP execution schedule, which can be executed on the hardware by the dynamic runtime. The mapping solution and the DPR and IP execution schedules are represented by three components: (1) the global DPR order, which is a list of IPs in the order in which they are to be reconfigured onto the slots; (2) the IP-slot mappings, which map each IP to the physical slot on the device it will run on; and (3) the dependencies, where each IP has a list of dependencies extracted from the task graph and used by the runtime. We discuss the operation of this runtime in the next section.
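A minimal sketch of how these three components might be encoded for the runtime follows; the field names and the IP/slot values are hypothetical, as the paper does not prescribe a concrete format.

# Hypothetical encoding of the scheduler's output consumed by the runtime.
schedule = {
    # (1) Global DPR order: IPs in the order their bitstreams are loaded.
    "dpr_order": ["i", "j", "k", "l", "m"],
    # (2) IP-slot mappings: the physical slot each IP runs in.
    "slot_map": {"i": 0, "j": 1, "k": 2, "l": 3, "m": 0},
    # (3) Dependencies: IPs that must complete before each IP may start.
    "deps": {"i": [], "j": ["i"], "k": ["i"], "l": [], "m": ["j", "k", "l"]},
}

def ready(ip: str, done: set[str]) -> bool:
    """The runtime launches an IP once all of its dependencies are done."""
    return all(d in done for d in schedule["deps"][ip])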
In parallel, we use an automated version of the process described in [16] to generate partial bitstreams; we call this the Bitstream Generator. We use synthesized design checkpoints of the IPs and our custom overlay floorplan to generate partial bitstreams. We feed the partial bitstreams to a runtime that executes the applications based on the provided mapping and schedule. This runtime is implemented in software and runs on the processing system (PS) of the SoC-FPGA. Finally, note that without a slot mapping solution from our scheduler, users would need to design and generate bitstreams for every possible IP-slot pair to enable a dynamic scheduler to map any IP to any slot at runtime. This adds significant design effort and time overheads.

4 ILP BASED SCHEDULING

At the heart of the scheduler is our ILP formulation. Our goal is to find a high-quality solution while minimizing the traditional time costs of ILP-based solutions. Our ILP solution performs simultaneous scheduling and mapping, can provide an optimal solution on reasonable graph sizes, and takes into account our scalable architecture, which helps loosen the ILP constraints. Crucially, we consider real-world deployment constraints, and we include the ability to pipeline and parallelize tasks across batches. While our slot-based architecture helps simplify the ILP formulation, finding a solution for the ILP can be slow and does not scale well to large graphs. Hence, we use heuristic schedulers to help tune the ILP solver's search space, and we explore different partitioning strategies to find scheduling solutions when mapping large graphs. We also extend our framework to support two different solutions for mapping multiple applications to a single FPGA.

4.1 ILP Formulation

The input to the ILP is an application task graph, the IP latencies, the DPR latency, and the resource constraints. Our ILP simultaneously looks for a schedule, which provides the global DPR order, and a mapping solution. We formally describe the ILP formulation as follows:

Given: (1) A task graph G(V, E) as described in Section 3; (2) a set of scheduling constraints, C_s; and (3) a set of resource constraints, C_r. The scheduling constraints include dependencies inferred from the graph, the latency of nodes in the graph, and the DPR latency. In addition, we must ensure that only one partial reconfiguration is performed at a time. The resource constraint is the number of available slots, as provided by the user. Additionally, the user may select optimizations, such as pipelining or parallelization, to be included. We discuss these later in this section.
Goal: Minimize the latency of the entire task graph, such that each task's IP(s) are allocated a slot and experience the latency of reconfiguration prior to executing in the slot, and such that all constraints in C_s and C_r are satisfied.

ILP Variables: We will now define the variables that we solve for to find our solution. We define the set V as all the vertices in the given graph. The sets L and Lpr contain the execution latency of each node in V and the latency of reconfiguration, respectively. Then, we define the variables S and Spr as timestamps, where S ∈ Z are the start times of all IPs in the set V, and Spr ∈ Z are the start times of the corresponding partial reconfigurations of each node in V. Next, we describe our resource mapping variables. Let the number of available slots be R_s. We define a binary variable M_ik such that M_ik = 1 if the i-th IP, v_i, maps to the k-th slot, where v_i ∈ V and k ∈ R_s. Next, since any IP can be mapped to any slot, provided the slot is not occupied, we must express the resource sharing between IPs. We define the binary variable Y_ijk such that Y_ijk = 1 if the i-th and j-th IPs map to the k-th slot, where v_i ∈ V, v_j ∈ V, and k ∈ R_s. Finally, the variables B1_ijk, B2_ijk, and B3_ijk are Boolean decision variables that we use to help encode our overlap constraints; their values are determined by the ILP solver. We also add C_1, C_2, and C_3 as large enough constants, and discuss how to set them later in this section. Next, we describe the system of equations that formulates the constraints of our problem.

Legality Constraints: We encode the fundamental constraints of the system by defining the solution space. We must enforce bounds on start times, ensure that an IP maps to only one slot, and only allow IPs to share a slot one at a time. We begin by enforcing that the start times of all operations must be positive

$S_i > 0, \quad \forall i \in V$   (1)

while the slot-sharing indicators are tied to the mapping variables by

$Y_{ijk} \geq M_{ik} + M_{jk} - 1$   (4)

$Y_{ijk} \leq M_{ik}$   (5)

$Y_{ijk} \leq M_{jk}.$   (6)

Latency and Dependency Constraints: Next, we define our latency and edge dependency constraints. If there is an edge e_ij ∈ E from the i-th IP to the j-th IP, then the start time of the j-th IP must be greater than the sum of the start time of the i-th IP and its latency. Also, an IP can only start once its reconfiguration is complete. Thus, the start time of the i-th IP must be greater than the sum of the start time of the i-th IP's DPR and the latency of reconfiguration. Hence, we add

$S_j \geq S_i + L_i, \quad \forall e_{i,j} \in E$   (7)

$S_i \geq Spr_i + Lpr_i, \quad \forall i \in V.$   (8)

Overlap Constraints: Next, we define our overlap constraints, which ensure that DPR does not begin before the previous IP in the slot has completed, that only one DPR is performed at a time, and that IPs mapped to the same slot do not overlap their execution. Note that in (9) to (14) we use the variables B1_ijk, B2_ijk, and B3_ijk as a tool to help express an either/or inequality in a way that is amenable to ILP, and C_1, C_2, and C_3 are large enough constants that we set. We further explain and provide insight into these variables later in the section.

Our first constraint is added to enable DPR to overlap with computation. If the i-th and j-th IPs map to the same slot, and the DPR of the i-th IP takes place before the j-th IP, then the start time of the j-th IP must be greater than the end time of the i-th IP. Thus, for all pairs of i and j, i ∈ V, j ∈ V,

$Spr_i - S_j + C_1 \cdot B1_{ijk} \geq L_j \cdot Y_{ijk}, \quad \forall k \in R_s$   (9)

$Spr_j - S_i + C_1 \cdot (1 - B1_{ijk}) \geq L_i \cdot Y_{ijk}, \quad \forall k \in R_s.$   (10)

We must also ensure that if two IPs are mapped to the same slot, they cannot overlap their execution. Thus, if the i-th and j-th IPs map to the same slot, the start time of the i-th IP must be greater than the end time of the j-th IP, or vice versa. Hence we have

$S_i - S_j + C_2 \cdot B2_{ijk} \geq L_j \cdot Y_{ijk}, \quad \forall k \in R_s$   (11)

$S_j - S_i + C_2 \cdot (1 - B2_{ijk}) \geq L_i \cdot Y_{ijk}, \quad \forall k \in R_s.$   (12)

Constraints (13) and (14) take the same either/or form, using B3_ijk and C_3 to order the partial reconfigurations of any pair of IPs so that only one DPR is performed at a time.

As we mentioned earlier, in (9) to (14) we use the variables B1_ijk, B2_ijk, and B3_ijk as a tool to help express an either/or inequality in a way that is amenable to ILP. These binary variables encode precedence relationships between the configuration and execution of IPs i and j. For example, in (9), B1_ijk = 1 encodes that IP i is configured before IP j is run, while B1_ijk = 0 encodes that IP j is configured before IP i is run. Similarly, B2_ijk encodes the precedence relationship between the execution times of IPs i and j, while B3_ijk encodes the precedence relationship between the configuration times of IPs i and j. The ILP solver finds a solution for these variables in conjunction with all the other latency, legality, and overlap constraints.
Fig. 4. Graphical depictions of three different schedules using the task graph in Fig. 3a with varying IP and DPR latencies.
Objective Function: In order to create our objective function, we create a sink node, V_sink, such that for all v ∈ V there exists an edge between V_sink and v, and the start time of the sink node is S_sink. Thus, our objective function is to minimize the start time of the sink

$\min(S_{sink}).$   (15)
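To make the formulation concrete, the sketch below expresses constraints (1) and (4)-(12), a simplified one-DPR-at-a-time pair standing in for (13)-(14), and the sink objective (15) in PuLP. The task graph, latencies, slot count, and big-M value are hypothetical, and a single constant stands in for C_1, C_2, and C_3.

import pulp
from itertools import combinations

# Hypothetical inputs: a five-task graph, IP latencies, a uniform DPR
# latency, and two slots. One big-M constant stands in for C1-C3.
V = ["i", "j", "k", "l", "m"]
E = [("i", "j"), ("i", "k"), ("j", "m"), ("k", "m"), ("l", "m")]
L = {"i": 4, "j": 9, "k": 6, "l": 3, "m": 5}
Lpr, slots, M = 2, range(2), 100

prob = pulp.LpProblem("dml_schedule", pulp.LpMinimize)
# (1): execution start times are positive; each IP also has a DPR start time.
S = {v: pulp.LpVariable(f"S_{v}", lowBound=1, cat="Integer") for v in V}
Spr = {v: pulp.LpVariable(f"Spr_{v}", lowBound=0, cat="Integer") for v in V}
Map = {(v, k): pulp.LpVariable(f"M_{v}_{k}", cat="Binary") for v in V for k in slots}
pairs = list(combinations(V, 2))
Y = {(a, b, k): pulp.LpVariable(f"Y_{a}{b}{k}", cat="Binary") for a, b in pairs for k in slots}
B1 = {(a, b, k): pulp.LpVariable(f"B1_{a}{b}{k}", cat="Binary") for a, b in pairs for k in slots}
B2 = {(a, b, k): pulp.LpVariable(f"B2_{a}{b}{k}", cat="Binary") for a, b in pairs for k in slots}
B3 = {(a, b): pulp.LpVariable(f"B3_{a}{b}", cat="Binary") for a, b in pairs}

for v in V:
    prob += pulp.lpSum(Map[v, k] for k in slots) == 1  # each IP gets exactly one slot
    prob += S[v] >= Spr[v] + Lpr                       # (8): run only after own DPR
for i, j in E:
    prob += S[j] >= S[i] + L[i]                        # (7): task-graph dependencies
for i, j in pairs:
    # Simplified stand-in for (13)-(14): DPRs are serialized globally.
    prob += Spr[i] - Spr[j] + M * B3[i, j] >= Lpr
    prob += Spr[j] - Spr[i] + M * (1 - B3[i, j]) >= Lpr
    for k in slots:
        # (4)-(6): Y_ijk = 1 iff IPs i and j share slot k.
        prob += Y[i, j, k] >= Map[i, k] + Map[j, k] - 1
        prob += Y[i, j, k] <= Map[i, k]
        prob += Y[i, j, k] <= Map[j, k]
        # (9)-(10): in a shared slot, one IP's DPR waits for the other's run.
        prob += Spr[i] - S[j] + M * B1[i, j, k] >= L[j] * Y[i, j, k]
        prob += Spr[j] - S[i] + M * (1 - B1[i, j, k]) >= L[i] * Y[i, j, k]
        # (11)-(12): IPs sharing a slot never overlap their execution.
        prob += S[i] - S[j] + M * B2[i, j, k] >= L[j] * Y[i, j, k]
        prob += S[j] - S[i] + M * (1 - B2[i, j, k]) >= L[i] * Y[i, j, k]

# (15): minimize the start time of a sink node succeeding every vertex.
Ssink = pulp.LpVariable("S_sink", lowBound=0, cat="Integer")
for v in V:
    prob += Ssink >= S[v] + L[v]
prob += Ssink
prob.solve(pulp.PULP_CBC_CMD(msg=False))
for v in V:
    slot = [k for k in slots if Map[v, k].value() > 0.5][0]
    print(v, "slot", slot, "dpr@", Spr[v].value(), "run@", S[v].value())

Solving this with the bundled CBC solver returns, for each IP, its assigned slot along with its DPR and execution start times, i.e., the three components the runtime consumes.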
Leveraging Heuristics: To help express the conditional constraints in Eqs. (9) to (14), we introduced C_1, C_2, and C_3 as large enough constants, and introduced B1_ijk, B2_ijk, and B3_ijk as Boolean decision variables whose values are determined by the ILP solver. The constants help define the bounds of the solution space, and the value of each constant must be very close to the upper bound of the corresponding variable. Choosing too small a value might result in an infeasible problem. To help find the bound, we use a heuristic list-scheduler to provide a fast solution. Note that the list-scheduler does not consider mapping solutions, is not optimal, and does not consider all overlap constraints. So, we apply a safety margin to the result of the list-scheduler to determine the bound.
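A minimal sketch of such a bound-finding heuristic follows; the fully serial list-schedule (every IP reconfigured and run back-to-back) and the 1.5X safety margin are illustrative choices, not the authors' exact heuristic.

from graphlib import TopologicalSorter

# Hypothetical five-task graph: vertex -> IP latency, plus dependency edges.
latency = {"i": 4.0, "j": 9.0, "k": 6.0, "l": 3.0, "m": 5.0}
edges = [("i", "j"), ("i", "k"), ("j", "m"), ("k", "m"), ("l", "m")]

def big_m_bound(dpr_latency: float, margin: float = 1.5) -> float:
    # A topological pass that reconfigures and runs every IP serially gives
    # a feasible makespan; any valid schedule finishes no later, so a padded
    # version of it is a safe upper bound for the timestamp variables.
    preds = {v: {p for p, q in edges if q == v} for v in latency}
    finish, t = {}, 0.0
    for v in TopologicalSorter(preds).static_order():
        deps_done = max((finish[p] for p in preds[v]), default=0.0)
        t = max(t, deps_done) + dpr_latency + latency[v]  # DPR, then run
        finish[v] = t
    return margin * t  # safety margin, since the heuristic ignores mapping

print(big_m_bound(dpr_latency=2.0))  # padded makespan, safe for C1-C3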
Illustrative Example: To help illustrate the operation of the scheduler, we consider the task graph shown in Fig. 3a and present the resulting mappings and schedules from our scheduler in three different scenarios with different IP and DPR latencies in Fig. 4. For brevity, we restrict our example to four slots, and we consider scenarios where: (1) DPR latency dominates execution time (Fig. 4a), (2) DPR latency is easily masked by IP execution latencies (Fig. 4b), and (3) DPR and IP latencies have varied ratios (Fig. 4c). For consistency, we use the same notation as presented in Section 4.1.

Fig. 4 presents the solutions generated by our ILP formulation. The generated solutions illustrate that our static ILP-based scheduler will always try to minimize the total execution time. Note that the ILP-based scheduler can find optimal solutions for reasonable problem sizes. However, given that our target objective is to find the best performing order and mapping, it is possible that multiple order-mapping solutions provide the best performance. Thus, the delivered solution may not be intuitive. For example, in Fig. 4a, we can see that the scheduler has opted not to use all four available slots, as it can achieve the minimum latency with just three. This is because IP m is dependent on IPs j, k, and l, and thus i and k form the critical path, such that their execution latencies are easily able to mask the DPR latency. Moving IP l or IP m into slot 0 would not have improved performance, but it would be another valid solution. Meanwhile, in Fig. 4b, we consider a situation where IP j is on the critical path, but the latency of DPR and of the remaining IPs remains the same as in Fig. 4a. Here we can see Eqs. (11)-(12) in action: they allow IPs i, j, and l to overlap their execution, while Eqs. (9)-(10) help allow i, k, l, and m to overlap their DPR with IP executions. Note that all four slots are used in this example. Finally, in Fig. 4c, we consider a situation where DPR latencies dominate, and thus Eqs. (13)-(14) are key in enforcing that DPRs do not overlap, while Eqs. (9)-(10) improve performance by allowing DPR and IP execution to overlap.

4.2 Additional Support for Computational Workloads

Batching is commonly used in computational workloads, since each kernel needs to be run multiple times for a variety of inputs. It can also help amortize the cost of reconfiguration in some cases. We initially consider batching via three approaches: (1) Bulk batching, wherein a single instance of the IP is re-used for each entry in the batch. For a batch of size N, we scale the latency of each IP by a factor of N, thereby serializing the batch but increasing the time each IP must remain configured on the device before it can be swapped for another. (2) Parallel batching, wherein multiple instances can run batch entries in parallel. We replicate the graph N times, thereby creating N parallel instances, which can potentially lead to better slot utilization but blows up the size of the ILP problem, making it harder to find a solution. Note that both of these approaches are possible with our described ILP formulation. (3) Pipelining across batches, wherein, like bulk batching, each IP's latency is scaled by a factor of N; however, we allow dependent IPs to overlap their execution, since the dependencies exist within a batch entry only. Thus, each pipeline stage is an IP operating on a separate entry in the batch. In order to do so, we extend our ILP formulation. If there is an edge e_ij ∈ E from the i-th IP to the j-th IP, then the start time of the j-th IP must be greater than the sum of the start time of the i-th IP and the latency of one entry in the batch. Also, the last batch entry of the j-th IP must begin only after the last batch entry of the i-th IP has completed. Thus we have

$S_j \geq S_i + L_i / N, \quad \forall i, j \in V$   (16)

$S_j + (N - 1) \cdot L_j / N \geq S_i + L_i, \quad \forall i, j \in V.$   (17)

Note that we assume that the latency of the i-th IP, L_i, has already been scaled by a factor of N. However, should the latency of the j-th IP be much smaller than that of the i-th IP, we must ensure that it still completes after the i-th IP and respects the dependency

$S_j + L_j \geq S_i + L_i, \quad \forall i, j \in V.$   (18)
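In a solver, this extension amounts to swapping the per-edge dependency constraint (7) for Eqs. (16)-(18); a minimal PuLP sketch with a hypothetical two-stage pipeline and batch-scaled latencies follows.

import pulp

# Pipelining across batches: replace the per-edge constraint (7) with
# Eqs. (16)-(18). L[v] is assumed to be the bulk latency, i.e., already
# scaled by the batch size N.
N = 8                                   # hypothetical batch size
V = ["a", "b"]                          # two dependent pipeline stages
E = [("a", "b")]
L = {"a": 40, "b": 16}                  # batch-scaled latencies (N entries)

prob = pulp.LpProblem("pipelined_batch", pulp.LpMinimize)
S = {v: pulp.LpVariable(f"S_{v}", lowBound=1, cat="Integer") for v in V}

for (i, j) in E:
    prob += S[j] >= S[i] + L[i] / N                    # (16): offset of one batch entry
    prob += S[j] + (N - 1) * L[j] / N >= S[i] + L[i]   # (17): last-entry ordering
    prob += S[j] + L[j] >= S[i] + L[i]                 # (18): j still finishes after i

prob += S["b"] + L["b"]                 # minimize completion of the final stage
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({v: S[v].value() for v in V})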
Fig. 10. Impact of batching strategies. Relative speedup shown for batch sizes 4 and 32 on a Zedboard, as predicted by the scheduler. No DPR solutions cannot fit a single instance of IMGC, OF, and AL4 on a Zedboard. Note: 8-way parallel batching cannot be done on a batch of 4.

Fig. 11. Impact of batching strategies. Relative speedup shown for batch sizes 4 and 32 on a ZCU106, as predicted by the scheduler. DR on a batch of 32 in the No DPR scenario is clipped for presentation, and it extends to 31.8X. Note: 8-way parallel batching cannot be done on a batch of 4.

Fig. 12. Impact of different batching strategies. Slot utilization shown for batch sizes 4 and 32 on a Zedboard, as predicted by the scheduler.

Fig. 13. Impact of different batching strategies. Slot utilization shown for batch sizes 4 and 32 on a ZCU106, as predicted by the scheduler.

the lack of available slots and graph partitioning. In contrast, we see up to 6.8X speedup on the ZCU106 with the help of eight-way parallel batching, and 4.15X on average. Thus, our scheduler is able to effectively utilize the available FPGA resources and the parallelism in the graphs and batches.

We also note that the relatively limited resources of the Zedboard do not allow many of the applications to map to it. Thus, we see that the No DPR solution was unable to provide a solution for OF, IMGC, and AL4. This further highlights the need for our DML strategy, which allows compute to be mapped efficiently to any device. In cases where the applications do fit, our pipelining and parallelization approaches are able to perform better through better utilization of the FPGA resources. On the ZCU106, which has significantly more resources, we see that for small batch sizes it might be advantageous to simply instantiate copies of the application (No DPR) if the application is very small, such as DR. However, given enough parallelism, we see that our scheduler is still able to better utilize the FPGA resources, even on the ZCU106.

Next, we examine the effective utilization of resources by our methodology by considering the slot utilization. Effective slot utilization measures what percentage of the available slots were used, on average, over the run of the application. A utilization of 100% would imply all slots were used across the entire run. Figs. 12 and 13 present our analysis.
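One plausible formalization of this metric, computed from a schedule's per-slot busy intervals, is sketched below; the interval data is hypothetical.

# Effective slot utilization: the fraction of slot-time occupied by DPR or
# IP execution over the application's run, averaged across all slots.

def slot_utilization(busy: dict[int, list[tuple[float, float]]], makespan: float) -> float:
    used = sum(end - start for spans in busy.values() for start, end in spans)
    return used / (len(busy) * makespan)

# Hypothetical 3-slot schedule over a 100-cycle run (intervals include DPR).
busy = {0: [(0, 60), (70, 100)], 1: [(10, 90)], 2: [(20, 50)]}
print(f"{slot_utilization(busy, 100.0):.0%}")  # 67%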
Once again we see the effectiveness of our scheduling solutions. On the Zedboard, pipelining alone is able to keep our utilization at 50% on average, while enabling parallelization brings it up to 84% on average, and up to 99%. The ZCU106 has more resources and can be harder to keep busy for small applications and batch sizes. However, with the help of parallelization, we are able to effectively utilize 53% on average, and up to 88%. Note that the parallelization approach forces the scheduler to partition and perform localized scheduling, which limits the efficiency.

5.3.1 HW Evaluation

Having demonstrated the efficacy of our scheduler and its different batching strategies, we will now evaluate them on real systems. Figs. 14 and 15 show the speedup with the same baseline for bulk-batching, pipelining, four-way parallel batching, and eight-way parallel batching on the Zedboard and ZCU106, respectively. Once again, we observe that pipelining and parallelism can greatly increase the performance of partial reconfiguration applications on our hardware implementation. On the Zedboard for batch 32, we observe an average speedup of 2.87X across all applications and a max speedup of 3.98X. On the ZCU106 for batch 32, we observe an average speedup of 4.99X and a max speedup of 7.65X. Here we observe that the speedup seen in the hardware implementation is higher than that predicted by the scheduler. For a batch size of 32, the average speedup across all applications as measured on hardware is 1.04X and 1.12X higher than that predicted by the scheduler for the Zedboard and ZCU106, respectively. This can be attributed to the IPs running slower in hardware than predicted by Vivado HLS.
Thus, DPR latency is shorter relative to the total runtime of the IP in hardware. This reduces the relative overhead of DPR, making it easier to hide. Hence, the latency predicted by the scheduler is longer, which results in the predicted speedup being more pessimistic. In the case of our parallelization strategies, which provide the best performance, frequent PRs must be performed, and thus the average speedup predicted by the scheduler is slightly lower than what we observe in hardware.

Comparing Figs. 14 and 15 with Figs. 10 and 11 shows that the performance trends are the same in both the scheduler and the hardware. In all but two instances, the best performing optimization, as predicted by the scheduler, is the best optimization in hardware, on all boards and batch sizes. This means that the scheduler's performance model is able to effectively model the performance of the hardware implementation. The two discrepancies are LeNet with a batch of 4 on the Zedboard, and LeNet with a batch of 32 on the ZCU106. This is because LeNet contains small IPs which have a very low latency, making it more difficult to hide the latency of DPR. As mentioned previously, the cost of DPR in the scheduler's performance model is higher than that in the hardware. As LeNet already has difficulty masking the latency of DPR, in both discrepancies the scheduler predicts that a solution with less DPR (pipelining with batch 4, and four-way parallel batching with batch 32) would be more performant. However, we see in the hardware that LeNet is actually able to hide the DPR latency, and it is most performant with the maximum amount of parallelism available for either batch size.

Fig. 14. Impact of different batching strategies. Relative speedup shown for batch sizes 4 and 32 on a Zedboard, as measured on hardware. Note: 8-way parallel batching cannot be done on a batch of 4.

Fig. 15. Impact of different batching strategies. Relative speedup shown for batch sizes 4 and 32 on a ZCU106, as measured on hardware. Note: 8-way parallel batching cannot be done on a batch of 4.

Fig. 16. Impact of slot size on performance. Relative speedup shown for batch sizes 4 and 32 on a ZCU106 with 4-way parallel batching, for slot sizes 1X and 2X.

Fig. 17. Impact of slot size on performance. Relative speedup shown for batches of 4 and 32 on a ZCU106 with pipelining, for slot sizes 1X and 2X.

5.4 Choosing Slot Size

So far in this section we have only considered 1X sized slots. We will now explore the impact of choosing a larger slot size. Figs. 16 and 17 show the speedups for both 1X and 2X slot sizes on the ZCU106 when performing pipelining and four-way parallel batching, respectively. The 1X slot design uses ten slots, while the 2X design uses four slots. As one can see, using 1X slots always achieves a higher or the same speedup when compared to the 2X slots. This is for several reasons. First, not all IPs are able to fill the 2X slot, wasting precious resources. Second, 1X gives more flexibility for how IPs can be mapped across slots in time and space, as there are more IPs and more slots. This gives the ILP scheduler a larger space in which to find a high-performance schedule. Third, 2X slots take almost twice as long to reconfigure, thereby increasing the impact of DPR latency. Finally, 1X slots give more fine-grained specialization than 2X, allowing each IP to be more specialized to the specific computation it is performing.

5.5 Scalability and Multiple Applications

We now demonstrate the scalability of our solution by using our scheduler to simultaneously map ten applications (the six applications previously mentioned plus four synthetic benchmarks) to a single FPGA, across varying batch, resource, and slot sizes, and we evaluate our fair-cut and sorted-cut methods. Table 1 presents the scale of the problem for a batch size of 32, with eight available slots and pipelining enabled. Since our scheduler runs the ILP on several cuts of the graph, we present the average number of ILP variables that are solved, along with the max and min. We also list the total time taken to find the final solution, as well as the total time spent solving the ILP alone.
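For illustration, the sketch below forms cuts of at most seven vertices in topological order, mirroring the cut size shown in Fig. 3b; it is a generic bounded-size cutter under that assumption, not the paper's specific fair-cut or sorted-cut policy.

from graphlib import TopologicalSorter

# Partition a large task graph into cuts of bounded size so each cut can be
# scheduled by the ILP independently; cut boundaries follow topological
# order, so no dependency points backwards into a later cut.
def cut_graph(latency: dict[str, float], edges: list[tuple[str, str]],
              max_cut: int = 7) -> list[list[str]]:
    preds = {v: {p for p, q in edges if q == v} for v in latency}
    order = list(TopologicalSorter(preds).static_order())
    return [order[i:i + max_cut] for i in range(0, len(order), max_cut)]

# Hypothetical 16-task chain split into cuts of at most seven vertices.
lat = {f"t{n}": 1.0 for n in range(16)}
eds = [(f"t{n}", f"t{n+1}") for n in range(15)]
print(cut_graph(lat, eds))  # three cuts: 7, 7, and 2 vertices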
TABLE 1
Problem Size for Mapping Multiple Applications

Slot Size   Total (s)   ILP (s)   Nodes   Max Var   Min Var   Avg Var
1X          768.3       23.4      178     3735      2808      3657.5
2X          318.9       22.1      83      3735      1068      3290.5

TABLE 2
Mapping Multiple Applications

                            1X                2X
Batch Size   Slots    Fair    Sorted    Fair    Sorted
Speedup
4            4        0.93x   0.93x     1.51x   1.49x
             8        1.16x   1.15x     2.14x   2.36x
             10       1.21x   1.15x     2.48x   2.44x
             16       1.22x   1.19x     2.57x   2.49x
16           4        1.15x   1.15x     1.51x   1.50x
             8        1.47x   1.43x     2.13x   2.42x
             10       1.52x   1.44x     2.48x   2.52x
             16       1.52x   1.49x     2.55x   2.53x
32           4        1.16x   1.16x     1.51x   1.50x
             8        1.48x   1.43x     2.13x   2.43x
             10       1.53x   1.45x     2.48x   2.53x
             16       1.53x   1.50x     2.55x   2.54x
Slot Utilization
4            4        0.64    0.63      0.89    0.86
             8        0.41    0.41      0.61    0.67
             10       0.34    0.33      0.59    0.56
             16       0.21    0.21      0.38    0.35
16           4        0.66    0.64      0.88    0.86
             8        0.43    0.43      0.60    0.68
             10       0.37    0.35      0.59    0.57
             16       0.23    0.23      0.38    0.36
32           4        0.66    0.64      0.88    0.86
             8        0.44    0.43      0.60    0.68
             10       0.37    0.35      0.59    0.58
             16       0.23    0.23      0.39    0.36

TABLE 3
Speedups on Hardware and Scheduler Over Coarse-Grained Multi-App Scheduling for Four Apps

             Fair                        Sorted
Batch    Sched    HW                 Sched    HW
4        1.30x    1.34x              1.44x    1.28x
16       1.40x    1.30x              1.64x    1.48x
32       1.42x    Out-of-Memory      1.68x    Out-of-Memory

Table 2 presents the normalized speedup of mapping multiple applications onto a single FPGA, as predicted by the scheduler. We consider FPGAs ranging from a small edge-scale device with four slots to a large cloud-scale device with sixteen slots, and we consider slot sizes of 1X and 2X and batch sizes of 4 to 32. Here we consider coarse-grained multi-app scheduling, as discussed in Section 4.3, to be our baseline, wherein each application executes serially, incurring a one-time reconfiguration cost per application; we assume that there are enough resources for the entire application to fit. Using this baseline, we present the speedup, as reported by our scheduler, for the sorted-cut and fair-cut schemes.

We observe that our schedule is faster than the baseline in almost every case. This is in part due to our scheduler's ability to pipeline and overlap the execution of IPs, even within smaller graph cuts. Here we note that the sorted-cut and fair-cut strategies perform similarly. In this group of applications, a few select applications dominate the end-to-end runtime. Thus, no matter how we perform the graph cuts, the critical path is determined by the same set of vertices. Also, we can see that for larger batch sizes and more slots, the effective slot utilization is poor. This is due to the large disparity in the task graphs. Larger graphs, with long latencies and limited task-parallelism, consume the tail end of the schedule and only require one or two slots.

Once again, we validate our scheduler by testing it in hardware. We consider four applications, OF, LN, AL4, and 3DR, running simultaneously on the ZCU106 with ten 1X slots. We present the end-to-end speedup as measured on the board versus the scheduler prediction in Table 3. As we can see, the hardware performance once again matches the predicted scheduler performance. Note that the ZCU106 did not have sufficient memory (off-chip DRAM) to host all four applications with a batch size of 32. In addition, unlike our ten-application experiment, the sorted-cut strategy outperforms the fair-cut strategy, as the end-to-end latency is not dominated by a single application in the tail end.

5.6 Dependent versus Independent Scheduling for Multiple Applications

We will now explore the impact of different scheduling approaches for multiple applications. As discussed in Section 4.3, the DML framework allows for two multi-application scheduling schemes: independent scheduling and dependent scheduling. We use both scheduling schemes to generate multi-app schedules for four concurrently running applications: LeNet, AlexNet, Optical Flow, and 3D Rendering, and we consider both the total end-to-end latency of all four applications and the end-to-end latency of each application. We run our experiments on the ZCU106 board with ten 1X slots and a batch size of 16. For independent scheduling, we consider three different allocations of slots per application. As discussed in Section 4.3, in dependent scheduling the scheduler will allocate slots to IPs from the global pool and try to optimize the total end-to-end latency.

Fig. 18 shows the end-to-end speedup of the independent schedule over the dependent schedule for the total runtime of all applications and the per-application runtimes. The per-application slot allocations are shown in the legend. Independent scheduling always has worse end-to-end latency, and on average it increases the end-to-end latency by 1.74X. This is because dependent scheduling optimizes over a monolithic graph across all available slots, so it can prioritize the applications with the longest latency. However, dependent scheduling only optimizes the end-to-end
Wei Zuo received the MS degree from the Electrical and Computer Engineering Department, University of Illinois, Urbana-Champaign, where she is currently working toward the PhD degree with the Electrical and Computer Engineering Department. Her research interests include hardware-software co-design, developing automated frameworks for accurate system-level modeling, and efficient hardware-software partitioning for applications mapped to SoC platforms.

Xiaohao Wang received the BS and MS degrees from the University of Illinois, Urbana-Champaign. His research focuses on hardware systems.

Nam Sung Kim (Fellow, IEEE) is currently a professor with the University of Illinois, Urbana-Champaign. He has authored or coauthored more than 200 refereed articles in highly selective conferences and journals in the fields of digital circuits, processor architecture, and computer-aided design. His top three most frequently cited papers have more than 4,000 citations, and the total number of citations of all his papers exceeds 11,000. He was the recipient of the IEEE International Symposium on Microarchitecture (MICRO) Best Paper Award in 2003, the NSF CAREER Award in 2010, and the ACM/IEEE Most Influential International Symposium on Computer Architecture (ISCA) Paper Award in 2017. He is a hall of fame member of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), MICRO, and ISCA. He is a fellow of ACM.

Deming Chen (Fellow, IEEE) received the BS degree in computer science from the University of Pittsburgh in 1995, and the MS and PhD degrees in computer science from the University of California, Los Angeles, in 2001 and 2005, respectively. He is currently the Abel Bliss Professor of Engineering with the Electrical and Computer Engineering Department, University of Illinois, Urbana-Champaign (UIUC). His research interests include system-level and high-level synthesis, machine learning, computational genomics, reconfigurable computing, and hardware security. He was the recipient of UIUC's Arnold O. Beckman Research Award, the NSF CAREER Award, nine best paper awards, the ACM SIGDA Outstanding New Faculty Award, and the IBM Faculty Award. He is an ACM Distinguished Speaker and the editor-in-chief of ACM Transactions on Reconfigurable Technology and Systems.