
Future Generation Computer Systems 112 (2020) 162–175


Developing accurate and scalable simulators of production workflow management systems with WRENCH

Henri Casanova a,∗, Rafael Ferreira da Silva b,c,∗∗, Ryan Tanaka a,b, Suraj Pandey a, Gautam Jethwani c, William Koch a, Spencer Albrecht c, James Oeth c, Frédéric Suter d

a Information and Computer Sciences, University of Hawaii, Honolulu, HI, USA
b University of Southern California, Information Sciences Institute, Marina del Rey, CA, USA
c University of Southern California, Department of Computer Science, Los Angeles, CA, USA
d IN2P3 Computing Center, CNRS, Villeurbanne, France

Article history: Received 1 July 2019; Received in revised form 28 April 2020; Accepted 20 May 2020; Available online 26 May 2020

Keywords: Scientific workflows; Workflow management systems; Simulation; Distributed computing

Abstract

Scientific workflows are used routinely in numerous scientific domains, and Workflow Management Systems (WMSs) have been developed to orchestrate and optimize workflow executions on distributed platforms. WMSs are complex software systems that interact with complex software infrastructures. Most WMS research and development activities rely on empirical experiments conducted with full-fledged software stacks on actual hardware platforms. These experiments, however, are limited to the hardware and software infrastructures at hand and can be labor- and/or time-intensive. As a result, relying solely on real-world experiments impedes WMS research and development. An alternative is to conduct experiments in simulation. In this work we present WRENCH, a WMS simulation framework, whose objectives are (i) accurate and scalable simulations; and (ii) easy simulation software development. WRENCH achieves its first objective by building on the SimGrid framework. While SimGrid is recognized for the accuracy and scalability of its simulation models, it only provides low-level simulation abstractions, and thus large software development efforts are required when implementing simulators of complex systems. WRENCH thus achieves its second objective by providing high-level and directly re-usable simulation abstractions on top of SimGrid. After describing and giving rationales for WRENCH’s software architecture and APIs, we present two case studies in which we apply WRENCH to simulate the Pegasus production WMS and the WorkQueue application execution framework. We report on ease of implementation, simulation accuracy, and simulation scalability so as to determine to which extent WRENCH achieves its objectives. We also draw both qualitative and quantitative comparisons with a previously proposed workflow simulator.

© 2020 Published by Elsevier B.V.

∗ Correspondence to: University of Hawaii Information and Computer Sciences, POST Building, Rm 317, 1680 East-West Road, Honolulu, HI, 96822, USA.
∗∗ Correspondence to: USC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA, 90292, USA.
E-mail addresses: [email protected] (H. Casanova), [email protected] (R. Ferreira da Silva), [email protected] (R. Tanaka), [email protected] (S. Pandey), [email protected] (G. Jethwani), [email protected] (W. Koch), [email protected] (S. Albrecht), [email protected] (J. Oeth), [email protected] (F. Suter).
https://doi.org/10.1016/j.future.2020.05.030
0167-739X/© 2020 Published by Elsevier B.V.

1. Introduction

Scientific workflows have become mainstream in support of research and development activities in numerous scientific domains [1]. Consequently, several Workflow Management Systems (WMSs) have been developed [2–7] that allow scientists to execute workflows on distributed platforms that can accommodate executions at various scales. WMSs handle the logistics of workflow executions and make decisions regarding resource selection, data management, and computation scheduling, the goal being to optimize some performance metric (e.g., latency [8,9], throughput [10,11], jitter [12], reliability [13–15], power consumption [16,17]). WMSs are complex software systems that interact with complex software infrastructures and can thus employ a wide range of designs and algorithms.

In spite of active WMS development and use in production, which has entailed solving engineering challenges, fundamental questions remain unanswered in terms of system designs and algorithms. Although there are theoretical underpinnings for most of these questions, theoretical results often make assumptions that do not hold with production hardware and software infrastructures. Further, the specifics of the design of a WMS can impose particular constraints on what solutions can be implemented effectively, and these constraints are typically not
considered in available theoretical results. Consequently, current research that aims at improving and evolving the state of the art, although sometimes informed by theory, is mostly done via ‘‘real-world’’ experiments: designs and algorithms are implemented, evaluated, and selected based on experiments conducted for a particular WMS implementation with particular workflow configurations on particular platforms. As a corollary, from the WMS user’s perspective, quantifying accurately how a WMS would perform for a particular workflow configuration on a particular platform entails actually executing that workflow on that platform.

Unfortunately, real-world experiments have limited scope, which impedes WMS research and development. This is because they are confined to the application and platform configurations available at hand, and thus cover only a small subset of the relevant scenarios that may be encountered in practice. Furthermore, exclusively relying on real-world experiments makes it difficult or even impossible to investigate hypothetical scenarios (e.g., ‘‘What if the network had a different topology?’’, ‘‘What if there were 10 times more compute nodes but they had half as many cores?’’). Real-world experiments, especially when large-scale, are often not fully reproducible due to shared networks and compute resources, and due to transient or idiosyncratic behaviors (maintenance schedules, software upgrades, and particular software (mis)configurations). Running real-world experiments is also time-consuming, thus possibly making it difficult to obtain statistically significant numbers of experimental results. Real-world experiments are driven by WMS implementations that often impose constraints on workflow executions. Furthermore, WMSs are typically not monolithic but instead reuse CyberInfrastructure (CI) components that impose their own overheads and constraints on workflow execution. Exploring what lies beyond these constraints via real-world executions, e.g., for research and development purposes, typically entails unacceptable software (re-)engineering costs. Finally, running real-world experiments can also be labor-intensive. This is due to the need to install and execute many full-featured software stacks, including actual scientific workflow implementations, which is often not deemed worthwhile for ‘‘just testing out’’ ideas.

An alternative to conducting WMS research via real-world experiments is to use simulation, i.e., to implement a software artifact that models the functional and performance behaviors of the software and hardware stacks of interest. Simulation is used in many computer science domains and can address the limitations of real-world experiments outlined above. Several simulation frameworks have been developed that target the parallel and distributed computing domain [18–34]. Some simulation frameworks have also been developed specifically for the scientific workflow domain [11,35–40].

We claim that advances in simulation capabilities in the field have made it possible to simulate WMSs that execute large workflows using diverse CI services deployed on large-scale platforms in a way that is accurate (via validated simulation models), scalable (fast execution and low memory footprint), and expressive (ability to describe arbitrary platforms, complex WMSs, and complex software infrastructures). In this work, we build on the existing open-source SimGrid simulation framework [33,41], which has been one of the drivers of the above advances and whose simulation models have been extensively validated [42–46], to develop a WMS simulation framework called WRENCH [47]. More specifically, this work makes the following contributions¹:

1. We justify the need for WRENCH and explain how it improves on the state of the art.
2. We describe the high-level simulation abstractions provided by WRENCH that (i) make it straightforward to implement full-fledged simulated versions of complex WMS systems; and (ii) make it possible to instantiate simulation scenarios with only a few lines of code.
3. Via two case studies with the Pegasus [2] production WMS and the WorkQueue [49] application execution framework, we evaluate the ease-of-use, accuracy, and scalability of WRENCH, and compare it with a previously proposed simulator, WorkflowSim [35].

This paper is organized as follows. Section 2 discusses related work. Section 3 outlines the design of WRENCH and describes how its APIs are used to implement simulators. Section 4 presents our case studies. Finally, Section 5 concludes with a brief summary of results and a discussion of future research directions.

¹ A preliminary shorter version of this paper appears in the proceedings of the 2018 Workshop on Workflows in Support of Large-Scale Science (WORKS) [48].

2. Related work

Many simulation frameworks have been developed for parallel and distributed computing research and development. They span domains such as HPC [18–21], Grid [22–24], Cloud [25–27], Peer-to-peer [28,29], or Volunteer Computing [30–32]. Some frameworks have striven to be applicable across some or all of the above domains [33,34]. Two conflicting concerns are accuracy (the ability to capture the behavior of a real-world system with as little bias as possible) and scalability (the ability to simulate large systems with as few CPU cycles and bytes of RAM as possible). The aforementioned simulation frameworks achieve different compromises between these two concerns by using various simulation models. At one extreme are discrete event models that simulate the ‘‘microscopic’’ behavior of hardware/software systems (e.g., by relying on packet-level network simulation for communication [50], or on cycle-accurate CPU simulation [51] or emulation for computation). In this case, the scalability challenge can be handled by using Parallel Discrete Event Simulation [52], i.e., the simulation itself is a parallel application that requires a parallel platform whose scale is at least commensurate to that of the simulated platform. At the other extreme are analytical models that capture ‘‘macroscopic’’ behaviors (e.g., transfer times as data sizes divided by bottleneck bandwidths, compute times as numbers of operations divided by compute speeds). While these models are typically more scalable, they must be developed with care so that they are accurate. In previous work, it has been shown that several available simulation frameworks use macroscopic models that can exhibit high inaccuracy [43].

A number of simulators have been developed that target scientific workflows. Some of them are stand-alone simulators [11,35–37,53]. Others are integrated with a particular WMS to promote more faithful simulation and code re-use [38,39,54] or to execute simulations at runtime to guide on-line scheduling decisions made by the WMS [40].

The authors in [39] conduct a critical analysis of the state of the art of workflow simulators. They observe that many of these simulators do not capture the details of underlying infrastructures and/or use naive simulation models. This is the case with custom simulators such as those in [36,37,40]. But it is also the case with workflow simulators built on top of generic simulation frameworks that provide convenient user-level abstractions but fail to model the details of the underlying infrastructure, e.g., the simulators in [11,35,38], which build on the CloudSim [25] or GroudSim [24] frameworks. These frameworks have been shown to lack in their network modeling capabilities [43]. As a result, some authors readily recognize that their simulators are likely only valid when network effects play a small role in workflow executions (i.e., when workflows are not data-intensive).
To overcome the above limitations, in [39,54] the authors
have improved the network model in GroudSim and also use a
separate simulator, DISSECT-CF [27], for simulating cloud infras-
tructures accurately. The authors acknowledge that the popular
SimGrid [33,41] simulation framework offers compelling capabil-
ities, both in terms of scalability and simulation accuracy. But one
of their reasons for not considering SimGrid is that, because it is
low-level, using it to implement a simulator of a complex system,
such as a WMS and the CI services it uses, would be too labor-
intensive. In this work, we address this issue by developing a
simulation framework that provides convenient, reusable, high-
level abstractions but that builds on SimGrid so as to benefit
from its scalable and accurate simulation models. Furthermore,
unlike [38,39,54], we do not focus on integration with any specific
WMS. The argument in [39] is that stand-alone simulators, such
as that in [35], are disconnected from real-world WMSs because
they abstract away much of the complexity of these systems.
Instead, our proposed framework does capture low-level system
details (and simulates them well thanks to SimGrid), but provides
high-level enough abstractions to implement faithful simulations
of complex WMSs with minimum effort, which we demonstrate
via two case studies.
Also related to this work is previous research that has not focused on providing simulators or simulation frameworks per se, but instead on WMS simulation methodology. In particular, several authors have investigated methods for injecting realistic stochastic noise into simulated WMS executions [35,55]. These techniques can be adopted by most of the aforementioned frameworks, including the one proposed in this work.

3. WRENCH

3.1. Objective and intended users

WRENCH’s objective is to make it possible to study WMSs in simulation in a way that is accurate (faithful modeling of real-world executions), scalable (low computation and memory footprints on a single computer), and expressive (ability to simulate arbitrary WMS, workflow, and platform scenarios with minimal software engineering effort). WRENCH is not a simulator but a simulation framework that is distributed as a C++ library. It provides high-level reusable abstractions for developing simulated WMS implementations and simulators for the execution of these implementations. There are two categories of WRENCH users:

1. Users who implement simulated WMSs — These users are engaged in WMS research and development activities and need an ‘‘in simulation’’ version of their current or intended WMS. Their goals typically include evaluating how their WMS behaves over hypothetical experimental scenarios and comparing competing algorithm and system design options. For these users, WRENCH provides the WRENCH Developer API (described in Section 3.4), which eases WMS development by removing the typical difficulties involved when developing, either in real-world or in simulation mode, a system comprised of distributed components that interact both synchronously and asynchronously. To this end, WRENCH makes it possible to implement a WMS as a single thread of control that interacts with simulated CI services via high-level APIs and must react to only a small set of asynchronous events.

2. Users who execute simulated WMSs — These users simulate how given WMSs behave for particular workflows on particular platforms. Their goals include comparing different WMSs, determining how a given WMS would behave for various workflow configurations, comparing different platform and resource provisioning options, determining performance bottlenecks, engaging in pedagogic activities centered on distributed computing and workflow issues, etc. These users can develop simulators via the WRENCH User API (described in Section 3.5), which makes it possible to build a full-fledged simulator with only a few lines of code.

Users in the first category above often also belong to the second category. That is, after implementing a simulated WMS, these users typically instantiate simulators for several experimental scenarios to evaluate their WMS.

3.2. Software architecture overview

Fig. 1. The four layers in the WRENCH architecture, from bottom to top: simulation core, simulated core services, simulated WMS implementations, and simulators.

Fig. 1 depicts WRENCH’s software architecture. At the bottom layer is the Simulation Core, which simulates low-level software and hardware stacks using the simulation abstractions and models provided by SimGrid (see Section 3.3). The next layer implements simulated CI services that are commonly found in current distributed platforms and used by production WMSs. At the time of this writing, WRENCH provides services in four categories: compute services that provide access to compute resources to execute workflow tasks; storage services that provide access to storage resources for storing workflow data; network monitoring services that can be queried to determine network distances; and data registry services that can be used to track the location of (replicas of) workflow data. Each category includes multiple service implementations, so as to capture the specifics of currently available CI services used in production. For instance, WRENCH includes a ‘‘batch-scheduled cluster’’ compute service, a ‘‘cloud’’ compute service, and a ‘‘bare-metal’’ compute service. The layer above consists of simulated WMSs, which interact with CI services using the WRENCH Developer API (see Section 3.4). These WMS implementations, which can simulate
production WMSs or WMS research prototypes, are not included as part of the WRENCH distribution, but are implemented as stand-alone projects. Two such projects are the simulated Pegasus and WorkQueue implementations used for our case studies in Section 4. Finally, the top layer consists of simulators that configure and instantiate particular CI services and particular WMSs on a given simulated hardware platform, that launch the simulation, and that analyze the simulation outcome. These simulators use the WRENCH User API (see Section 3.5). Here again, these simulators are not part of WRENCH, but are implemented as stand-alone projects.

3.3. Simulation core

WRENCH’s simulation core is implemented using SimGrid’s S4U API, which provides all necessary abstractions and models to simulate computation, I/O, and communication activities on arbitrary hardware platform configurations. These platform configurations are defined by XML files that specify network topologies and endpoints, compute resources, and storage resources [56].

At its most fundamental level, SimGrid provides a Concurrent Sequential Processes (CSP) model: a simulation consists of sequential threads of control that consume hardware resources. These threads of control can implement arbitrary code, exchange messages via a simulated network, perform computation on simulated (multicore) hosts, and perform I/O on simulated storage devices. In addition, SimGrid provides a virtual machine abstraction that includes a migration feature. Therefore, SimGrid provides all the base abstractions necessary to implement the classes of distributed systems that are relevant to scientific workflow executions. However, these abstractions are low-level, and a common criticism of SimGrid is that implementing a simulation of a complex system requires a large software engineering effort. A WMS executing a workflow using several CI services is a complex system, and WRENCH builds on top of SimGrid to provide high-level abstractions so that implementing this complex system is not labor-intensive.

We have selected SimGrid for WRENCH for the following reasons. SimGrid has been used successfully in many distributed computing domains (cluster, peer-to-peer, grid, cloud, volunteer computing, etc.), and thus can be used to simulate WMSs that execute over a wide range of platforms. SimGrid is open source and freely available, has been stable for many years, is actively developed, has a sizable user community, and has provided simulation results for over 350 research publications since its inception. SimGrid has also been the object of many invalidation and validation studies [42–46], and its simulation models have been shown to provide compelling advantages over other simulation frameworks in terms of both accuracy and scalability [33]. Finally, most SimGrid simulations can be executed in minutes on a standard laptop computer, making it possible to perform large numbers of simulations quickly with minimal compute resource expenses. To the best of our knowledge, among comparable available simulation frameworks (as reviewed in Section 2), SimGrid is the only one to offer all the above desirable characteristics.

3.4. WRENCH Developer API

Algorithm 1 Blueprint for a WMS execution
1: procedure Main(workflow)
2:   Obtain list of available services
3:   Gather static information about the services
4:   while workflow execution has not completed/failed do
5:     Gather dynamic service/resource information
6:     Make data/computation scheduling decisions
7:     Interact with services to enact decisions
8:     Wait for and react to the next event
9:   end while
10:  return
11: end procedure

With the Developer API, a WMS is implemented as a single thread of control that executes according to the pseudo-code blueprint shown in Algorithm 1. Given a workflow to execute, a WMS first gathers information about all the CI services it can use to execute the workflow (lines 2–3). Examples of such information include the number of compute nodes provided by a compute service, the number of cores per node and the speed of these cores, the amount of storage space available in a storage service, the list of hosts monitored by a network monitoring service, etc. Then, the WMS iterates until the workflow execution is complete or has failed (line 4). At each iteration it gathers dynamic information about available services and resources if needed (line 5). Examples of such information include currently available capacities at compute or storage services, current network distances between pairs of hosts, etc. Based on resource information and on the current state of the workflow, the WMS can then make whatever scheduling decisions it sees fit (line 6). It then enacts these decisions by interacting with appropriate services (line 7). For instance, it could decide to submit a ‘‘job’’ to a compute service to execute a ready task on some number of cores and to copy all produced files to some storage service, or it could decide to just copy a file between storage services and then update a data location service to keep track of the location of this new file replica. It could also submit one or more pilot jobs [57] to compute services if they support them. It is the responsibility of the developer to implement all decision-making algorithms employed by the WMS. At the end of the iteration, the WMS simply waits for a workflow execution event to which it can react if need be (line 8). The most common events are job completions/failures and data transfer completions/failures.

The WRENCH Developer API provides a rich set of methods to create and analyze a workflow and to interact with CI services to execute a workflow. These methods were designed based on current and envisioned capabilities of state-of-the-art WMSs. We refer the reader to the WRENCH Web site [47] for more information on how to use this API and for the full API documentation. The key objective of this API is to make it straightforward to implement a complex system, namely a full-fledged WMS that interacts with diverse CI services. We achieve this objective by providing simple solutions and abstractions to handle well-known challenges that arise when implementing a complex distributed system (whether in the real world or in simulation), as explained hereafter.

SimGrid provides simple point-to-point communication between threads of control via a mailbox abstraction. One of the recognized strengths of SimGrid is that it employs highly accurate and yet scalable network simulation models. However, unlike some of its competitors, it does not provide any higher-level simulation abstractions, meaning that distributed systems must be implemented essentially from scratch, with message-based interactions between processes. All message-based interaction is abstracted away by WRENCH: although the simulated CI services exchange many messages with the WMS and among themselves, the WRENCH Developer API only exposes higher-level interactions with services (‘‘run this job’’, ‘‘move this data’’) and only requires that the WMS handle a few events. The WMS developer thus completely avoids the need to send and receive (and thus orchestrate) network messages.

Another challenge when developing a system like a WMS is the need to handle asynchronous interactions. While some
service interactions can be synchronous (e.g., ‘‘are you up?’’, ‘‘tell me your current load’’), most need to be asynchronous so that the WMS retains control. The typical solution is to maintain sets of request handles and/or to use multiple threads of control. To free the WMS developer from these responsibilities, WRENCH provides already-implemented ‘‘managers’’ that can be used out of the box to take care of asynchronicity. A WMS can instantiate such managers, which are independent threads of control. Each manager transparently interacts with CI services, maintains a set of pending requests, provides a simple API to check on the status of these requests, and automatically generates high-level workflow execution events. For instance, a WMS can instantiate a ‘‘job manager’’ through which it will create and submit jobs to compute services. It can at any time check on the status of a job, and the job manager interacts directly (and asynchronously) with compute services so as to generate ‘‘job done’’ or ‘‘job failed’’ events to which the WMS can react. In our experience developing simulators from scratch using SimGrid, the implementation of asynchronous interactions with simulated processes is a non-trivial development effort, both in terms of the amount of code to write and the difficulty of writing this code correctly. We posit that this is one of the reasons why some users have preferred simulation frameworks that provide higher-level abstractions than SimGrid even though they offer less attractive accuracy and/or scalability features. WRENCH provides such higher-level abstractions to WMS developers, and as a result implementing a WMS with WRENCH can be straightforward.
Finally, one of the challenges when developing a WMS is failure handling. It is expected that compute, storage, and network resources, as well as the CI services that use them, can fail throughout the execution of the WMS. SimGrid has the capability to simulate arbitrary failures via availability traces. Furthermore, failures can occur due to the WMS implementation itself, e.g., if it fails to check that the operations it attempts are actually valid, or if concurrent operations initiated by the WMS work at cross purposes. WRENCH abstracts away all these failures as C++ exceptions that can be caught by the WMS implementation, or caught by a manager and passed to the WMS as workflow execution events. Regardless, each failure exposes a failure cause, which encodes a detailed description of the failure. For instance, after initiating a file copy from one storage service to another, a ‘‘file copy failed’’ event sent to the WMS would include a failure cause that could specify that when trying to copy file x from storage service y to storage service z, storage service z did not have sufficient storage space. Other example failure causes could be that a network error occurred when storage service y attempted to receive a message from storage service z, or that service z was down. All CI services implemented in WRENCH simulate well-defined failure behaviors, and the failure handling capabilities afforded to simulated WMSs can actually allow more sophisticated failure tolerance strategies than currently done or possible in real-world implementations. But more importantly, the amount of code that needs to be written for failure handling in a simulated WMS is minimal.

Given the above, WRENCH makes it possible to implement a simulated WMS with very little code and effort. The example WMS implementation provided with the WRENCH distribution, which is simple but functional, is under 200 lines of C++ (once comments have been removed). See more discussion of the effort needed to implement a WMS with WRENCH in the context of our case studies (Section 4).

Fig. 2. Screenshot of the Web-based WRENCH dashboard that shows, among other information not displayed here, execution details of each workflow task and an interactive Gantt chart of the task executions.

3.5. WRENCH User API

With the User API one can quickly build a simulator, which typically follows these steps:

1. Instantiate a platform based on a SimGrid XML platform description file;
2. Create one or more workflows;
3. Instantiate services on the platform;
4. Instantiate one or more WMSs, telling each what services are at its disposal and what workflow it should execute starting at what time;
5. Launch the simulation; and
6. Process the simulation outcome.

The above steps can be implemented with only a few lines of C++. An example is provided and described in the Appendix. This example showcases only the most fundamental features of the WRENCH User API, and we refer the reader to the WRENCH Web site [47] for more detailed information on how to use this API and for the full API documentation. In the future this API will come with Python bindings so that users can implement simulators in Python.
H. Casanova, R. Ferreira da Silva, R. Tanaka et al. / Future Generation Computer Systems 112 (2020) 162–175 167
3.6. Simulation debugging and visualization
Analyzing and interpreting simulation logs is often a labor-
intensive process, and users of simulators typically develop sets
of scripts for parsing and extracting specific knowledge from
the logs. In order to provide WRENCH users with a rapid, first
insight on their simulation results, we have been developing a
Web-based ‘‘dashboard’’ that compiles simulation logs into a set
of tabular and graphical JavaScript components. The dashboard
presents overall views on task life-cycles, showing breakdowns
between compute and I/O operations, as well as a Gantt chart
and 2- and 3-dimensional plots of task executions and resource
usage during the simulated workflow execution. An overview of
energy consumption per compute resource can also be visualized
in the dashboard. The dashboard is currently under development
and will be available for users in the next WRENCH release. Fig. 2
shows a screenshot of a simple simulated workflow execution.
Fig. 3. Overview of the WRENCH Pegasus simulation components, including components for DAGMan and HTCondor frameworks. Red boxes denote Pegasus services developed with WRENCH's Developer API, and white boxes denote reused WRENCH components.

4. Case study: Simulating production WMSs
In this section, we present two WRENCH-based simulators of a
state-of-the-art WMS, Pegasus [2], and an application execution framework, WorkQueue [49], as case studies for evaluation and validation purposes.

Pegasus is being used in production to execute workflows for dozens of high-profile applications in a wide range of scientific domains [2]. Pegasus provides the necessary abstractions for scientists to create workflows and allows for transparent execution of these workflows on a range of compute platforms including clusters, clouds, and national cyberinfrastructures. During execution, Pegasus translates an abstract resource-independent workflow into an executable workflow, determining the specific executables, data, and computational resources required for the execution. Workflow execution with Pegasus includes data management, monitoring, and failure handling, and is managed by HTCondor DAGMan [58]. Individual workflow tasks are managed by a workload management framework, HTCondor [59], which supervises task executions on local and remote resources. Workflow executions with Pegasus follow a ‘‘push model’’, i.e., a task is bound to a particular compute resource at the onset of the execution and is always executed on that resource if possible.

WorkQueue is being used in production by a wide range of researchers across many scientific domains for building large-scale master–worker applications that span thousands of machines drawn from clusters, clouds, and grids [60]. WorkQueue allows users to define tasks, submit them to a workqueue abstraction, and wait for their completions. During execution, WorkQueue starts standard worker processes that can run on any available compute resource. These worker processes perform data transfers and execute tasks, making it possible to execute workflows. Worker processes can also be submitted for execution on HTCondor pools, which is the approach evaluated in this paper. Workflow executions with WorkQueue follow a ‘‘pull model’’, i.e., late binding of tasks to compute resources based on when resources become idle.

4.1. Implementing Pegasus with WRENCH

Since Pegasus relies on HTCondor, we first implemented the HTCondor services as simulated core CI services, which together form a new Compute Service that exposes the WRENCH Developer API. This makes HTCondor available to any WMS implementation that is to be simulated using WRENCH, and it has been included as part of the growing set of simulated core CI services provided by WRENCH.

HTCondor is composed of six main service daemons (startd, starter, schedd, shadow, negotiator, and collector). In addition, each host on which one or more of these daemons is spawned must also run a master daemon, which controls the execution of all other daemons (including initialization and completion). The bottom part of Fig. 3 depicts the components of our simulated HTCondor implementation, where daemons are shown in red-bordered boxes. In our simulator we implement the three fundamental HTCondor services, each realized as a particular set of daemons, as depicted in the bottom part of Fig. 3 in borderless white boxes. The Job Execution Service consists of a startd daemon, which adds the host on which it is running to the HTCondor pool, and of a starter daemon, which manages task executions on this host. The Central Manager Service consists of a collector daemon, which collects information about all other daemons, and of a negotiator daemon, which performs task/resource matchmaking. The Job Submission Service consists of a schedd daemon, which maintains a queue of tasks, and of several instances of a shadow daemon, each of which corresponds to a task submitted to the HTCondor pool for execution.

Given the simulated HTCondor implementation above, we then implemented the simulated Pegasus WMS, including the DAGMan workflow engine, using the WRENCH Developer API. This implementation instantiates all services and parses the workflow description file, the platform description file, and a Pegasus-specific configuration file. DAGMan orchestrates the workflow execution (e.g., a task is marked as ready for execution once all its parent tasks have successfully completed), and monitors the status of tasks submitted to the HTCondor pool using a pull model, i.e., task status is fetched from the pool at regular time intervals. The top part of Fig. 3 depicts the components of our simulated Pegasus implementation (each shown in a red box).

By leveraging WRENCH's high-level simulation abstractions, implementing HTCondor as a reusable core WRENCH service using the Developer API required only 613 lines of code. Similarly, implementing a simulated version of Pegasus, including DAGMan, was done with only 666 lines of code (127 of which merely parse simulation configuration files). These numbers include both header and source files, but exclude comments. We argue that this corresponds to a minor simulation software development effort when considering the complexity of the system being simulated.

Service implementations in WRENCH are all parameterizable. For instance, as services use message-based communications it
is possible to specify all message payloads in bytes (e.g., for control messages). Other parameters encompass various overheads, expressed either in seconds or in computation volumes (e.g., the task startup overhead on a compute service). In WRENCH, service implementations come with default values for all these parameters, but it is possible to pick custom values upon service instantiation. The process of picking parameter values so as to match a specific real-world system is referred to as simulation calibration. We calibrated our simulator by measuring delays observed in event traces of real-world workflow executions on hardware/software infrastructures (see Section 4.3).

The simulator code, details on the simulation calibration procedure, and the experimental scenarios used in the rest of this section are all publicly available online [61].

4.2. Implementing WorkQueue with WRENCH

WorkQueue implements a master–worker paradigm by which running worker processes can be assigned work dynamically. In our simulator, we implement workers as pilot jobs, a popular mechanism for late binding of computation to resources [57], which is implemented in WRENCH. As WorkQueue does not provide automated mechanisms for starting workers at runtime, i.e., for triggering pilot job submissions, our simulated implementation limits the number of concurrently running pilot jobs based on available compute resources (i.e., cores are not oversubscribed). The simulator implementation has two main components: (1) a workflow system for orchestrating task executions and assigning compute tasks to workers; and (2) a scheduler for submitting pilot jobs to compute services. As WRENCH provides a consistent interface across all compute services, our simulator can simulate the execution of WorkQueue on any compute service, provided it is configured to support pilot job execution.

Because WRENCH provides high-level simulation abstractions, implementing WorkQueue using the WRENCH Developer API required only 485 lines of code (113 of which merely parse simulation configuration files). These numbers include both header and source files, but exclude comments. As with the WRENCH-based Pegasus simulator, we argue that the simulation software development effort is minimal when considering the complexity of the system being simulated. The simulator code, details on the simulation calibration procedure, and the experimental scenarios used in the rest of this section are all publicly available online [62].

4.3. Experimental scenarios

We consider experimental scenarios defined by particular workflow instances to be executed on particular platforms. Due to the lack of publicly available detailed workflow execution traces (i.e., execution logs that include data sizes for all files, all execution delays, etc.), we have performed real workflow executions with Pegasus and WorkQueue, and collected raw, time-stamped event traces from these executions. These traces form the ground truth to which we can compare simulated executions. We consider these workflow applications:

• 1000Genome [63]: A data-intensive workflow that identifies mutational overlaps using data from the 1000 Genomes Project in order to provide a null distribution for rigorous statistical evaluation of potential disease-related mutations. We consider a 1000Genome instance that comprises 71 tasks.
• Montage [2]: A compute-intensive astronomy workflow for generating custom mosaics of the sky. For this experiment, we ran Montage to process 1.5 and 2.0 square degree 2MASS mosaics. We thus refer to each configuration as Montage-1.5 and Montage-2.0, respectively. Montage-1.5, resp. Montage-2.0, comprises 573, resp. 1240, tasks.
• SAND [64]: A compute-intensive bioinformatics workflow for accelerating genome assembly. For this experiment, we ran SAND on a full set of reads from the Anopheles gambiae Mopti form. We consider a SAND instance that comprises 606 tasks.

We use these platforms, deploying on each a submit node (which runs Pegasus and DAGMan or WorkQueue, as well as HTCondor's job submission and central manager services), four worker nodes (4 or 24 cores per node, with a shared file system), and a data node in the WAN:

• ExoGENI: A widely distributed networked infrastructure-as-a-service testbed representative of a ‘‘bare metal’’ platform. Each worker node is a 4-core 2.0 GHz processor with 12 GiB of RAM. The bandwidth between the data node and the submit node was ∼0.40 Gbps, and the bandwidth between the submit and worker nodes was ∼1.00 Gbps.
• AWS: Amazon's cloud platform, on which we use two types of virtual machine instances: t2.xlarge and m5.xlarge. The bandwidth between the data node and the submit node was ∼0.44 Gbps, and the bandwidths between the submit and worker nodes on these instances were ∼0.74 Gbps and ∼1.24 Gbps, respectively.
• Chameleon: An academic cloud testbed, on which we use homogeneous standard cloud units to run an HTCondor pool. Each unit consists of a 24-core 2.3 GHz processor with 128 GiB of RAM. The bandwidth between the submit node and the worker nodes on these instances was ∼10.00 Gbps.

To evaluate the accuracy of our simulators, we consider 4 particular experimental scenarios: 1000Genome on ExoGENI, Montage-1.5 on AWS-t2.xlarge, Montage-2.0 on AWS-m5.xlarge, and SAND on Chameleon. The first three scenarios are performed using Pegasus, and the last one is performed using WorkQueue. For each scenario we repeat the real-world workflow execution 5 times (since real-world executions are not perfectly deterministic), and also run a simulated execution using the WRENCH-based simulators described in the previous sections. For each execution, real-world or simulated, we keep track of the overall application makespan, as well as of time-stamped individual execution events such as task submission and completion dates.

4.4. Pegasus: Simulation accuracy

The fourth column in Table 1 shows average relative differences between actual and simulated makespans. We see that simulated makespans are close to actual makespans for all three Pegasus scenarios (average relative error is below 5%). One of the key advantages of building WRENCH on top of SimGrid is that WRENCH simulators benefit from the high-accuracy network models in SimGrid. In particular, these models capture many features of the TCP protocol (without resorting to packet-level simulation). And indeed, when comparing real-world and simulated executions we observe average relative errors below 3% for data movement operations. Furthermore, the many processes involved in a workflow execution interact by exchanging (typically small) control messages, and our simulators simulate these message exchanges. For instance, each time an output file is produced by a task, a data registry service is contacted,
Table 1
Average simulated makespan error (%), and p-values and Kolmogorov–Smirnov (KS) distances for task submission and completion dates, computed
for 5 runs of each of our 4 experimental scenarios.
Experimental scenario Avg. Makespan Task submissions Tasks completions
Workflow System Platform Error (%) p-value Distance p-value Distance
1000Genome Pegasus ExoGENI 1.10 ± 0.28 0.06 ± 0.01 0.21 ± 0.04 0.72 ± 0.06 0.12 ± 0.01
Montage-1.5 Pegasus AWS-t2.xlarge 4.25 ± 1.16 0.08 ± 0.01 0.16 ± 0.03 0.12 ± 0.05 0.21 ± 0.02
Montage-2.0 Pegasus AWS-m5.xlarge 3.37 ± 0.46 0.11 ± 0.03 0.06 ± 0.02 0.10 ± 0.01 0.11 ± 0.01
SAND WorkQueue Chameleon 3.96 ± 1.04 0.06 ± 0.01 0.11 ± 0.02 0.09 ± 0.02 0.09 ± 0.03
which incurs some communication overhead. When comparing
real-world to simulated executions we observe average relative
simulation error below 1% for these data registration overheads.
Overall, a key reason for the high accuracy of our simulators is
that they simulate data and control message transfers accurately.
To draw comparisons with a state-of-the-art simulator, we
repeated the above Pegasus simulation experiments using Work-
flowSim [35]. WorkflowSim simulates workflow executions based
on execution models similar to that of Pegasus and is built on
top of the CloudSim simulation framework [25]. However, Work-
flowSim does not provide a detailed simulated HTCondor imple-
mentation, and does not offer the same simulation calibration
capabilities as WRENCH. Nevertheless, we have painstakingly
calibrated the WorkflowSim simulator so that it models the hardware and software infrastructures of our experimental scenarios as closely as possible. For each of the 3 Pegasus experimental scenarios, we find that the relative average makespan percentage error is 12.09 ± 2.84, 26.87 ± 6.26, and 13.32 ± 1.12, respectively, i.e., from 4x up to 11x larger than the error values obtained with our WRENCH-based simulator. The reasons for the discrepancies between WorkflowSim and real-world results are twofold. First, WorkflowSim uses the simplistic network models in CloudSim (see discussion in Section 2) and thus suffers from simulation bias w.r.t. communication times. Second, WorkflowSim does not capture all the relevant details of the system and of the workflow execution. By contrast, our WRENCH-based simulator benefits from the accurate network simulation models provided by SimGrid, and it does capture low-level relevant details although implemented with only a few hundred lines of code.

In our experiments, we also record the submission and completion dates of each task, thus obtaining empirical cumulative distribution functions (ECDFs) of these dates, for both real-world and simulated executions. To further validate the accuracy of our simulation results, we apply Kolmogorov–Smirnov goodness-of-fit tests (KS tests) with the null hypothesis (H0) that the real-world and simulation samples are drawn from the same distribution. The two-sample KS test results in a miss if the null hypothesis (two-sided alternative hypothesis) is rejected at the 5% significance level (p-value ≤ 0.05). Each test for which the null hypothesis is not rejected (p-value > 0.05) indicates that the simulated execution statistically matches the real-world execution. Table 1 shows p-values and KS test distances for both task submission times and task completion times. The null hypothesis is not rejected, and we thus conclude that simulated workflow task executions statistically match real-world executions well. These conclusions are confirmed by visually comparing ECDFs. For instance, Fig. 4 shows real-world and simulated ECDFs for sample runs of Montage-2.0 on AWS-m5.xlarge, with task submission, resp. completion, date ECDFs on the left-hand, resp. right-hand, side. We observe that the simulated ECDFs (‘‘wrench’’) track the real-world ECDFs (‘‘pegasus’’) closely. We repeated these simulations using WorkflowSim, and found that the null hypothesis is rejected for all 3 simulation scenarios. This is confirmed visually in Fig. 4, where the ECDFs obtained from the WorkflowSim simulation (‘‘workflowsim’’) are far from the real-world ECDFs.

Fig. 4. Empirical cumulative distribution function of task submit times (left) and task completion times (right) for sample real-world (‘‘pegasus’’) and simulated (‘‘wrench’’) executions of Montage-2.0 on AWS-m5.xlarge.

Although KS tests and visual inspection of ECDFs validate that the WRENCH-simulated ECDFs statistically match the real-world ECDFs, these results do not distinguish between individual tasks. In fact, there are some discrepancies between real-world and simulated schedules. For instance, Fig. 5 shows Gantt charts corresponding to the workflow executions shown in Fig. 4, with the real-world execution on the left-hand side (‘‘pegasus’’) and the simulated execution on the right-hand side (‘‘wrench’’). Task executions are shown on the vertical axis, each as a line segment along the horizontal time axis spanning the time between the task's start time and finish time. Different task types, i.e., different executables, are shown in different colors. In this workflow, all tasks of the same type are independent and have the same priority. We see that the shapes of the yellow regions, for example, vary between the two executions. These variations are explained by implementation-dependent behaviors of the workflow scheduler. In many instances throughout a workflow execution, several ready tasks can be selected for execution, e.g., sets of independent tasks in the same level of the workflow. When the number of available compute resources, n, is smaller than the number of ready tasks, the scheduler picks n ready tasks for immediate execution. In most WMSs, these tasks are picked as whatever first n tasks are returned when iterating over the data structures in which task objects are stored. Building a perfectly faithful simulation of a WMS would thus entail implementing/using the exact same data structures as those in the actual implementation. This could be labor intensive or perhaps not even possible, depending on which data structures, languages, and/or libraries are used in that implementation. In the context of this Pegasus case study, the production implementation of the DAGMan scheduler uses a custom priority list to store ready tasks, while our simulated version stores workflow tasks in a C++ std::map data structure indexed by task string IDs. Consequently, when the real-world scheduler picks the first n ready tasks, it typically picks different tasks than those picked by its simulated counterpart. This is the cause of the discrepancies seen in Fig. 5.
Fig. 5. Task execution Gantt chart for sample real-world (‘‘pegasus’’) and simulated (‘‘wrench’’) executions of the Montage-2.0 workflow on the AWS-m5.xlarge platform.

Fig. 6. Empirical cumulative distribution function of task submit times (left) and task completion times (right) for sample real-world (‘‘workqueue’’) and simulated (‘‘wrench’’) executions of SAND on Chameleon Cloud.

Fig. 7. Task execution Gantt chart for sample real-world (‘‘workqueue’’) and simulated (‘‘wrench’’) executions of the SAND framework on the Chameleon Cloud platform.

4.5. WorkQueue: Simulation accuracy

Similar to the results obtained with our Pegasus simulator, we observe small (below 4%) average relative differences between actual and simulated makespans when using WorkQueue to execute the SAND workflow on the Chameleon cloud platform (4th experimental scenario in Table 1). The two-sample KS tests for both task submissions and completions indicate that simulated workflow task executions statistically match real-world executions with the WorkQueue application execution framework. Fig. 6 shows real-world (‘‘workqueue’’) and simulated (‘‘wrench’’) ECDFs of task submission, resp. completion, dates on the left-hand, resp. right-hand, side. The real-world and simulated task completion ECDFs closely match. However, for task submission dates, although the real-world and simulated ECDFs show very similar trends, the former exhibits a step function pattern while the latter does not. After investigating this inconsistency, we found that in the real-world experiments execution events are recorded in logs as jobs are generated, but not when they are actually submitted to the HTCondor pool. As a result, our ground truth is biased for task submission dates, as many tasks appear to be submitted at once. An additional (small) source of discrepancy is that in the real-world execution monitoring data from the HTCondor pool is pulled at regular intervals (about 5 s), while in our simulated execution there is no such delay (this delay could be added by modifying our simulator's implementation). We hypothesize that the simulated execution may actually be closer to the real-world execution than what is reported in the real-world execution logs (verifying this hypothesis would require re-engineering the actual WorkQueue implementation to improve its logging features). More generally, the above highlights the difficulties involved in defining what constitutes a sensible ground truth and obtaining this ground truth via real-world executions.

Fig. 7 shows Gantt charts corresponding to the SAND executions shown in Fig. 6. In the real-world execution, we observe the same step function pattern seen in Fig. 6, for the same reasons. But we also note other discrepancies between real-world and simulated executions. For instance, at the end of the real-world execution, there is a gap in the Gantt chart. This gap corresponds to tasks with sub-second durations, which are not visible due to the chart's resolution. In the real-world execution, all these tasks are submitted in sequence, hence the gap. By contrast, in the simulated execution these tasks are not submitted in sequence. The reason is exactly the same phenomenon as that observed for Pegasus in the previous section. During workflow execution there are often more ready tasks than available workers, and WorkQueue must pick some of these tasks for submission. This is done by removing the desired number of tasks from some data structure, and here again the real-world WorkQueue implementation and our simulation of it use different such data structures: the former uses a std::vector, while the latter uses a std::map.

Overall, the results in this and the previous sections show that WRENCH makes it possible to accurately simulate the execution of scientific workflows on production platforms. This was demonstrated for two qualitatively different systems: the Pegasus WMS (which uses a push model to perform early binding of tasks to compute resources) and the WorkQueue application execution framework (which uses a pull model to perform late binding of tasks to compute resources).

4.6. Simulation scalability

In this section, we evaluate the speed and the memory footprint of WRENCH simulations. Results with WRENCH version 1.2 were presented in the preliminary version of this work [48], while the results presented in this section are obtained with version 1.4. Between these two versions a number of performance improvements were made, which explains why the results presented here are superior. About 30% of the performance gain is due to using more appropriate data structures (e.g., whenever possible replacing C++ std::map and std::set data structures by std::unordered_map and std::unordered_set, since the
Table 2
Simulated workflow makespans and simulation times averaged over 5 runs of each of our 4
experimental scenarios.
Experimental scenario Avg. workflow Avg. simulation
Workflow System Platform Makespan (s) Time (s)
1000Genome Pegasus ExoGENI 761.0 ± 7.93 0.3 ± 0.01
Montage-1.5 Pegasus AWS-t2.xlarge 1,784.0 ± 137.67 8.3 ± 0.09
Montage-2.0 Pegasus AWS-m5.xlarge 2,911.8 ± 48.80 28.1 ± 0.52
SAND WorkQueue Chameleon 5,339.2 ± 133.56 16.3 ± 0.86
latter have O(1) average-case operations). About 50% of the gain is due to the removal of a data structure that kept track of all in-flight (simulated) network messages, so as to avoid memory leaks when simulating host failures; this feature is now only activated when host failures are actually simulated. The remaining 20% is due to miscellaneous code improvements (e.g., avoiding pass-by-value parameters, using helper data structures to trade off space for time).

Table 2 shows average simulated makespans and simulation execution times for our 4 experimental scenarios. Simulations are executed on a single core of a MacBook Pro 3.5 GHz Intel Core i7 with 16 GiB of RAM. For these scenarios, simulation times are more than 100x and up to 2500x shorter than the real-world workflow executions. This is because SimGrid simulates computation and communication operations as delays computed from computation and communication volumes, using simulation models with low computational complexity.

To further evaluate the scalability of our simulator, we use a workflow generator [65] to generate representative randomized configurations of the Montage workflow ranging from 1000 up to 10,000 tasks. We generate 5 workflow instances for each number of tasks, and simulate the execution of these generated workflow instances on 128 cores (AWS-m5.xlarge with 32 4-core nodes) using our WRENCH-based Pegasus simulator. Fig. 8 shows simulation time (left vertical axis) and maximum resident set size (right vertical axis) vs. the number of tasks in the workflow. Each sample point is the average over the 5 workflow instances (error bars are shown as well). As expected, both simulation time and memory footprint increase as workflows become larger. The memory footprint grows linearly with the number of tasks (simply due to the need to store more task objects). The simulation time grows faster initially, but then linearly beyond 7000 tasks. We conclude that the simulation scales well, making it possible to simulate very large 10,000-task Montage configurations in under 13 min on a standard laptop computer.

Fig. 8. Average simulation time (in seconds, left vertical axis) and memory usage (maximum resident set size, in MiB, right vertical axis) vs. workflow size.

Fig. 8 also includes results obtained with WorkflowSim. We find that WorkflowSim has a larger memory footprint than our WRENCH-based simulator (by a factor ∼1.41 for 10,000-task workflows), and that it is slower than our WRENCH-based simulator (by a factor ∼1.76 for 10,000-task workflows). In our previous work [48], WorkflowSim was faster than our WRENCH-based simulator (by a factor ∼1.81 for 10,000-task workflows), with roughly similar trends. The reason for this improvement is a set of memory management optimizations that were applied to the WRENCH implementation, in particular for handling message objects exchanged between processes. These optimizations have significantly improved the scalability of WRENCH, and it is likely that several other optimizations are possible to push scalability further. These optimizations have also decreased workflow simulation times significantly compared to the earlier WRENCH release, e.g., by a factor ∼2.6 for 10,000-task workflows.

Overall, our experimental results show that WRENCH not only yields accurate simulation results but also can scalably simulate the execution of large-scale complex scientific applications running on heterogeneous platforms.

5. Conclusion

In this paper, we have presented WRENCH, a simulation framework for building simulators of Workflow Management Systems. WRENCH implements high-level simulation abstractions on top of the SimGrid simulation framework, so as to make it possible to build simulators that are accurate, that can run scalably on a single computer, and that can be implemented with minimal software development effort. Via case studies of the Pegasus production WMS and the WorkQueue application execution framework, we have demonstrated that WRENCH achieves these objectives, and that it compares favorably to a recently proposed workflow simulator. The main finding is that with WRENCH one can implement an accurate and scalable simulator of a complex real-world system with a few hundred lines of code. WRENCH is open source and welcomes contributors. WRENCH is already being used for several research and education projects, and Version 1.5 was released in February 2020. We refer the reader to https://wrench-project.org for software, documentation, and links to related projects.

A short-term development direction is to use WRENCH to simulate the execution of current production WMSs and application execution frameworks (as was done for Pegasus and WorkQueue in Section 4). Although we have designed WRENCH with knowledge of many such systems in mind, we expect that WRENCH APIs and abstractions will evolve once we set out to realize these implementations. Another development direction is the implementation of more CI service abstractions in WRENCH, e.g., a Hadoop Compute Service or specific distributed cloud Storage Services. From a research perspective, a future direction is automated simulation calibration. As seen in our Pegasus and WorkQueue case studies, even when using validated simulation models, the values of a number of simulation parameters must be carefully chosen in order to obtain accurate simulation results. This issue is not confined to WRENCH, but is faced by
172 H. Casanova, R. Ferreira da Silva, R. Tanaka et al. / Future Generation Computer Systems 112 (2020) 162–175

Fig. 9. Example fully functional WRENCH simulator. Try-catch clauses are omitted.

all distributed system simulators. In our case studies, we have CRediT authorship contribution statement
calibrated these parameters manually by analyzing and compar-
ing simulated and real-world execution event traces. While, to Henri Casanova: Conceptualization, Methodology, Software,
the best of our knowledge, this is the typical practice, what is Validation, Formal analysis, Investigation, Data curation, Writing -
truly needed is an automated calibration method. Ideally, this original draft, Visualization, Funding acquisition. Rafael Ferreira
da Silva: Conceptualization, Methodology, Software, Validation,
method would process a (small) number of (not too large) real-
Formal analysis, Investigation, Data curation, Writing - original
world execution traces for ‘‘training scenarios’’, and compute a
draft, Visualization, Funding acquisition. Ryan Tanaka: Software,
valid and robust set of calibration parameter values. An important Writing - review & editing. Suraj Pandey: Software. Gautam Jeth-
research question will then be to understand to which extent wani: Software. William Koch: Software. Spencer Albrecht: Soft-
these automatically computed calibrations can be composed and ware. James Oeth: Software. Frédéric Suter: Software, Writing -
extrapolated to scenarios beyond the training scenarios. review & editing, Funding acquisition.
H. Casanova, R. Ferreira da Silva, R. Tanaka et al. / Future Generation Computer Systems 112 (2020) 162–175 173
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is funded by National Science Foundation (NSF), USA contracts #1642369 and #1642335, "SI2-SSE: WRENCH: A Simulation Workbench for Scientific Workflow Users, Developers, and Researchers"; by CNRS, France under grant #PICS07239; and partly funded by National Science Foundation (NSF), USA contracts #1923539 and #1923621: "CyberTraining: Implementation: Small: Integrating core CI literacy and skills into university curricula via simulation-driven activities". We thank Martin Quinson, Arnaud Legrand, and Pierre-François Dutot for their valuable help. We also thank the NSF Chameleon Cloud, USA for providing time grants to access their resources.

Appendix. Example WRENCH simulator

An example WRENCH simulator developed using the WRENCH User API (see Section 3.5) is shown in Fig. 9. This simulator uses a WMS implementation (called SomeWMS) that has already been developed using the WRENCH Developer API (see Section 3.4). After initializing the simulation (lines 7–8), the simulator instantiates a platform (line 11) and a workflow (lines 14–15). A workflow is defined as a set of computation tasks and data files, with control and data dependencies between tasks. Each task can also have a priority, which can then be taken into account by a WMS for scheduling purposes. Although the workflow can be defined purely programmatically, in this example the workflow is imported from a workflow description file in the DAX format [66]. At line 18 the simulator creates a storage service with 1 PiB capacity accessible on host storage_host. This and other hostnames are specified in the XML platform description file. At line 22 the simulator creates a compute service that corresponds to a 4-node batch-scheduled cluster. The physical characteristics of the compute nodes (node[1-4]) are specified in the platform description file. This compute service has a 1 TiB scratch storage space. Its behavior is customized by passing a couple of property-value pairs to its constructor: it will be subject to a background load as defined by a trace in the standard SWF format [67], and its batch queue will be managed using the EASY Backfilling scheduling algorithm [68]. The simulator then creates a second compute service (line 28), which is a 4-host cloud service with 4 TiB scratch space, customized so that it does not support pilot jobs. Two helper services are instantiated: a data registry service so that the WMS can keep track of file locations (line 33), and a network monitoring service that uses the Vivaldi algorithm [69] to measure network distances between the two hosts from which the compute services are accessed (batch_login and cloud_gateway) and the my_host host, which is the host that runs these helper services and the WMS (line 36). At line 41, the simulator specifies that the workflow data file input_file is initially available at the storage service. It then instantiates the WMS and passes to it all available services (line 44), and assigns the workflow to it (line 47). The crucial call is at line 50, where the simulation is launched and the simulator hands off control to WRENCH. When this call returns, the workflow has either completed or failed. Assuming it has completed, the simulator then retrieves the ordered set of task completion events (line 53) and performs some (in this example, trivial) mining of these events (line 55).

For brevity, the example in Fig. 9 omits try/catch clauses. Also, note that although the simulator uses the new operator to instantiate WRENCH objects, the simulation object takes ownership of these objects (using unique or shared pointers), so that there is no memory deallocation onus placed on the user.

References

[1] I.J. Taylor, E. Deelman, D.B. Gannon, M. Shields, Workflows for E-Science: Scientific Workflows for Grids, Springer Publishing Company, Incorporated, 2007, http://dx.doi.org/10.1007/978-1-84628-757-2.
[2] E. Deelman, K. Vahi, G. Juve, M. Rynge, S. Callaghan, P.J. Maechling, R. Mayani, W. Chen, R. Ferreira da Silva, M. Livny, K. Wenger, Pegasus: a workflow management system for science automation, Future Gener. Comput. Syst. 46 (2015) 17–35, http://dx.doi.org/10.1016/j.future.2014.10.008.
[3] T. Fahringer, R. Prodan, R. Duan, J. Hofer, F. Nadeem, F. Nerieri, S. Podlipnig, J. Qin, M. Siddiqui, H.-L. Truong, et al., Askalon: A development and grid computing environment for scientific workflows, in: Workflows for E-Science, Springer, 2007, pp. 450–471, http://dx.doi.org/10.1007/978-1-84628-757-2_27.
[4] M. Wilde, M. Hategan, J.M. Wozniak, B. Clifford, D.S. Katz, I. Foster, Swift: A language for distributed parallel scripting, Parallel Comput. 37 (9) (2011) 633–652, http://dx.doi.org/10.1016/j.parco.2011.05.005.
[5] K. Wolstencroft, R. Haines, D. Fellows, A. Williams, D. Withers, S. Owen, S. Soiland-Reyes, I. Dunlop, A. Nenadic, P. Fisher, et al., The Taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud, Nucleic Acids Res. (2013) gkt328, http://dx.doi.org/10.1093/nar/gkt328.
[6] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludascher, S. Mock, Kepler: an extensible system for design and execution of scientific workflows, in: Proc. of the 16th International Conference on Scientific and Statistical Database Management, IEEE, 2004, pp. 423–424, http://dx.doi.org/10.1109/SSDM.2004.1311241.
[7] M. Albrecht, P. Donnelly, P. Bui, D. Thain, Makeflow: A portable abstraction for data intensive computing on clusters, clouds, and grids, in: 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, ACM, 2012, p. 1, http://dx.doi.org/10.1145/2443416.2443417.
[8] N. Vydyanathan, U.V. Catalyurek, T.M. Kurc, P. Sadayappan, J.H. Saltz, Toward optimizing latency under throughput constraints for application workflows on clusters, in: Euro-Par 2007 Parallel Processing, Springer, 2007, pp. 173–183, http://dx.doi.org/10.1007/978-3-540-74466-5_20.
[9] A. Benoit, V. Rehn-Sonigo, Y. Robert, Optimizing latency and reliability of pipeline workflow applications, in: Proc. of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), IEEE, 2008, pp. 1–10, http://dx.doi.org/10.1109/IPDPS.2008.4536160.
[10] Y. Gu, Q. Wu, Maximizing workflow throughput for streaming applications in distributed environments, in: Proc. of the 19th International Conference on Computer Communications and Networks (ICCCN), IEEE, 2010, pp. 1–6, http://dx.doi.org/10.1109/ICCCN.2010.5560146.
[11] M. Malawski, G. Juve, E. Deelman, J. Nabrzyski, Algorithms for cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds, Future Gener. Comput. Syst. 48 (2015) 1–18, http://dx.doi.org/10.1016/j.future.2015.01.004.
[12] J. Chen, Y. Yang, Temporal dependency-based checkpoint selection for dynamic verification of temporal constraints in scientific workflow systems, ACM Trans. Softw. Eng. Methodol. (TOSEM) 20 (3) (2011) 9, http://dx.doi.org/10.1145/2000791.2000793.
[13] G. Kandaswamy, A. Mandal, D. Reed, et al., Fault tolerance and recovery of scientific workflows on computational grids, in: Proc. of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGRID'08), IEEE, 2008, pp. 777–782, http://dx.doi.org/10.1109/CCGRID.2008.79.
[14] R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Gener. Comput. Syst. 29 (8) (2013) 2284–2294, http://dx.doi.org/10.1016/j.future.2013.06.012.
[15] W. Chen, R. Ferreira da Silva, E. Deelman, T. Fahringer, Dynamic and fault-tolerant clustering for scientific workflows, IEEE Trans. Cloud Comput. 4 (1) (2016) 49–62, http://dx.doi.org/10.1109/TCC.2015.2427200.
[16] H.M. Fard, R. Prodan, J.J.D. Barrionuevo, T. Fahringer, A multi-objective approach for workflow scheduling in heterogeneous environments, in: Proc. of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), IEEE Computer Society, 2012, pp. 300–309, http://dx.doi.org/10.1109/CCGrid.2012.114.
[17] I. Pietri, M. Malawski, G. Juve, E. Deelman, J. Nabrzyski, R. Sakellariou, Energy-constrained provisioning for scientific workflow ensembles, in: Proc. of the 3rd International Conference on Cloud and Green Computing (CGC), IEEE, 2013, pp. 34–41, http://dx.doi.org/10.1109/CGC.2013.14.
[18] M. Tikir, M. Laurenzano, L. Carrington, A. Snavely, PSINS: An open source event tracer and execution simulator for MPI applications, in: Proc. of the 15th Intl. Euro-Par Conf. on Parallel Processing, in: LNCS, (5704), Springer, 2009, pp. 135–148, http://dx.doi.org/10.1007/978-3-642-03869-3_16.
[19] T. Hoefler, T. Schneider, A. Lumsdaine, LogGOPSim - simulating large-scale applications in the LogGOPS model, in: Proc. of the ACM Workshop on Large-Scale System and Application Performance, 2010, pp. 597–604, http://dx.doi.org/10.1145/1851476.1851564.
[20] G. Zheng, G. Kakulapati, L. Kalé, BigSim: A parallel simulator for performance prediction of extremely large parallel machines, in: Proc. of the 18th Intl. Parallel and Distributed Processing Symposium (IPDPS), 2004, http://dx.doi.org/10.1109/IPDPS.2004.1303013.
[21] R. Bagrodia, E. Deelman, T. Phan, Parallel simulation of large-scale parallel applications, Int. J. High Perform. Comput. Appl. 15 (1) (2001) 3–12, http://dx.doi.org/10.1177/109434200101500101.
[22] W.H. Bell, D.G. Cameron, A.P. Millar, L. Capozza, K. Stockinger, F. Zini, OptorSim - a grid simulator for studying dynamic data replication strategies, Int. J. High Perform. Comput. Appl. 17 (4) (2003) 403–416, http://dx.doi.org/10.1177/10943420030174005.
[23] R. Buyya, M. Murshed, GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing, Concurr. Comput.: Pract. Exper. 14 (13–15) (2003) 1175–1220, http://dx.doi.org/10.1002/cpe.710.
[24] S. Ostermann, R. Prodan, T. Fahringer, Dynamic cloud provisioning for scientific grid workflows, in: Proc. of the 11th ACM/IEEE Intl. Conf. on Grid Computing (Grid), 2010, pp. 97–104, http://dx.doi.org/10.1109/GRID.2010.5697953.
[25] R.N. Calheiros, R. Ranjan, A. Beloglazov, C.A.F. De Rose, R. Buyya, CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms, Softw. - Pract. Exp. 41 (1) (2011) 23–50, http://dx.doi.org/10.1002/spe.995.
[26] A. Núñez, J. Vázquez-Poletti, A. Caminero, J. Carretero, I.M. Llorente, Design of a new cloud computing simulation platform, in: Proc. of the 11th Intl. Conf. on Computational Science and Its Applications, 2011, pp. 582–593, http://dx.doi.org/10.1007/978-3-642-21931-3_45.
[27] G. Kecskemeti, DISSECT-CF: A simulator to foster energy-aware scheduling in infrastructure clouds, Simul. Model. Pract. Theory 58 (2) (2015) 188–218, http://dx.doi.org/10.1016/j.simpat.2015.05.009.
[28] A. Montresor, M. Jelasity, PeerSim: A scalable P2P simulator, in: Proc. of the 9th Intl. Conf. on Peer-To-Peer, 2009, pp. 99–100, http://dx.doi.org/10.1109/P2P.2009.5284506.
[29] I. Baumgart, B. Heep, S. Krause, OverSim: A flexible overlay network simulation framework, in: Proc. of the 10th IEEE Global Internet Symposium, IEEE, 2007, pp. 79–84, http://dx.doi.org/10.1109/GI.2007.4301435.
[30] M. Taufer, A. Kerstens, T. Estrada, D. Flores, P.J. Teller, SimBA: A discrete event simulator for performance prediction of volunteer computing projects, in: Proc. of the 21st Intl. Workshop on Principles of Advanced and Distributed Simulation, 2007, pp. 189–197, http://dx.doi.org/10.1109/PADS.2007.27.
[31] T. Estrada, M. Taufer, K. Reed, D.P. Anderson, EmBOINC: An emulator for performance analysis of BOINC projects, in: Proc. of the Workshop on Large-Scale and Volatile Desktop Grids (PCGrid), 2009, http://dx.doi.org/10.1109/IPDPS.2009.5161135.
[32] D. Kondo, SimBOINC: A simulator for desktop grids and volunteer computing systems, 2007, Available at http://simboinc.gforge.inria.fr/.
[33] H. Casanova, A. Giersch, A. Legrand, M. Quinson, F. Suter, Versatile, scalable, and accurate simulation of distributed applications and platforms, J. Parallel Distrib. Comput. 74 (10) (2014) 2899–2917, http://dx.doi.org/10.1016/j.jpdc.2014.06.008.
[34] C.D. Carothers, D. Bauer, S. Pearce, ROSS: A high-performance, low memory, modular time warp system, in: Proc. of the 14th ACM/IEEE/SCS Workshop on Parallel and Distributed Simulation, 2000, pp. 53–60, http://dx.doi.org/10.1109/PADS.2000.847144.
[35] W. Chen, E. Deelman, WorkflowSim: A toolkit for simulating scientific workflows in distributed environments, in: Proc. of the 8th IEEE Intl. Conf. on E-Science, 2012, pp. 1–8, http://dx.doi.org/10.1109/eScience.2012.6404430.
[36] A. Hirales-Carbajal, A. Tchernykh, T. Röblitz, R. Yahyapour, A grid simulation framework to study advance scheduling strategies for complex workflow applications, in: Proc. of the IEEE Intl. Symp. on Parallel and Distributed Processing Workshops (IPDPSW), 2010, http://dx.doi.org/10.1109/IPDPSW.2010.5470918.
[37] M.-H. Tsai, K.-C. Lai, H.-Y. Chang, K. Fu Chen, K.-C. Huang, Pewss: A platform of extensible workflow simulation service for workflow scheduling research, Softw. - Pract. Exp. 48 (4) (2017) 796–819, http://dx.doi.org/10.1002/spe.2555.
[38] S. Ostermann, K. Plankensteiner, D. Bodner, G. Kraler, R. Prodan, Integration of an event-based simulation framework into a scientific workflow execution environment for grids and clouds, in: Proc. of the 4th ServiceWave European Conference, 2011, pp. 1–13, http://dx.doi.org/10.1007/978-3-642-24755-2_1.
[39] G. Kecskemeti, S. Ostermann, R. Prodan, Fostering energy-awareness in simulations behind scientific workflow management systems, in: Proc. of the 7th IEEE/ACM Intl. Conf. on Utility and Cloud Computing, 2014, pp. 29–38, http://dx.doi.org/10.1109/UCC.2014.11.
[40] J. Cao, S. Jarvis, S. Saini, G. Nudd, GridFlow: Workflow management for grid computing, in: Proc. of the 3rd IEEE/ACM Intl. Symp. on Cluster Computing and the Grid (CCGrid), 2003, pp. 198–205.
[41] The SimGrid project, 2019, Available at http://simgrid.org/.
[42] P. Bedaride, A. Degomme, S. Genaud, A. Legrand, G. Markomanolis, M. Quinson, M. Stillwell, F. Suter, B. Videau, Toward better simulation of MPI applications on Ethernet/TCP networks, in: Proc. of the 4th Intl. Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, 2013, http://dx.doi.org/10.1007/978-3-319-10214-6_8.
[43] P. Velho, L. Mello Schnorr, H. Casanova, A. Legrand, On the validity of flow-level TCP network models for grid and cloud simulations, ACM Trans. Model. Comput. Simul. 23 (4) (2013) http://dx.doi.org/10.1145/2517448.
[44] P. Velho, A. Legrand, Accuracy study and improvement of network simulation in the SimGrid framework, in: Proc. of the 2nd Intl. Conf. on Simulation Tools and Techniques, 2009, http://dx.doi.org/10.4108/ICST.SIMUTOOLS2009.5592.
[45] K. Fujiwara, H. Casanova, Speed and accuracy of network simulation in the SimGrid framework, in: Proc. of the 1st Intl. Workshop on Network Simulation Tools, 2007.
[46] A. Lèbre, A. Legrand, F. Suter, P. Veyre, Adding storage simulation capacities to the SimGrid toolkit: Concepts, models, and API, in: Proc. of the 8th IEEE Intl. Symp. on Cluster Computing and the Grid, 2015, http://dx.doi.org/10.1109/CCGrid.2015.134.
[47] The WRENCH project, 2020, https://wrench-project.org.
[48] H. Casanova, S. Pandey, J. Oeth, R. Tanaka, F. Suter, R. Ferreira da Silva, WRENCH: A framework for simulating workflow management systems, in: 13th Workshop on Workflows in Support of Large-Scale Science (WORKS'18), 2018, pp. 74–85, http://dx.doi.org/10.1109/WORKS.2018.00013.
[49] L. Yu, C. Moretti, A. Thrasher, S. Emrich, K. Judd, D. Thain, Harnessing parallelism in multicore clusters with the all-pairs, wavefront, and makeflow abstractions, Cluster Comput. 13 (3) (2010) 243–256, http://dx.doi.org/10.1007/s10586-010-0134-7.
[50] The ns-3 Network Simulator, Available at http://www.nsnam.org.
[51] E. León, R. Riesen, A. Maccabe, P. Bridges, Instruction-level simulation of a cluster at scale, in: Proc. of the Intl. Conf. for High Performance Computing and Communications (SC), 2009, http://dx.doi.org/10.1145/1654059.1654063.
[52] R. Fujimoto, Parallel discrete event simulation, Commun. ACM 33 (10) (1990) 30–53, http://dx.doi.org/10.1145/84537.84545.
[53] V. Cima, J. Beránek, S. Böhm, ESTEE: A simulation toolkit for distributed workflow execution, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Research Poster, 2019.
[54] S. Ostermann, G. Kecskemeti, R. Prodan, Multi-layered simulations at the heart of workflow enactment on clouds, Concurr. Comput. Pract. Exp. 28 (2016) 3180–3201, http://dx.doi.org/10.1002/cpe.3733.
[55] R. Matha, S. Ristov, R. Prodan, Simulation of a workflow execution as a real cloud by adding noise, Simul. Model. Pract. Theory 79 (2017) 37–53, http://dx.doi.org/10.1016/j.simpat.2017.09.003.
[56] L. Bobelin, A. Legrand, D.A.G. Márquez, P. Navarro, M. Quinson, F. Suter, C. Thiery, Scalable multi-purpose network representation for large scale distributed system simulation, in: Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Ottawa, Canada, 2012, pp. 220–227, http://dx.doi.org/10.1109/CCGrid.2012.31.
[57] M. Turilli, M. Santcroos, S. Jha, A comprehensive perspective on pilot-job systems, ACM Comput. Surv. 51 (2) (2018) 43:1–43:32, http://dx.doi.org/10.1145/3177851.
[58] J. Frey, Condor DAGMan: Handling inter-job dependencies, Tech. Rep., University of Wisconsin, Dept. of Computer Science, 2002, URL http://www.bo.infn.it/calcolo/condor/dagman/.
[59] D. Thain, T. Tannenbaum, M. Livny, Distributed computing in practice: the Condor experience, Concurr. Comput.: Pract. Exp. 17 (2–4) (2005) 323–356, http://dx.doi.org/10.1002/cpe.938.
[60] B. Tovar, R. Ferreira da Silva, G. Juve, E. Deelman, W. Allcock, D. Thain, M. Livny, A job sizing strategy for high-throughput scientific workflows, IEEE Trans. Parallel Distrib. Syst. 29 (2) (2018) 240–253, http://dx.doi.org/10.1109/TPDS.2017.2762310.
[61] The WRENCH Pegasus simulator, 2019, https://github.com/wrench-project/pegasus.
[62] The WRENCH WorkQueue simulator, 2019, https://github.com/wrench-project/workqueue.
[63] R. Ferreira da Silva, R. Filgueira, E. Deelman, E. Pairo-Castineira, I.M. Overton, M. Atkinson, Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows, Future Gener. Comput. Syst. 95 (2019) 615–628, http://dx.doi.org/10.1016/j.future.2019.01.015.
[64] C. Moretti, A. Thrasher, L. Yu, M. Olson, S. Emrich, D. Thain, A framework for scalable genome assembly on clusters, clouds, and grids, IEEE Trans. Parallel Distrib. Syst. 23 (12) (2012) 2189–2197, http://dx.doi.org/10.1109/TPDS.2012.80.
[65] R. Ferreira da Silva, W. Chen, G. Juve, K. Vahi, E. Deelman, Community resources for enabling and evaluating research on scientific workflows, in: 10th IEEE International Conference on E-Science, in: eScience'14, 2014, pp. 177–184, http://dx.doi.org/10.1109/eScience.2014.44.
[66] Pegasus' DAX workflow description format, 2019, https://pegasus.isi.edu/documentation/creating_workflows.php.
[67] The standard workload format, 2019, http://www.cs.huji.ac.il/labs/parallel/workload/swf.html.
[68] D. Lifka, The ANL/IBM SP scheduling system, in: Proc. of the 1st Workshop on Job Scheduling Strategies for Parallel Processing, in: LNCS, vol. 949, 1995, pp. 295–303, http://dx.doi.org/10.1007/3-540-60153-8_35.
[69] F. Dabek, R. Cox, F. Kaashoek, R. Morris, Vivaldi: A decentralized network coordinate system, in: Proc. of SIGCOMM, 2004, http://dx.doi.org/10.1145/1015467.1015471.

Henri Casanova is a Professor in the Information and Computer Science Department, the University of Hawaii at Manoa. His research interests include the area of parallel and distributed computing, with emphasis on the modeling and simulation of platforms and applications, as well as both the theoretical and practical aspects of scheduling problems. See https://henricasanova.github.io/ for further information.

Rafael Ferreira da Silva is a Research Assistant Professor in the Department of Computer Science at the University of Southern California, and a Research Lead in the Science Automation Technologies group at the USC Information Sciences Institute. His research focuses on the efficient execution of scientific workflows on heterogeneous distributed systems (including high performance and high throughput computing), computational reproducibility, modeling and simulation of parallel and distributed computing systems, and data science (workflow performance analysis, user behavior in HPC/HTC). Dr. Ferreira da Silva received his PhD in Computer Science from INSA-Lyon, France, in 2013. For more information, please visit http://www.rafaelsilva.com.

Ryan Tanaka is a programmer analyst in the Science Automation Technologies group at ISI. He received his Master's degree in Computer Science from the University of Hawaii at Manoa. His research interests include distributed systems and data intensive applications. His current work has been focused on developing various tools used in the Pegasus Workflow Management System.

Suraj Pandey obtained his Computer Science M.S. degree from the University of Hawaii at Manoa, and his undergraduate degree from the Institute of Engineering, Pulchowk Campus, Nepal.

Gautam Jethwani is a Computer Science undergraduate student at the University of Southern California.

William Koch is a Computer Science M.S. student at the University of Hawaii at Manoa.

Spencer Albrecht is a Computer Science undergraduate student at the University of Southern California.

James Oeth is a Computer Science undergraduate student at the University of Southern California.

Frédéric Suter has been a CNRS researcher at the IN2P3 Computing Center in Lyon, France, since 2008. His research interests include scheduling, grid computing, and platform and application simulation. He obtained his M.S. from the Université Jules Verne, Amiens, France, in 1999, his Ph.D. from the Ecole Normale Supérieure de Lyon, France, in 2002, and his Habilitation Thesis from the Ecole Normale Supérieure de Lyon, France, in 2014.
You might also like