Sparrow: Distributed, Low Latency Scheduling

Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica


University of California, Berkeley

Abstract

Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.

[Figure 1: Data analytics frameworks can analyze large volumes of data with ever lower latency. The timeline spans 2004 MapReduce batch jobs, 2009 Hive queries, 2010 Dremel and in-memory Spark queries, and 2012 Impala queries, against a latency scale running from 10 min. to 10 sec., 100 ms, and 1 ms.]
1 Introduction

Today's data analytics clusters are running ever shorter and higher-fanout jobs. Spurred by demand for lower-latency interactive data processing, efforts in research and industry alike have produced frameworks (e.g., Dremel [12], Spark [26], Impala [11]) that stripe work across thousands of machines or store data in memory in order to analyze large volumes of data in seconds, as shown in Figure 1. We expect this trend to continue with a new generation of frameworks targeting sub-second response times. Bringing response times into the 100ms range will enable powerful new applications; for example, user-facing services will be able to run sophisticated parallel computations, such as language translation and highly personalized search, on a per-query basis.

Jobs composed of short, sub-second tasks present a difficult scheduling challenge. These jobs arise not only due to frameworks targeting low latency, but also as a result of breaking long-running batch jobs into a large number of short tasks, a technique that improves fairness and mitigates stragglers [17]. When tasks run in hundreds of milliseconds, scheduling decisions must be made at very high throughput: a cluster containing ten thousand 16-core machines and running 100ms tasks may require over 1 million scheduling decisions per second. Scheduling must also be performed with low latency: for 100ms tasks, scheduling delays (including queueing delays) above tens of milliseconds represent intolerable overhead. Finally, as processing frameworks approach interactive time-scales and are used in customer-facing systems, high system availability becomes a requirement. These design requirements differ from those of traditional batch workloads.

Modifying today's centralized schedulers to support sub-second parallel tasks presents a difficult engineering challenge. Supporting sub-second tasks requires handling two orders of magnitude higher throughput than the fastest existing schedulers (e.g., Mesos [8], YARN [16], SLURM [10]); meeting this design requirement would be difficult with a design that schedules and launches all tasks through a single node. Additionally, achieving high availability would require the replication or recovery of large amounts of state in sub-second time. This paper explores the opposite extreme in the design space: we propose scheduling from a set of machines that operate autonomously and without centralized or logically centralized state. A decentralized design offers attractive scaling and availability properties. The system can support more requests by adding additional schedulers, and if a scheduler fails, users can direct requests to an alternate scheduler. The key challenge with a decentralized design is providing response times comparable to those provided by a centralized scheduler, given that concurrently operating schedulers may make conflicting scheduling decisions.

[Publication notice: Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the Owner/Author(s). SOSP '13, Nov. 3-6, 2013, Farmington, Pennsylvania, USA. ACM 978-1-4503-2388-8/13/11. [Link]]
We present Sparrow, a stateless distributed scheduler that adapts the power of two choices load balancing technique [14] to the domain of parallel task scheduling. The power of two choices technique proposes scheduling each task by probing two random servers and placing the task on the server with fewer queued tasks. We introduce three techniques to make the power of two choices effective in a cluster running parallel jobs:

Batch Sampling: The power of two choices performs poorly for parallel jobs because job response time is sensitive to tail task wait time (because a job cannot complete until its last task finishes) and tail wait times remain high with the power of two choices. Batch sampling solves this problem by applying the recently developed multiple choices approach [18] to the domain of parallel job scheduling. Rather than sampling for each task individually, batch sampling places the m tasks in a job on the least loaded of d · m randomly selected worker machines (for d > 1). We demonstrate analytically that, unlike the power of two choices, batch sampling's performance does not degrade as a job's parallelism increases.

Late Binding: The power of two choices suffers from two remaining performance problems: first, server queue length is a poor indicator of wait time, and second, due to messaging delays, multiple schedulers sampling in parallel may experience race conditions. Late binding avoids these problems by delaying assignment of tasks to worker machines until workers are ready to run the task, and reduces median job response time by as much as 45% compared to batch sampling alone.

Policies and Constraints: Sparrow uses multiple queues on worker machines to enforce global policies, and supports the per-job and per-task placement constraints needed by analytics frameworks. Neither policy enforcement nor constraint handling are addressed in simpler theoretical models, but both play an important role in real clusters [21].

We have deployed Sparrow on a 110-machine cluster to evaluate its performance. When scheduling TPC-H queries, Sparrow provides response times within 12% of an ideal scheduler, schedules with median queueing delay of less than 9ms, and recovers from scheduler failures in less than 120ms. Sparrow provides low response times for jobs with short tasks, even in the presence of tasks that take up to 3 orders of magnitude longer. In spite of its decentralized design, Sparrow maintains aggregate fair shares, and isolates users with different priorities such that a misbehaving low priority user increases response times for high priority jobs by at most 40%. Simulation results suggest that Sparrow will continue to perform well as cluster size increases to tens of thousands of cores. Our results demonstrate that distributed scheduling using Sparrow presents a viable alternative to centralized scheduling for low latency, parallel workloads.

2 Design Goals

This paper focuses on fine-grained task scheduling for low-latency applications.

Low-latency workloads have more demanding scheduling requirements than batch workloads do, because batch workloads acquire resources for long periods of time and thus require infrequent task scheduling. To support a workload composed of sub-second tasks, a scheduler must provide millisecond-scale scheduling delay and support millions of task scheduling decisions per second. Additionally, because low-latency frameworks may be used to power user-facing services, a scheduler for low-latency workloads should be able to tolerate scheduler failure.

Sparrow provides fine-grained task scheduling, which is complementary to the functionality provided by cluster resource managers. Sparrow does not launch new processes for each task; instead, Sparrow assumes that a long-running executor process is already running on each worker machine for each framework, so that Sparrow need only send a short task description (rather than a large binary) when a task is launched. These executor processes may be launched within a static portion of a cluster, or via a cluster resource manager (e.g., YARN [16], Mesos [8], Omega [20]) that allocates resources to Sparrow along with other frameworks (e.g., traditional batch workloads).
Sparrow also makes approximations when scheduling and trades off many of the complex features supported by sophisticated, centralized schedulers in order to provide higher scheduling throughput and lower latency. In particular, Sparrow does not allow certain types of placement constraints (e.g., "my job should not be run on machines where User X's jobs are running"), does not perform bin packing, and does not support gang scheduling. Sparrow supports a small set of features in a way that can be easily scaled, minimizes latency, and keeps the design of the system simple. Many applications run low-latency queries from multiple users, so Sparrow enforces strict priorities or weighted fair shares when aggregate demand exceeds capacity. Sparrow also supports basic constraints over job placement, such as per-task constraints (e.g., each task needs to be co-resident with input data) and per-job constraints (e.g., all tasks must be placed on machines with GPUs). This feature set is similar to that of the Hadoop MapReduce scheduler [23] and the Spark [26] scheduler.

3 Sample-Based Scheduling for Parallel Jobs

A traditional task scheduler maintains a complete view of which tasks are running on which worker machines, and uses this view to assign incoming tasks to available workers. Sparrow takes a radically different approach: many schedulers operate in parallel, and schedulers do not maintain any state about cluster load. To schedule a job's tasks, schedulers rely on instantaneous load information acquired from worker machines. Sparrow's approach extends existing load balancing techniques [14, 18] to the domain of parallel job scheduling and introduces late binding to improve performance.

3.1 Terminology and job model

We consider a cluster composed of worker machines that execute tasks and schedulers that assign tasks to worker machines. A job consists of m tasks that are each allocated to a worker machine. Jobs can be handled by any scheduler. Workers run tasks in a fixed number of slots; we avoid more sophisticated bin packing because it adds complexity to the design. If a worker machine is assigned more tasks than it can run concurrently, it queues new tasks until existing tasks release enough resources for the new task to be run. We use wait time to describe the time from when a task is submitted to the scheduler until when the task begins executing and service time to describe the time the task spends executing on a worker machine. Job response time describes the time from when the job is submitted to the scheduler until the last task finishes executing. We use delay to describe the total delay within a job due to both scheduling and queueing. We compute delay by taking the difference between the job response time using a given scheduling technique, and job response time if all of the job's tasks had been scheduled with zero wait time (equivalent to the longest service time across all tasks in the job).

In evaluating different scheduling approaches, we assume that each job runs as a single wave of tasks. In real clusters, jobs may run as multiple waves of tasks when, for example, m is greater than the number of slots assigned to the user; for multiwave jobs, the scheduler can place some early tasks on machines with longer queueing delay without affecting job response time.

We assume a single wave job model when we evaluate scheduling techniques because single wave jobs are most negatively affected by the approximations involved in our distributed scheduling approach: even a single delayed task affects the job's response time. However, Sparrow also handles multiwave jobs.

3.2 Per-task sampling

Sparrow's design takes inspiration from the power of two choices load balancing technique [14], which provides low expected task wait times using a stateless, randomized approach. The power of two choices technique proposes a simple improvement over purely random assignment of tasks to worker machines: place each task on the least loaded of two randomly selected worker machines. Assigning tasks in this manner improves expected wait time exponentially compared to using random placement [14].¹

¹ More precisely, expected task wait time using random placement is 1/(1 − ρ), where ρ represents load. Using the least loaded of d choices, wait time in an initially empty system over the first T units of time is upper bounded by ∑_{i=1}^{∞} ρ^((d^i − d)/(d − 1)) + o(1) [14].

We first consider a direct application of the power of two choices technique to parallel job scheduling. The scheduler randomly selects two worker machines for each task and sends a probe to each, where a probe is a lightweight RPC. The worker machines each reply to the probe with the number of currently queued tasks, and the scheduler places the task on the worker machine with the shortest queue. The scheduler repeats this process for each task in the job, as illustrated in Figure 2(a). We refer to this application of the power of two choices technique as per-task sampling.

[Figure 2: Placing a parallel, two-task job with a probe ratio of d = 2. (a) Per-task sampling selects queues of length 1 and 3. (b) Batch sampling selects queues of length 1 and 2. Batch sampling outperforms per-task sampling because tasks are placed in the least loaded of the entire batch of sampled queues.]
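To make the probing loop concrete, here is a minimal simulation sketch of per-task sampling. It is not Sparrow's implementation: queue lengths are read from a local array rather than returned by probe RPCs, and the object and method names are our own.

    import scala.util.Random

    // Minimal sketch of per-task sampling (the power of two choices applied to each task).
    // Queue lengths are held in a local array; a real scheduler learns them via probe RPCs.
    object PerTaskSamplingSketch {
      def place(queueLengths: Array[Int], numTasks: Int, rand: Random): Seq[Int] = {
        (0 until numTasks).map { _ =>
          // Probe two workers chosen uniformly at random.
          val a = rand.nextInt(queueLengths.length)
          val b = rand.nextInt(queueLengths.length)
          // Place the task on the probed worker with the shorter queue.
          val chosen = if (queueLengths(a) <= queueLengths(b)) a else b
          queueLengths(chosen) += 1 // the task now waits in that worker's queue
          chosen
        }
      }

      def main(args: Array[String]): Unit = {
        val rand = new Random(42)
        val queues = Array.fill(10000)(rand.nextInt(4)) // synthetic initial load
        val placements = place(queues, numTasks = 100, rand = rand)
        println(s"longest chosen queue: ${placements.map(i => queues(i)).max}")
      }
    }

Because each task is placed independently, one unlucky pair of probes can still land a task on a long queue, which is exactly the tail-sensitivity problem quantified next.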

Per-task sampling improves performance compared to random placement but still performs 2× or more worse than an omniscient scheduler.² Intuitively, the problem with per-task sampling is that a job's response time is dictated by the longest wait time of any of the job's tasks, making average job response time much higher (and also much more sensitive to tail performance) than average task response time. We simulated per-task sampling and random placement in a cluster composed of 10,000 4-core machines with 1ms network round trip time. Jobs arrive as a Poisson process and are each composed of 100 tasks. The duration of a job's tasks is chosen from the exponential distribution such that across jobs, task durations are exponentially distributed with mean 100ms, but within a particular job, all tasks are the same duration.³ As shown in Figure 3, response time increases with increasing load, because schedulers have less success finding free machines on which to place tasks. At 80% load, per-task sampling improves performance by over 3× compared to random placement, but still results in response times equal to over 2.6× those offered by an omniscient scheduler.

² The omniscient scheduler uses a greedy scheduling algorithm based on complete information about which worker machines are busy. For each incoming job, the scheduler places the job's tasks on idle workers, if any exist, and otherwise uses FIFO queueing.

³ We use this distribution because it puts the most stress on our approximate, distributed scheduling technique. When tasks within a job are of different duration, the shorter tasks can have longer wait times without affecting job response time.

[Figure 3: Comparison of scheduling techniques (Random, Per-Task, Batch, Batch+Late Binding, Omniscient) in a simulated cluster of 10,000 4-core machines running 100-task jobs; response time (ms) as a function of cluster load.]

3.3 Batch sampling

Batch sampling improves on per-task sampling by sharing information across all of the probes for a particular job. Batch sampling is similar to a technique recently proposed in the context of storage systems [18]. With per-task sampling, one pair of probes may have gotten unlucky and sampled two heavily loaded machines (e.g., Task 1 in Figure 2(a)), while another pair may have gotten lucky and sampled two lightly loaded machines (e.g., Task 2 in Figure 2(a)); one of the two lightly loaded machines will go unused. Batch sampling aggregates load information from the probes sent for all of a job's tasks, and places the job's m tasks on the least loaded of all the worker machines probed. In the example shown in Figure 2, per-task sampling places tasks in queues of length 1 and 3; batch sampling reduces the maximum queue length to 2 by using both workers that were probed by Task 2 with per-task sampling.

To schedule using batch sampling, a scheduler randomly selects dm worker machines (for d ≥ 1). The scheduler sends a probe to each of the dm workers; as with per-task sampling, each worker replies with the number of queued tasks. The scheduler places one of the job's m tasks on each of the m least loaded workers. Unless otherwise specified, we use d = 2; we justify this choice of d in §7.9.

As shown in Figure 3, batch sampling improves performance compared to per-task sampling. At 80% load, batch sampling provides response times 0.73× those with per-task sampling. Nonetheless, response times with batch sampling remain a factor of 1.92× worse than those provided by an omniscient scheduler.
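For contrast with the per-task sketch above, here is a minimal sketch of the batch sampling selection step, under the same local-state simplification and with names of our own choosing: all d·m probe replies are pooled and the job's m tasks go to the m least loaded probed workers.

    import scala.util.Random

    // Minimal sketch of batch sampling: probe d*m workers once for the whole job,
    // then assign the job's m tasks to the m least loaded of the probed workers.
    object BatchSamplingSketch {
      def place(queueLengths: Array[Int], m: Int, d: Int, rand: Random): Seq[Int] = {
        // Choose d*m distinct workers to probe.
        val probed = rand.shuffle(queueLengths.indices.toList).take(d * m)
        // Keep the m probed workers with the shortest reported queues.
        val chosen = probed.sortBy(w => queueLengths(w)).take(m)
        chosen.foreach(w => queueLengths(w) += 1) // one task queues at each chosen worker
        chosen
      }
    }

The only change from per-task sampling is the pooled selection step, which is what prevents one task's unlucky probes from hurting the whole job.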
3.4 Problems with sample-based scheduling

Sample-based techniques perform poorly at high load due to two problems. First, schedulers place tasks based on the queue length at worker nodes. However, queue length provides only a coarse prediction of wait time. Consider a case where the scheduler probes two workers to place one task, one of which has two 50ms tasks queued and the other of which has one 300ms task queued. The scheduler will place the task in the queue with only one task, even though that queue will result in a 200ms longer wait time. While workers could reply with an estimate of task duration rather than queue length, accurately predicting task durations is notoriously difficult. Furthermore, almost all task duration estimates would need to be accurate for such a technique to be effective, because each job includes many parallel tasks, all of which must be placed on machines with low wait time to ensure good performance.

Sampling also suffers from a race condition where multiple schedulers concurrently place tasks on a worker that appears lightly loaded [13]. Consider a case where two different schedulers probe the same idle worker machine, w, at the same time. Since w is idle, both schedulers are likely to place a task on w; however, only one of the two tasks placed on the worker will arrive in an empty queue. The queued task might have been placed in a different queue had the corresponding scheduler known that w was not going to be idle when the task arrived.

3.5 Late binding

Sparrow introduces late binding to solve the aforementioned problems. With late binding, workers do not reply immediately to probes and instead place a reservation for the task at the end of an internal work queue. When this reservation reaches the front of the queue, the worker sends an RPC to the scheduler that initiated the probe requesting a task for the corresponding job. The scheduler assigns the job's tasks to the first m workers to reply, and replies to the remaining (d − 1)m workers with a no-op signaling that all of the job's tasks have been launched. In this manner, the scheduler guarantees that the tasks will be placed on the m probed workers where they will be launched soonest. For exponentially-distributed task durations at 80% load, late binding provides response times 0.55× those with batch sampling, bringing response time to within 5% (4ms) of an omniscient scheduler (as shown in Figure 3).
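A sketch of the worker-side behavior that late binding implies: reservations queue at the node monitor, and a task is requested from the scheduler only when a reservation reaches a free slot. The class and method names here are illustrative rather than Sparrow's API; the real node monitor does this over Thrift RPCs and also handles priorities and cancellation.

    import scala.collection.mutable

    // Sketch of a worker that enqueues reservations and requests a task
    // only when the reservation reaches the head of the queue (late binding).
    case class Reservation(jobId: String, schedulerId: String)

    class NodeMonitorSketch(slots: Int,
                            // Stand-in for the getTask() RPC back to the scheduler;
                            // returns Some(taskDescription), or None if the job is fully launched.
                            requestTask: Reservation => Option[String]) {
      private val queue = mutable.Queue[Reservation]()
      private var busySlots = 0

      def enqueueReservation(r: Reservation): Unit = { queue.enqueue(r); maybeLaunch() }
      def taskFinished(): Unit = { busySlots -= 1; maybeLaunch() }

      private def maybeLaunch(): Unit = {
        while (busySlots < slots && queue.nonEmpty) {
          val r = queue.dequeue()
          requestTask(r) match {
            case Some(task) => busySlots += 1; launch(task)
            case None       => () // no-op reply: all of the job's tasks were already launched
          }
        }
      }
      private def launch(task: String): Unit = println(s"launching $task")
    }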
The downside of late binding is that workers are idle while they are sending an RPC to request a new task from a scheduler. All current cluster schedulers we are aware of make this tradeoff: schedulers wait to assign tasks until a worker signals that it has enough free resources to launch the task. In our target setting, this tradeoff leads to a 2% efficiency loss compared to queueing tasks at worker machines. The fraction of time a worker spends idle while requesting tasks is (d · RTT)/(t + d · RTT) (where d denotes the number of probes per task, RTT denotes the mean network round trip time, and t denotes mean task service time). In our deployment on EC2 with an un-optimized network stack, mean network round trip time was 1 millisecond. We expect that the shortest tasks will complete in 100ms and that schedulers will use a probe ratio of no more than 2, leading to at most a 2% efficiency loss. For our target workload, this tradeoff is worthwhile, as illustrated by the results shown in Figure 3, which incorporate network delays. In environments where network latencies and task runtimes are comparable, late binding will not present a worthwhile tradeoff.
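As a worked example with the deployment numbers above (d = 2, RTT = 1 ms, t = 100 ms), the idle fraction evaluates to roughly 2%:

    (d · RTT) / (t + d · RTT) = (2 · 1 ms) / (100 ms + 2 · 1 ms) = 2/102 ≈ 2%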
3.6 Proactive Cancellation

When a scheduler has launched all of the tasks for a particular job, it can handle remaining outstanding probes in one of two ways: it can proactively send a cancellation RPC to all workers with outstanding probes, or it can wait for the workers to request a task and reply to those requests with a message indicating that no unlaunched tasks remain. We use our simulation to model the benefit of using proactive cancellation and find that proactive cancellation reduces median response time by 6% at 95% cluster load. At a given load ρ, workers are busy more than ρ of the time: they spend a ρ proportion of time executing tasks, but they spend additional time requesting tasks from schedulers. Using cancellation with 1ms network RTT, a probe ratio of 2, and with tasks that are an average of 100ms long reduces the time workers spend busy by approximately 1%; because response times approach infinity as load approaches 100%, the 1% reduction in time workers spend busy leads to a noticeable reduction in response times. Cancellation leads to additional RPCs if a worker receives a cancellation for a reservation after it has already requested a task for that reservation: at 95% load, cancellation leads to 2% additional RPCs. We argue that the additional RPCs are a worthwhile tradeoff for the improved performance, and the full Sparrow implementation includes cancellation. Cancellation helps more when the ratio of network delay to task duration increases, so it will become more important as task durations decrease, and less important as network delay decreases.

4 Scheduling Policies and Constraints

Sparrow aims to support a small but useful set of policies within its decentralized framework. This section describes support for two types of popular scheduler policies: constraints over where individual tasks are launched and inter-user isolation policies to govern the relative performance of users when resources are contended.

4.1 Handling placement constraints

Sparrow handles two types of constraints, per-job and per-task constraints. Such constraints are commonly required in data-parallel frameworks, for instance, to run tasks on a machine that holds the task's input data on disk or in memory. As mentioned in §2, Sparrow does not support many types of constraints (e.g., inter-job constraints) supported by some general-purpose resource managers.

Per-job constraints (e.g., all tasks should be run on a worker with a GPU) are trivially handled at a Sparrow scheduler. Sparrow randomly selects the dm candidate workers from the subset of workers that satisfy the constraint. Once the dm workers to probe are selected, scheduling proceeds as described previously.

Sparrow also handles jobs with per-task constraints, such as constraints that limit tasks to run on machines where input data is located. Co-locating tasks with input data typically reduces response time, because input data does not need to be transferred over the network. For jobs with per-task constraints, each task may have a different set of machines on which it can run, so Sparrow cannot aggregate information over all of the probes in the job using batch sampling. Instead, Sparrow uses per-task sampling, where the scheduler selects the two machines to probe for each task from the set of machines that the task is constrained to run on, along with late binding.

Sparrow implements a small optimization over per-task sampling for jobs with per-task constraints. Rather than probing individually for each task, Sparrow shares information across tasks when possible. For example, consider a case where task 0 is constrained to run on machines A, B, and C, and task 1 is constrained to run on machines C, D, and E. Suppose the scheduler probed machines A and B for task 0, which were heavily loaded, and probed machines C and D for task 1, which were both idle. In this case, Sparrow will place task 0 on machine C and task 1 on machine D, even though both machines were selected to be probed for task 1. A sketch of this probe-sharing idea appears after this subsection.

Although Sparrow cannot use batch sampling for jobs with per-task constraints, our distributed approach still provides near-optimal response times for these jobs, because even a centralized scheduler has only a small number of choices for where to place each task. Jobs with per-task constraints can still use late binding, so the scheduler is guaranteed to place each task on whichever of the two probed machines where the task will run sooner. Storage layers like HDFS typically replicate data on three different machines, so tasks that read input data will be constrained to run on one of three machines where the input data is located. As a result, even an ideal, omniscient scheduler would only have one additional choice for where to place each task.
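The sketch below illustrates the probe-sharing optimization on the task 0 / task 1 example above: probes are issued from each task's constrained set, but the replies are pooled and tasks are matched greedily to the least loaded probed machine they are allowed to use. The names and the greedy matching order are our simplification, not Sparrow's exact algorithm.

    // Sketch of per-task sampling with probe sharing for per-task constraints.
    // probedLoads maps each probed machine to its reported queue length;
    // allowed(t) is the set of machines task t may run on.
    object ConstrainedPlacementSketch {
      def place(tasks: Seq[Int],
                allowed: Int => Set[String],
                probedLoads: Map[String, Int]): Map[Int, String] = {
        var free = probedLoads                       // probe replies are shared across all tasks
        tasks.map { t =>
          val candidates = free.filter { case (machine, _) => allowed(t).contains(machine) }
          val (machine, _) = candidates.minBy(_._2)  // least loaded probed machine this task may use
          free -= machine
          t -> machine
        }.toMap
      }

      def main(args: Array[String]): Unit = {
        // Task 0 may run on A, B, C; task 1 on C, D, E. A and B were probed and are
        // heavily loaded; C and D were probed and are idle.
        val allowed = Map(0 -> Set("A", "B", "C"), 1 -> Set("C", "D", "E"))
        val loads = Map("A" -> 5, "B" -> 4, "C" -> 0, "D" -> 0)
        println(place(Seq(0, 1), allowed, loads)) // Map(0 -> C, 1 -> D)
      }
    }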
4.2 Resource allocation policies

Cluster schedulers seek to allocate resources according to a specific policy when aggregate demand for resources exceeds capacity. Sparrow supports two types of policies: strict priorities and weighted fair sharing. These policies mirror those offered by other schedulers, including the Hadoop Map Reduce scheduler [25].

Many cluster sharing policies reduce to using strict priorities; Sparrow supports all such policies by maintaining multiple queues on worker nodes. FIFO, earliest deadline first, and shortest job first all reduce to assigning a priority to each job, and running the highest priority jobs first. For example, with earliest deadline first, jobs with earlier deadlines are assigned higher priority. Cluster operators may also wish to directly assign priorities; for example, to give production jobs high priority and ad-hoc jobs low priority. To support these policies, Sparrow maintains one queue for each priority at each worker node. When resources become free, Sparrow responds to the reservation from the highest priority non-empty queue. This mechanism trades simplicity for accuracy: nodes need not use complex gossip protocols to exchange information about jobs that are waiting to be scheduled, but low priority jobs may run before high priority jobs if a probe for a low priority job arrives at a node where no high priority jobs happen to be queued. We believe this is a worthwhile tradeoff: as shown in §7.8, this distributed mechanism provides good performance for high priority users. Sparrow does not currently support preemption when a high priority task arrives at a machine running a lower priority task; we leave exploration of preemption to future work.

Sparrow can also enforce weighted fair shares. Each worker maintains a separate queue for each user, and performs weighted fair queuing [6] over those queues. This mechanism provides cluster-wide fair shares in expectation: two users using the same worker will get shares proportional to their weight, so by extension, two users using the same set of machines will also be assigned shares proportional to their weight. We choose this simple mechanism because more accurate mechanisms (e.g., Pisces [22]) add considerable complexity; as we demonstrate in §7.7, Sparrow's simple mechanism provides near-perfect fair shares.
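A sketch of how a node monitor can select the next reservation under strict priorities, as described above. This is an illustration with hypothetical names; it shows only the priority-queue case and omits the weighted fair queuing that Sparrow layers over per-user queues.

    import scala.collection.mutable

    // Sketch: per-priority reservation queues on a worker. When a slot frees up,
    // the worker serves the highest-priority non-empty queue (priority 0 is highest).
    class PriorityQueuesSketch[T](numPriorities: Int) {
      private val queues = Vector.fill(numPriorities)(mutable.Queue[T]())

      def enqueue(priority: Int, reservation: T): Unit = queues(priority).enqueue(reservation)

      // Returns None when every queue is empty.
      def nextReservation(): Option[T] =
        queues.collectFirst { case q if q.nonEmpty => q.dequeue() }
    }

Because each worker makes this choice locally, no coordination between nodes is needed, which is the simplicity-for-accuracy tradeoff discussed above.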
5 Analysis

Before delving into our experimental evaluation, we analytically show that batch sampling achieves near-optimal performance, regardless of the task duration distribution, given some simplifying assumptions. Section 3 demonstrated that Sparrow performs well, but only under one particular workload; this section generalizes those results to all workloads. We also demonstrate that with per-task sampling, performance decreases exponentially with the number of tasks in a job, making it poorly suited for parallel workloads.

To analyze the performance of batch and per-task sampling, we examine the probability of placing all tasks in a job on idle machines, or equivalently, providing zero wait time. Quantifying how often our approach places jobs on idle workers provides a bound on how Sparrow performs compared to an optimal scheduler.

We make a few simplifying assumptions for the purpose of this analysis. We assume zero network delay, an infinitely large number of servers, and that each server runs one task at a time. Our experimental evaluation shows results in the absence of these assumptions.

Table 1: Summary of notation.

  n          Number of servers in the cluster
  ρ          Load (fraction non-idle workers)
  m          Tasks per job
  d          Probes per task
  t          Mean task service time
  ρn/(mt)    Mean request arrival rate

Table 2: Probability that a job will experience zero wait time under three different scheduling techniques.

  Random Placement     (1 − ρ)^m
  Per-Task Sampling    (1 − ρ^d)^m
  Batch Sampling       ∑_{i=m}^{d·m} C(d·m, i) (1 − ρ)^i ρ^(d·m − i)

Mathematical analysis corroborates the results in §3 demonstrating that per-task sampling performs poorly for parallel jobs. The probability that a particular task is placed on an idle machine is one minus the probability that all probes hit busy machines: 1 − ρ^d (where ρ represents cluster load and d represents the probe ratio; Table 1 summarizes notation). The probability that all tasks in a job are assigned to idle machines is (1 − ρ^d)^m (as shown in Table 2) because all m sets of probes must hit at least one idle machine. This probability decreases exponentially with the number of tasks in a job, rendering per-task sampling inappropriate for scheduling parallel jobs. Figure 4 illustrates the probability that a job experiences zero wait time for both 10 and 100-task jobs, and demonstrates that the probability of experiencing zero wait time for a 100-task job drops to < 2% at 20% load.

[Figure 4: Probability that a job will experience zero wait time in a single-core environment using random placement, sampling 2 servers/task, and sampling 2m machines to place an m-task job, for 10 and 100 tasks/job.]

Batch sampling can place all of a job's tasks on idle machines at much higher loads than per-task sampling. In expectation, batch sampling will be able to place all m tasks in empty queues as long as d ≥ 1/(1 − ρ). Crucially, this expression does not depend on the number of tasks in a job (m). Figure 4 illustrates this effect: for both 10 and 100-task jobs, the probability of experiencing zero wait time drops from 1 to 0 at 50% load.⁴

⁴ With the larger, 100-task job, the drop happens more rapidly because the job uses more total probes, which decreases the variance in the proportion of probes that hit idle machines.

Our analysis thus far has considered machines that can run only one task at a time; however, today's clusters typically feature multi-core machines. Multicore machines significantly improve the performance of batch sampling. Consider a model where each server can run up to c tasks concurrently. Each probe implicitly describes load on c processing units rather than just one, which increases the likelihood of finding an idle processing unit on which to run each task. To analyze performance in a multicore environment, we make two simplifying assumptions: first, we assume that the probability that a core is idle is independent of whether other cores on the same machine are idle; and second, we assume that the scheduler places at most 1 task on each machine, even if multiple cores are idle (placing multiple tasks on an idle machine exacerbates the "gold rush effect" where many schedulers concurrently place tasks on an idle machine). Based on these assumptions, we can replace ρ in Table 2 with ρ^c to obtain the results shown in Figure 5. These results improve dramatically on the single-core results: for batch sampling with 4 cores per machine and 100 tasks per job, batch sampling achieves near perfect performance (99.9% of jobs experience zero wait time) at up to 79% load. This result demonstrates that, under some simplifying assumptions, batch sampling performs well regardless of the distribution of task durations.

[Figure 5: Probability that a job will experience zero wait time in a system of 4-core servers, for 10 and 100 tasks/job, under random placement, per-task sampling, and batch sampling.]
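The Table 2 expressions, and their multicore variant with ρ replaced by ρ^c, are straightforward to evaluate; the snippet below is our own check of the formulas (not code from the paper) and reproduces the kind of curves plotted in Figures 4 and 5.

    // Evaluate the zero-wait-time probabilities from Table 2.
    object ZeroWaitProbability {
      private def choose(n: Int, k: Int): BigDecimal =
        (1 to k).foldLeft(BigDecimal(1))((acc, i) => acc * (n - i + 1) / i)

      def random(rho: Double, m: Int): Double = math.pow(1 - rho, m)

      def perTask(rho: Double, d: Int, m: Int): Double = math.pow(1 - math.pow(rho, d), m)

      // P(at least m of the d*m probed workers are idle), each idle independently w.p. (1 - rho).
      def batch(rho: Double, d: Int, m: Int): Double =
        (m to d * m).map { i =>
          choose(d * m, i) * BigDecimal(math.pow(1 - rho, i)) * BigDecimal(math.pow(rho, d * m - i))
        }.sum.toDouble

      def main(args: Array[String]): Unit = {
        val (d, m, cores) = (2, 100, 4)
        for (load <- Seq(0.2, 0.5, 0.8)) {
          val rhoCore = math.pow(load, cores) // multicore variant: replace rho with rho^c
          println(f"load=$load%.1f  per-task=${perTask(load, d, m)}%.3f  " +
                  f"batch=${batch(load, d, m)}%.3f  batch(4-core)=${batch(rhoCore, d, m)}%.3f")
        }
      }
    }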
6 Implementation

We implemented Sparrow to evaluate its performance on a cluster of 110 Amazon EC2 virtual machines. The Sparrow code, including scripts to replicate our experimental evaluation, is publicly available at http://[Link]/radlab/sparrow.

6.1 System components

As shown in Figure 6, Sparrow schedules from a distributed set of schedulers that are each responsible for assigning tasks to workers. Because Sparrow does not require any communication between schedulers, arbitrarily many schedulers may operate concurrently, and users or applications may use any available scheduler to place jobs. Schedulers expose a service (illustrated in Figure 7) that allows frameworks to submit job scheduling requests using Thrift remote procedure calls [1]. Thrift can generate client bindings in many languages, so applications that use Sparrow for scheduling are not tied to a particular language. Each scheduling request includes a list of task specifications; the specification for a task includes a task description and a list of constraints governing where the task can be placed.

[Figure 6: Frameworks that use Sparrow are decomposed into frontends, which generate tasks, and executors, which run tasks. Frameworks schedule jobs by communicating with any one of a set of distributed Sparrow schedulers. Sparrow node monitors run on each worker machine and federate resource usage.]

[Figure 7: RPCs (parameters not shown) and timings associated with launching a job: submitRequest(), enqueueReservation(), getTask(), launchTask(), and taskComplete(), with the corresponding reserve time, queue time, get task time, and service time. Sparrow's external interface is shown in bold text and internal RPCs are shown in grey text.]

A Sparrow node monitor runs on each worker, and federates resource usage on the worker by enqueuing reservations and requesting task specifications from schedulers when resources become available. Node monitors run tasks in a fixed number of slots; slots can be configured based on the resources of the underlying machine, such as CPU cores and memory.

Sparrow performs task scheduling for one or more concurrently operating frameworks. As shown in Figure 6, frameworks are composed of long-lived frontend and executor processes, a model employed by many systems (e.g., Mesos [8]). Frontends accept high level queries or job specifications (e.g., a SQL query) from exogenous sources (e.g., a data analyst, web service, business application, etc.) and compile them into parallel tasks for execution on workers. Frontends are typically distributed over multiple machines to provide high performance and availability. Because Sparrow schedulers are lightweight, in our deployment, we run a scheduler on each machine where an application frontend is running to ensure minimum scheduling latency.

Executor processes are responsible for executing tasks, and are long-lived to avoid startup overhead such as shipping binaries or caching large datasets in memory. Executor processes for multiple frameworks may run co-resident on a single machine; the node monitor federates resource usage between co-located frameworks. Sparrow requires executors to accept a launchTask() RPC from a local node monitor, as shown in Figure 7; Sparrow uses the launchTask() RPC to pass on the task description (opaque to Sparrow) originally supplied by the application frontend.
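For illustration, the shape of that external interface written as Scala traits. The real interface is defined in Thrift and its exact fields differ; apart from submitRequest() and launchTask(), which the paper names, the types and fields below are our assumptions.

    // Illustrative shape of Sparrow's scheduling interface (not the real Thrift IDL).
    case class TaskSpec(taskDescription: Array[Byte],      // opaque to Sparrow
                        constraints: List[String])         // machines this task may run on

    case class ScheduleRequest(appId: String,
                               user: String,
                               tasks: List[TaskSpec])

    trait SchedulerService {
      // External interface: a frontend submits one job (one list of task specifications).
      def submitRequest(request: ScheduleRequest): Unit
    }

    trait ExecutorService {
      // Executors must accept launchTask() from the local node monitor, which
      // forwards the opaque task description supplied by the frontend.
      def launchTask(taskDescription: Array[Byte]): Unit
    }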
6.2 Spark on Sparrow

In order to test Sparrow using a realistic workload, we ported Spark [26] to Sparrow by writing a Spark scheduling plugin. This plugin is 280 lines of Scala code, and can be found at [Link]kayousterhout/spark/tree/sparrow.

The execution of a Spark query begins at a Spark frontend, which compiles a functional query definition into multiple parallel stages. Each stage is submitted as a Sparrow job, including a list of task descriptions and any associated placement constraints. The first stage is typically constrained to execute on machines that contain input data, while the remaining stages (which read data shuffled or broadcasted over the network) are unconstrained. When one stage completes, Spark requests scheduling of the tasks in the subsequent stage.
6.3 Fault tolerance

Because Sparrow schedulers do not have any logically centralized state, the failure of one scheduler does not affect the operation of other schedulers. Frameworks that were using the failed scheduler need to detect the failure and connect to a backup scheduler. Sparrow includes a Java client that handles failover between Sparrow schedulers. The client accepts a list of schedulers from the application and connects to the first scheduler in the list. The client sends a heartbeat message to the scheduler it is using every 100ms to ensure that the scheduler is still alive; if the scheduler has failed, the client connects to the next scheduler in the list and triggers a callback at the application. This approach allows frameworks to decide how to handle tasks that were in-flight during the scheduler failure. Some frameworks may choose to ignore failed tasks and proceed with a partial result; for Spark, the handler instantly relaunches any phases that were in-flight when the scheduler failed. Frameworks that elect to re-launch tasks must ensure that tasks are idempotent, because the task may have been partway through execution when the scheduler died. Sparrow does not attempt to learn about in-progress jobs that were launched by the failed scheduler, and instead relies on applications to re-launch such jobs. Because Sparrow is designed for short jobs, the simplicity benefit of this approach outweighs the efficiency loss from needing to restart jobs that were in the process of being scheduled by the failed scheduler.

While Sparrow's design allows for scheduler failures, Sparrow does not provide any safeguards against rogue schedulers. A misbehaving scheduler could use a larger probe ratio to improve performance, at the expense of other jobs. In trusted environments where schedulers are run by a trusted entity (e.g., within a company), this should not be a problem; in more adversarial environments, schedulers may need to be authenticated and rate-limited to prevent misbehaving schedulers from wasting resources.

Sparrow does not handle worker failures, as discussed in §8, nor does it handle the case where the entire cluster fails. Because Sparrow does not persist scheduling state to disk, in the event that all machines in the cluster fail (for example, due to a power loss event), all jobs that were in progress will need to be restarted. As in the case when a scheduler fails, the efficiency loss from this approach is minimal because jobs are short.
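A sketch of the client-side failover loop described above: heartbeat the current scheduler every 100ms, and on failure connect to the next scheduler in the list and fire a callback so the application can re-launch in-flight work. The class and parameter names are ours; Sparrow's Java client differs in detail.

    // Sketch of heartbeat-based failover across a list of schedulers.
    class FailoverClientSketch(schedulers: List[String],
                               ping: String => Boolean,       // true if the scheduler answered
                               onFailover: String => Unit) {  // application callback
      private var index = 0
      def currentScheduler: String = schedulers(index)

      // Intended to be called every 100ms by a timer thread.
      def heartbeat(): Unit = {
        if (!ping(currentScheduler)) {
          index = (index + 1) % schedulers.length   // connect to the next scheduler in the list
          onFailover(currentScheduler)              // e.g., Spark re-launches in-flight stages
        }
      }
    }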
7 Experimental Evaluation

We evaluate Sparrow using a cluster composed of 100 worker machines and 10 schedulers running on Amazon EC2. Unless otherwise specified, we use a probe ratio of 2. First, we use Sparrow to schedule tasks for a TPC-H workload, which features heterogeneous analytics queries. We provide fine-grained tracing of the overhead that Sparrow incurs and quantify its performance in comparison with an ideal scheduler. Second, we demonstrate Sparrow's ability to handle scheduler failures. Third, we evaluate Sparrow's ability to isolate users from one another in accordance with cluster-wide scheduling policies. Finally, we perform a sensitivity analysis of key parameters in Sparrow's design.

7.1 Performance on TPC-H workload

We measure Sparrow's performance scheduling queries from the TPC-H decision support benchmark. The TPC-H benchmark is representative of ad-hoc queries on business data, which are a common use case for low-latency data parallel frameworks.

Each TPC-H query is executed using Shark [24], a large scale data analytics platform built on top of Spark [26]. Shark queries are compiled into multiple Spark stages that each trigger a scheduling request using Sparrow's submitRequest() RPC. Tasks in the first stage are constrained to run on one of three machines holding the task's input data, while tasks in remaining stages are unconstrained. The response time of a query is the sum of the response times of each stage. Because Shark is resource-intensive, we use EC2 high-memory quadruple extra large instances, which each have 8 cores and 68.4GB of memory, and use 4 slots on each worker.

Ten different users launch random permutations of the TPC-H queries to sustain an average cluster load of 80% for a period of approximately 15 minutes. We report response times from a 200 second period in the middle of the experiment; during the 200 second period, Sparrow schedules over 20k jobs that make up 6.2k TPC-H queries. Each user runs queries on a distinct denormalized copy of the TPC-H dataset; each copy of the data set is approximately 2GB (scale factor 2) and is broken into 33 partitions that are each triply replicated in memory.

The TPC-H query workload has four qualities representative of a real cluster workload. First, cluster utilization fluctuates around the mean value of 80% depending on whether the users are collectively in more resource-intensive or less resource-intensive stages. Second, the stages have different numbers of tasks: the first stage has 33 tasks, and subsequent stages have either 8 tasks (for reduce-like stages that read shuffled data) or 1 task (for aggregation stages). Third, the duration of each stage is non-uniform, varying from a few tens of milliseconds to several hundred. Finally, the queries have a mix of constrained and unconstrained scheduling requests: 6.2k requests are constrained (the first stage in each query) and the remaining 14k requests are unconstrained.
To evaluate Sparrow's performance, we compare Sparrow to an ideal scheduler that always places all tasks with zero wait time, as described in §3.1. To compute the ideal response time for a query, we compute the response time for each stage if all of the tasks in the stage had been placed with zero wait time, and then sum the ideal response times for all stages in the query. Sparrow always satisfies data locality constraints; because the ideal response times are computed using the service times when Sparrow executed the job, the ideal response time assumes data locality for all tasks. The ideal response time does not include the time needed to send tasks to worker machines, nor does it include queueing that is inevitable during utilization bursts, making it a conservative lower bound on the response time attainable with a centralized scheduler.

Figure 8 demonstrates that Sparrow outperforms alternate techniques and provides response times within 12% of an ideal scheduler. Compared to randomly assigning tasks to workers, Sparrow (batch sampling with late binding) reduces median query response time by 4-8× and reduces 95th percentile response time by over 10×. Sparrow also reduces response time compared to per-task sampling (a naïve implementation based on the power of two choices): batch sampling with late binding provides query response times an average of 0.8× those provided by per-task sampling. Ninety-fifth percentile response times drop by almost a factor of two with Sparrow, compared to per-task sampling. Late binding reduces median query response time by an average of 14% compared to batch sampling alone. Sparrow also provides good absolute performance: Sparrow provides median response times just 12% higher than those provided by an ideal scheduler.

[Figure 8: Response times for TPC-H queries (q3, q4, q6, q12) using different placement strategies (Random, Per-task sampling, Batch sampling, Batch + late binding, Ideal). Whiskers depict 5th and 95th percentiles; boxes depict median, 25th, and 75th percentiles.]

7.2 Deconstructing performance

To understand the components of the delay that Sparrow adds relative to an ideal scheduler, we deconstruct Sparrow scheduling latency in Figure 9. Each line corresponds to one of the phases of the Sparrow scheduling algorithm depicted in Figure 7. The reserve time and queue time are unique to Sparrow; a centralized scheduler might be able to reduce these times to zero. However, the get task time is unavoidable: no matter the scheduling algorithm, the scheduler will need to ship the task to the worker machine.

[Figure 9: Latency distribution for each phase in the Sparrow scheduling algorithm (reserve time, queue time, get task time, service time).]

7.3 How do task constraints affect performance?

Sparrow provides good absolute performance and improves over per-task sampling for both constrained and unconstrained tasks. Figure 10 depicts the delay for constrained and unconstrained stages in the TPC-H workload using both Sparrow and per-task sampling. Sparrow schedules with a median of 7ms of delay for jobs with unconstrained tasks and a median of 14ms of delay for jobs with constrained tasks; because Sparrow cannot aggregate information across the tasks in a job when tasks are constrained, delay is longer. Nonetheless, even for constrained tasks, Sparrow provides a performance improvement over per-task sampling due to its use of late binding.

[Figure 10: Delay using both Sparrow and per-task sampling, for both constrained and unconstrained Spark stages. Whiskers depict 5th and 95th percentiles; boxes depict median, 25th, and 75th percentiles.]
7.4 How do scheduler failures impact job response time?

Sparrow provides automatic failover between schedulers and can fail over to a new scheduler in less than 120ms. Figure 11 plots the response time for ongoing TPC-H queries in an experiment parameterized as in §7.1, with 10 Shark frontends that submit queries. Each frontend connects to a co-resident Sparrow scheduler but is initialized with a list of alternate schedulers to connect to in case of failure. At time t=20, we terminate the Sparrow scheduler on node 1. The plot depicts response times for jobs launched from the Spark frontend on node 1, which fails over to the scheduler on node 2. The plot also shows response times for jobs launched from the Spark frontend on node 2, which uses the scheduler on node 2 for the entire duration of the experiment. When the Sparrow scheduler on node 1 fails, it takes 100ms for the Sparrow client to detect the failure, less than 5ms for the Sparrow client to connect to the scheduler on node 2, and less than 15ms for Spark to relaunch all outstanding tasks. Because of the speed at which failure recovery occurs, only 2 queries have tasks in flight during the failure; these queries suffer some overhead.

[Figure 11: TPC-H response times for two frontends submitting queries to a 100-node cluster. Node 1 suffers from a scheduler failure at 20 seconds.]

7.5 Synthetic workload

The remaining sections evaluate Sparrow using a synthetic workload composed of jobs with constant duration tasks. In this workload, ideal job completion time is always equal to task duration, which helps to isolate the performance of Sparrow from application-layer variations in service time. As in previous experiments, these experiments run on a cluster of 110 EC2 servers, with 10 schedulers and 100 workers.

7.6 How does Sparrow compare to Spark's native, centralized scheduler?

Even in the relatively small, 100-node cluster in which we conducted our evaluation, Spark's existing centralized scheduler cannot provide high enough throughput to support sub-second tasks.⁵ We use a synthetic workload where each job is composed of 10 tasks that each sleep for a specified period of time, and measure job response time. Since all tasks in the job are the same duration, ideal job response time (if all tasks are launched immediately) is the duration of a single task. To stress the schedulers, we use 8 slots on each machine (one per core). Figure 12 depicts job response time as a function of task duration. We fix cluster load at 80%, and vary task submission rate to keep load constant as task duration decreases. For tasks longer than 2 seconds, Sparrow and Spark's native scheduler both provide near-ideal response times. However, when tasks are shorter than 1355ms, Spark's native scheduler cannot keep up with the rate at which tasks are completing, so jobs experience infinite queueing.

⁵ For these experiments, we use Spark's standalone mode, which relies on a simple, centralized scheduler. Spark also allows for scheduling using Mesos; Mesos is more heavyweight and provides worse performance than standalone mode for short tasks.

[Figure 12: Response time when scheduling 10-task jobs in a 100 node cluster using both Sparrow and Spark's native scheduler (with the ideal response time shown for reference). Utilization is fixed at 80%, while task duration decreases.]

To ensure that Sparrow's distributed scheduling is necessary, we performed extensive profiling of the Spark scheduler to understand how much we could increase scheduling throughput with improved engineering. We did not find any one bottleneck in the Spark scheduler; instead, messaging overhead, virtual function call overhead, and context switching lead to a best-case throughput (achievable when Spark is scheduling only a single job) of approximately 1500 tasks per second. Some of these factors could be mitigated, but at the expense of code readability and understandability. A cluster with tens of thousands of machines running sub-second tasks may require millions of scheduling decisions per second; supporting such an environment would require 1000× higher scheduling throughput, which is difficult to imagine even with a significant rearchitecting of the scheduler. Clusters running low latency workloads will need to shift from using centralized task schedulers like Spark's native scheduler to using more scalable distributed schedulers like Sparrow.
400
Running Tasks 350 HP LP HP response LP response
300 load load time in ms time in ms
250 User 0 0.25 0 106 (111) N/A
200
150
User 1 0.25 0.25 108 (114) 108 (115)
100 0.25 0.5 110 (148) 110 (449)
50 0.25 0.75 136 (170) 40.2k (46.2k)
0 0.25 1.75 141 (226) 255k (270k)
0 10 20 30 40 50
Time (s) Table 3: Median and 95th percentile (shown in paren-
Figure 13: Cluster share used by two users that are theses) response times for a high priority (HP) and
each assigned equal shares of the cluster. User 0 sub- low priority (LP) user running jobs composed of 10
mits at a rate to utilize the entire cluster for the entire 100ms tasks in a 100-node cluster. Sparrow success-
experiment while user 1 adjusts its submission rate fully shields the high priority user from a low prior-
each 10 seconds. Sparrow assigns both users their ity user. When aggregate load is 1 or more, response
max-min fair share. time will grow to be unbounded for at least one user.

ter with tens of thousands of machines running sub-


second tasks may require millions of scheduling deci-
sions per second; supporting such an environment would over short time intervals. Nonetheless, as shown in Fig-
require 1000× higher scheduling throughput, which is ure 13, Sparrow quickly allocates enough resources to
difficult to imagine even with a significant rearchitecting User 1 when she begins submitting scheduling requests
of the scheduler. Clusters running low latency workloads (10 seconds into the experiment), and the cluster share
will need to shift from using centralized task schedulers allocated by Sparrow exhibits only small fluctuations
like Spark’s native scheduler to using more scalable dis- from the correct fair share.
tributed schedulers like Sparrow.

7.7 How well can Sparrow’s distributed 7.8 How much can low priority users hurt
fairness enforcement maintain fair response times for high priority users?
shares?
Table 3 demonstrates that Sparrow provides response
Figure 13 demonstrates that Sparrow’s distributed fair- times within 40% of an ideal scheduler for a high priority
ness mechanism enforces cluster-wide fair shares and user in the presence of a misbehaving low priority user.
quickly adapts to changing user demand. Users 0 and This experiment uses workers that each have 16 slots.
1 are both given equal shares in a cluster with 400 slots. The high priority user submits jobs at a rate to fill 25%
Unlike other experiments, we use 100 4-core EC2 ma- of the cluster, while the low priority user increases her
chines; Sparrow’s distributed enforcement works better submission rate to well beyond the capacity of the clus-
as the number of cores increases, so to avoid over stating ter. Without any isolation mechanisms, when the aggre-
performance, we evaluate it under the smallest number gate submission rate exceeds the cluster capacity, both
of cores we would expect in a cluster today. User 0 sub- users would experience infinite queueing. As described
mits at a rate to fully utilize the cluster for the entire in §4.2, Sparrow node monitors run all queued high pri-
7.8 How much can low priority users hurt response times for high priority users?

Table 3 demonstrates that Sparrow provides response times within 40% of an ideal scheduler for a high priority user in the presence of a misbehaving low priority user. This experiment uses workers that each have 16 slots. The high priority user submits jobs at a rate to fill 25% of the cluster, while the low priority user increases her submission rate to well beyond the capacity of the cluster. Without any isolation mechanisms, when the aggregate submission rate exceeds the cluster capacity, both users would experience infinite queueing. As described in §4.2, Sparrow node monitors run all queued high priority tasks before launching any low priority tasks, allowing Sparrow to shield high priority users from misbehaving low priority users. While Sparrow prevents the high priority user from experiencing infinite queueing delay, the high priority user still experiences 40% worse response times when sharing with a demanding low priority user than when running alone on the cluster. This is because Sparrow does not use preemption: high priority tasks may need to wait to be launched until low priority tasks complete. In the worst case, this wait time may be as long as the longest running low-priority task. Exploring the impact of preemption is a subject of future work.
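The queueing discipline described above, run every queued high priority task before any low priority task, but never preempt a running task, can be sketched as a per-worker priority queue. This is an illustration of the policy only; the class and method names are hypothetical and do not correspond to Sparrow's node monitor API.

    import heapq
    import itertools

    class WorkerQueue:
        """Per-worker queue: lower priority value = more important. Queued
        high priority tasks always launch before low priority ones, but a
        task that is already running is never preempted."""

        def __init__(self, slots):
            self.free_slots = slots
            self._heap = []                   # (priority, seq, task)
            self._seq = itertools.count()     # FIFO tie-break within a priority

        def enqueue(self, task, priority):
            heapq.heappush(self._heap, (priority, next(self._seq), task))

        def maybe_launch(self):
            """Return queued tasks to launch while slots are free, highest
            priority first; the caller runs them and reports completions."""
            launched = []
            while self.free_slots > 0 and self._heap:
                _, _, task = heapq.heappop(self._heap)
                self.free_slots -= 1
                launched.append(task)
            return launched

        def task_finished(self):
            self.free_slots += 1
            return self.maybe_launch()

Because new work is only launched when a slot frees up, a newly arrived high priority task can still wait behind a running low priority task, which is exactly the worst-case delay discussed above.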

Figure 14: Effect of probe ratio on job response time at two different cluster loads. Whiskers depict 5th and 95th percentiles; boxes depict median, 25th, and 75th percentiles. (Axes: Probe Ratio vs. Response Time (ms), at 80% and 90% load.)

Figure 15: Sparrow provides low median response time for jobs composed of 10 100ms tasks, even when those tasks are run alongside much longer jobs. Error bars depict 5th and 95th percentiles. (Axes: Duration of Long Tasks vs. Short Job Resp. Time (ms); configurations: 16 cores 50% long, 4 cores 50% long, 4 cores 10% long, Ideal.)

7.9 How sensitive is Sparrow to the probe ratio?

Changing the probe ratio affects Sparrow's performance most at high cluster load. Figure 14 depicts response time as a function of probe ratio in a 110-machine cluster of 8-core machines running the synthetic workload (each job has 10 100ms tasks). The figure demonstrates that using a small amount of oversampling significantly improves performance compared to placing tasks randomly: oversampling by just 10% (probe ratio of 1.1) reduces median response time by more than 2.5× compared to random sampling (probe ratio of 1) at 90% load. The figure also demonstrates a sweet spot in the probe ratio: a low probe ratio negatively impacts performance because schedulers do not oversample enough to find lightly loaded machines, but additional oversampling eventually hurts performance due to increased messaging. This effect is most apparent at 90% load; at 80% load, median response time with a probe ratio of 1.1 is just 1.4× higher than median response time with a larger probe ratio of 2. We use a probe ratio of 2 throughout our evaluation to facilitate comparison with the power of two choices and because non-integral probe ratios are not possible with constrained tasks.
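To make the probe ratio concrete, the sketch below shows the batch-sampling placement that the probe ratio parameterizes: for a job of m tasks and probe ratio d, a scheduler probes ⌈d·m⌉ randomly chosen workers and places the tasks on the least-loaded workers it probed. This is a simplification (it places tasks using reported queue lengths rather than Sparrow's late binding), and the function and variable names are ours, not Sparrow's.

    import math
    import random

    def place_job(queue_lengths, num_tasks, probe_ratio=2.0):
        """Batch sampling: probe ceil(probe_ratio * num_tasks) random workers
        and assign the job's tasks to the probed workers with the shortest
        queues. queue_lengths maps worker id -> queued/running tasks."""
        num_probes = min(len(queue_lengths), math.ceil(probe_ratio * num_tasks))
        probed = random.sample(list(queue_lengths), num_probes)
        probed.sort(key=lambda w: queue_lengths[w])   # least-loaded first
        return probed[:num_tasks]

    # A probe ratio of 1 degenerates to random placement; ratios slightly
    # above 1 (e.g., 1.1) already avoid most long queues at high load.
    workers = {w: random.randint(0, 4) for w in range(110)}
    print(place_job(workers, num_tasks=10, probe_ratio=1.1))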
7.10 Handling task heterogeneity
Sparrow does not perform as well under extreme task heterogeneity: if some workers are running long tasks, Sparrow schedulers are less likely to find idle machines on which to run tasks. Sparrow works well unless a large fraction of tasks are long and the long tasks are many orders of magnitude longer than the short tasks. We ran a series of experiments with two types of jobs: short jobs, composed of 10 100ms tasks, and long jobs, composed of 10 tasks of longer duration. Jobs are submitted to sustain 80% cluster load. Figure 15 illustrates the response time of short jobs when sharing the cluster with long jobs. We vary the percentage of jobs that are long, the duration of the long jobs, and the number of cores on the machine, to illustrate where performance breaks down. Sparrow provides response times for short tasks within 11% of ideal (100ms) when running on 16-core machines, even when 50% of tasks are 3 orders of magnitude longer. When 50% of tasks are 3 orders of magnitude longer, over 99% of the execution time across all jobs is spent executing long tasks; given this, Sparrow's performance is impressive. Short tasks see more significant performance degradation in a 4-core environment.

7.11 Scaling to large clusters

We used simulation to evaluate Sparrow's performance in larger clusters. Figure 3 suggests that Sparrow will continue to provide good performance in a 10,000 node cluster; of course, the only way to conclusively evaluate Sparrow's performance at scale will be to deploy it on a large cluster.

8 Limitations and Future Work

To handle the latency and throughput demands of low-latency frameworks, our approach sacrifices features available in general purpose resource managers. Some of these limitations of our approach are fundamental, while others are the focus of future work.

Scheduling policies When a cluster becomes oversubscribed, Sparrow supports aggregate fair-sharing or priority-based scheduling. Sparrow's distributed setting lends itself to approximated policy enforcement in order to minimize system complexity; exploring whether Sparrow can provide more exact policy enforcement without adding significant complexity is a focus of future work. Adding preemption, for example, would be a simple way to mitigate the effects of low-priority users' jobs on higher priority users.
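One way to picture the approximate enforcement mentioned above is a purely local rule at each worker: when a slot frees, launch a task from the backlogged user who is currently furthest below their weighted share on that worker. The sketch below illustrates that idea under this assumption; it is not Sparrow's implementation, and every name in it is invented.

    class FairShareWorkerQueue:
        """Local approximation of weighted fair sharing on a single worker.
        weights maps user -> positive weight."""

        def __init__(self, weights):
            self.weights = weights
            self.queues = {u: [] for u in weights}    # user -> pending tasks
            self.running = {u: 0 for u in weights}    # user -> running tasks

        def enqueue(self, user, task):
            self.queues[user].append(task)

        def next_task(self):
            backlogged = [u for u in self.queues if self.queues[u]]
            if not backlogged:
                return None
            # Users with the least running work per unit weight go first.
            user = min(backlogged, key=lambda u: self.running[u] / self.weights[u])
            self.running[user] += 1
            return self.queues[user].pop(0)

        def task_finished(self, user):
            self.running[user] -= 1

Because each worker decides with only local information, aggregate shares across the cluster are approximate, which matches the trade-off described above.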

Constraints Our current design does not handle inter-job constraints (e.g. "the tasks for job A must not run on racks with tasks for job B"). Supporting inter-job constraints across frontends is difficult to do without significantly altering Sparrow's design.

Gang scheduling Some applications require gang scheduling, a feature not implemented by Sparrow. Gang scheduling is typically implemented using bin-packing algorithms that search for and reserve time slots in which an entire job can run. Because Sparrow queues tasks on several machines, it lacks a central point from which to perform bin-packing. While Sparrow often places all jobs on entirely idle machines, this is not guaranteed, and deadlocks between multiple jobs that require gang scheduling may occur. Sparrow is not alone: many cluster schedulers do not support gang scheduling [8, 9, 16].

Query-level policies Sparrow's performance could be improved by adding query-level scheduling policies. A user query (e.g., a SQL query executed using Shark) may be composed of many stages that are each executed using a separate Sparrow scheduling request; to optimize query response time, Sparrow should schedule queries in FIFO order. Currently, Sparrow's algorithm attempts to schedule jobs in FIFO order; adding query-level scheduling policies should improve end-to-end query performance.

Worker failures Handling worker failures is complicated by Sparrow's distributed design, because when a worker fails, all schedulers with outstanding requests at that worker must be informed. We envision handling worker failures with a centralized state store that relies on occasional heartbeats to maintain a list of currently alive workers. The state store would periodically disseminate the list of live workers to all schedulers. Since the information stored in the state store would be soft state, it could easily be recreated in the event of a state store failure.
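A minimal sketch of the state store envisioned above: workers heartbeat periodically, entries whose heartbeats lapse are dropped, and schedulers periodically pull the resulting soft-state list of live workers. The class, method names, and timeout are hypothetical; this is not an existing Sparrow component.

    import time

    class StateStore:
        """Soft-state membership list rebuilt entirely from heartbeats."""

        def __init__(self, timeout_s=30.0):
            self.timeout_s = timeout_s
            self._last_heartbeat = {}        # worker id -> last heartbeat time

        def heartbeat(self, worker_id):
            self._last_heartbeat[worker_id] = time.monotonic()

        def live_workers(self):
            """List disseminated periodically to every scheduler."""
            now = time.monotonic()
            return [w for w, t in self._last_heartbeat.items()
                    if now - t < self.timeout_s]

Because the membership list is derived entirely from recent heartbeats, a restarted state store rebuilds it within roughly one heartbeat interval, which is what makes the soft-state argument above work.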
Dynamically adapting the probe ratio Sparrow could potentially improve performance by dynamically adapting the probe ratio based on cluster load; however, such an approach sacrifices some of the simplicity of Sparrow's current design. Exploring whether dynamically changing the probe ratio would significantly increase performance is the subject of ongoing work.

9 Related Work

Scheduling in distributed systems has been extensively studied in earlier work. Most existing cluster schedulers rely on centralized architectures. Among logically decentralized schedulers, Sparrow is the first to schedule all of a job's tasks together, rather than scheduling each task independently, which improves performance for parallel jobs.

Dean's work on reducing the latency tail in serving systems [5] is most similar to ours. He proposes using hedged requests where the client sends each request to two workers and cancels remaining outstanding requests when the first result is received. He also describes tied requests, where clients send each request to two servers, but the servers communicate directly about the status of the request: when one server begins executing the request, it cancels the counterpart. Both mechanisms are similar to Sparrow's late binding, but target an environment where each task needs to be scheduled independently (for data locality), so information cannot be shared across the tasks in a job.
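The hedged-request idea above is easy to express directly: send the same request to two workers, keep the first reply, and cancel the other. The asyncio sketch below is a generic illustration of that idea; it is not code from [5] and not how Sparrow issues probes.

    import asyncio

    async def hedged_request(request, worker_a, worker_b):
        """Send the request to two workers; return the first reply and
        cancel the straggler (client-side cancellation)."""
        tasks = [asyncio.create_task(worker_a(request)),
                 asyncio.create_task(worker_b(request))]
        done, pending = await asyncio.wait(tasks,
                                           return_when=asyncio.FIRST_COMPLETED)
        for task in pending:
            task.cancel()
        return done.pop().result()

Tied requests differ only in who performs the cancellation: the two servers notify each other directly, rather than the client cancelling the slower copy.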
Work on load sharing in distributed systems (e.g., [7]) also uses randomized techniques similar to Sparrow's. In load sharing systems, each processor both generates and processes work; by default, work is processed where it is generated. Processors re-distribute queued tasks if the number of tasks queued at a processor exceeds some threshold, using either receiver-initiated policies, where lightly loaded processors request work from randomly selected other processors, or sender-initiated policies, where heavily loaded processors offload work to randomly selected recipients. Sparrow represents a combination of sender-initiated and receiver-initiated policies: schedulers ("senders") initiate the assignment of tasks to workers ("receivers") by sending probes, but workers finalize the assignment by responding to probes and requesting tasks as resources become available.

Projects that explore load balancing tasks in multiprocessor shared-memory architectures (e.g., [19]) echo many of the design tradeoffs underlying our approach, such as the need to avoid centralized scheduling points. They differ from our approach because they focus on a single machine where the majority of the effort is spent determining when to reschedule processes amongst cores to balance load.

Quincy [9] targets task-level scheduling in compute clusters, similar to Sparrow. Quincy maps the scheduling problem onto a graph in order to compute an optimal schedule that balances data locality, fairness, and starvation freedom. Quincy's graph solver supports more sophisticated scheduling policies than Sparrow but takes over a second to compute a scheduling assignment in a 2500 node cluster, making it too slow for our target workload.

In the realm of data analytics frameworks, Dremel [12] achieves response times of seconds with extremely high fanout. Dremel uses a hierarchical scheduler design whereby each query is decomposed into a serving tree; this approach exploits the internal structure of Dremel queries so is not generally applicable.

Many schedulers aim to allocate resources at coarse granularity, either because tasks tend to be long-running or because the cluster supports many applications that each acquire some amount of resources and perform their own task-level scheduling (e.g., Mesos [8], YARN [16], Omega [20]). These schedulers sacrifice request granularity in order to enforce complex scheduling policies; as a result, they provide insufficient latency and/or throughput for scheduling sub-second tasks. High performance computing schedulers fall into this category: they optimize for large jobs with complex constraints, and target maximum throughput in the tens to hundreds of scheduling decisions per second (e.g., SLURM [10]). Similarly, Condor supports complex features including a rich constraint language, job checkpointing, and gang scheduling using a heavy-weight matchmaking process that results in maximum scheduling throughput of 10 to 100 jobs per second [4].

In the theory literature, a substantial body of work analyzes the performance of the power of two choices load balancing technique, as summarized by Mitzenmacher [15]. To the best of our knowledge, no existing work explores performance for parallel jobs. Many existing analyses consider placing balls into bins, and recent work [18] has generalized this to placing multiple balls concurrently into multiple bins. This analysis is not appropriate for a scheduling setting, because unlike bins, worker machines process tasks to empty their queue. Other work analyzes scheduling for single tasks; parallel jobs are fundamentally different because a parallel job cannot complete until the last of a large number of tasks completes.

Straggler mitigation techniques (e.g., Dolly [2], LATE [27], Mantri [3]) focus on variation in task execution time (rather than task wait time) and are complementary to Sparrow. For example, Mantri launches a task on a second machine if the first version of the task is progressing too slowly, a technique that could easily be used by Sparrow's distributed schedulers.

10 Conclusion

This paper presents Sparrow, a stateless decentralized scheduler that provides near optimal performance using two key techniques: batch sampling and late binding. We use a TPC-H workload to demonstrate that Sparrow can provide median response times within 12% of an ideal scheduler and survives scheduler failures. Sparrow enforces popular scheduler policies, including fair sharing and strict priorities. Experiments using a synthetic workload demonstrate that Sparrow is resilient to different probe ratios and distributions of task durations. In light of these results, we believe that distributed scheduling using Sparrow presents a viable alternative to centralized schedulers for low latency parallel workloads.

11 Acknowledgments

We are indebted to Aurojit Panda for help with debugging EC2 performance anomalies, Shivaram Venkataraman for insightful comments on several drafts of this paper and for help with Spark integration, Sameer Agarwal for help with running simulations, Satish Rao for help with theoretical models of the system, and Peter Bailis, Ali Ghodsi, Adam Oliner, Sylvia Ratnasamy, and Colin Scott for helpful comments on earlier drafts of this paper. We also thank our shepherd, John Wilkes, for helping to shape the final version of the paper. Finally, we thank the reviewers from HotCloud 2012, OSDI 2012, NSDI 2013, and SOSP 2013 for their helpful feedback.

This research is supported in part by a Hertz Foundation Fellowship, the Department of Defense through the National Defense Science & Engineering Graduate Fellowship Program, NSF CISE Expeditions award CCF-1139158, DARPA XData Award FA8750-12-2-0331, Intel via the Intel Science and Technology Center for Cloud Computing (ISTC-CC), and gifts from Amazon Web Services, Google, SAP, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, FitWave, General Electric, Hortonworks, Huawei, Microsoft, NetApp, Oracle, Samsung, Splunk, VMware, WANdisco and Yahoo!.

References

[1] Apache Thrift. [Link] org.

[2] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Why Let Resources Idle? Aggressive Cloning of Jobs with Dolly. In HotCloud, 2012.

[3] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the Outliers in Map-Reduce Clusters using Mantri. In Proc. OSDI, 2010.

[4] D. Bradley, T. S. Clair, M. Farrellee, Z. Guo, M. Livny, I. Sfiligoi, and T. Tannenbaum. An Update on the Scalability Limits of the Condor Batch System. Journal of Physics: Conference Series, 331(6), 2011.

[5] J. Dean and L. A. Barroso. The Tail at Scale. Communications of the ACM, 56(2), February 2013.

[6] A. Demers, S. Keshav, and S. Shenker. Analysis and Simulation of a Fair Queueing Algorithm. In Proc. SIGCOMM, 1989.

[7] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software Engineering, 1986.

[8] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A Platform For Fine-Grained Resource Sharing in the Data Center. In Proc. NSDI, 2011.

[9] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In Proc. SOSP, 2009.

[10] M. A. Jette, A. B. Yoo, and M. Grondona. SLURM: Simple Linux Utility for Resource Management. In Proc. Job Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science, pages 44–60. Springer, 2003.

[11] M. Kornacker and J. Erickson. Cloudera Impala: Real Time Queries in Apache Hadoop, For Real. [Link] 2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/, October 2012.

[12] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. Proc. VLDB Endow., 2010.

[13] M. Mitzenmacher. How Useful is Old Information? volume 11, pages 6–20, 2000.

[14] M. Mitzenmacher. The Power of Two Choices in Randomized Load Balancing. IEEE Transactions on Parallel and Distributed Computing, 12(10):1094–1104, 2001.

[15] M. Mitzenmacher. The Power of Two Random Choices: A Survey of Techniques and Results. In S. Rajasekaran, P. Pardalos, J. Reif, and J. Rolim, editors, Handbook of Randomized Computing, volume 1, pages 255–312. Springer, 2001.

[16] A. C. Murthy. The Next Generation of Apache MapReduce. [Link] com/blogs/hadoop/next-generation- [Link], February 2012.

[17] K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The Case for Tiny Tasks in Compute Clusters. In Proc. HotOS, 2013.

[18] G. Park. A Generalization of Multiple Choice Balls-into-Bins. In Proc. PODC, pages 297–298, 2011.

[19] L. Rudolph, M. Slivkin-Allalouf, and E. Upfal. A Simple Load Balancing Scheme for Task Allocation in Parallel Machines. In Proc. SPAA, 1991.

[20] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proc. EuroSys, 2013.

[21] B. Sharma, V. Chudnovsky, J. L. Hellerstein, R. Rifaat, and C. R. Das. Modeling and Synthesizing Task Placement Constraints in Google Compute Clusters. In Proc. SOCC, 2011.

[22] D. Shue, M. J. Freedman, and A. Shaikh. Performance Isolation and Fairness for Multi-Tenant Cloud Storage. In Proc. OSDI, 2012.

[23] T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009.

[24] R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and Rich Analytics at Scale. In Proc. SIGMOD, 2013.

[25] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay Scheduling: A Simple Technique For Achieving Locality and Fairness in Cluster Scheduling. In Proc. EuroSys, 2010.

[26] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In Proc. NSDI, 2012.

[27] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving MapReduce Performance in Heterogeneous Environments. In Proc. OSDI, 2008.
