
Untangling Cluster Management with Helix

Kishore Gopalakrishna, Shi Lu, Zhen Zhang, Adam Silberstein,


Kapil Surlaker, Ramesh Subramonian, Bob Schulman

LinkedIn
Mountain View, CA, USA
{kgopalakrishna,slu,zzhang,asilberstein,ksurlaker,rsubramonian,bschulman}@linkedin.com

ABSTRACT

Distributed data systems are used in a variety of settings like online serving, offline analytics, data transport, and search, among other use cases. They let organizations scale out their workloads using cost-effective commodity hardware, while retaining key properties like fault tolerance and scalability. At LinkedIn we have built a number of such systems. A key pattern we observe is that even though they may serve different purposes, they tend to have a lot of common functionality, and tend to use common building blocks in their architectures. One such building block that is just beginning to receive attention is cluster management, which addresses the complexity of handling a dynamic, large-scale system with many servers. Such systems must handle software and hardware failures, setup tasks such as bootstrapping data, and operational issues such as data placement, load balancing, planned upgrades, and cluster expansion.

All of this shared complexity, which we see in all of our systems, motivates us to build a cluster management framework, Helix, to solve these problems once in a general way. Helix provides an abstraction for a system developer to separate coordination and management tasks from component functional tasks of a distributed system. The developer defines the system behavior via a state model that enumerates the possible states of each component, the transitions between those states, and constraints that govern the system's valid settings. Helix does the heavy lifting of ensuring the system satisfies that state model in the distributed setting, while also meeting the system's goals on load balancing and throttling state changes. We detail several Helix-managed production distributed systems at LinkedIn and how Helix has helped them avoid building custom management components. We describe the Helix design and implementation and present an experimental study that demonstrates its performance and functionality.

Categories and Subject Descriptors: H.3.4 [Systems and Software]: Distributed systems
General Terms: Design, Algorithms

1. INTRODUCTION

The number and variety of distributed data systems (DDSs) have grown dramatically in recent years, and they are key infrastructure pieces in a variety of industrial and academic settings. These systems cover a number of use cases including online serving, offline analytics, search, and messaging. The appeal of such systems is clear: they let developers solve large-scale data management problems by letting them focus on business logic, rather than hard distributed systems problems like fault tolerance, scalability, and elasticity. And because these systems deploy on commodity hardware, they let organizations avoid purchasing and maintaining specialized hardware.

At LinkedIn, as at other web companies, our infrastructure stack follows the above model [14]. Our stack includes offline data storage, the data transport systems Databus and Kafka, the front-end data serving systems Voldemort and Espresso, and front-end search systems. We see the same model repeated in a variety of other well-known systems like Hadoop [2], HBase [4], Cassandra [1], MongoDB [7] and Hedwig [6].

These systems serve a variety of purposes and each provides unique features and tradeoffs, yet they share a lot of functionality. As evidence of shared functionality, we observe systems frequently reusing the same off-the-shelf components. A primary example is at the storage layer of DDSs. PNUTS [11], Espresso [14], and others use MySQL as an underlying storage engine, different Google systems build on GFS [10, 15], and HBase [4] builds on HDFS. These storage engines are robust building blocks used for lowest-level storage, while other key features like partitioning, replication, etc. are built at higher layers.

The storage layer, however, is by no means the only component amenable to sharing. What has been missing from LinkedIn's and other DDS stacks is shared cluster management, or an operating system for the cloud [16]. DDSs have a number of moving parts, often carrying state. Much of a system's complexity is devoted to ensuring its components are meeting SLAs during steady state as well as performing operational tasks that are common, but less routine. Handling failures, planned upgrades, background tasks such as backups, coalescing of data, updates, rollbacks, cluster expansion, and load balancing all require coordination that cannot be accomplished easily or effectively within the components themselves. Coordination is a first-class function of a DDS. By separating it into a cluster management component, the functional components of the distributed system are simpler to build.
Cluster Management  The term cluster management is a broad one. We define it as the set of common tasks required to run and maintain a DDS. These tasks are:
• Resource management: The resource (database, index, etc.) the DDS provides must be divided among different nodes in the cluster.
• Fault tolerance: The DDS must continue to function amid node failures, including not losing data and maintaining read and write availability.
• Elasticity: As workloads grow, clusters must grow to meet increased demand by adding more nodes. The DDS's resources must be redistributed appropriately across the new nodes.
• Monitoring: The cluster must be monitored, both for component failures that affect fault tolerance, and for various health metrics such as load imbalance and SLA misses that affect performance. Monitoring requires followup action, such as re-replicating lost data or re-balancing data across nodes.
We more formally define DDS cluster management tasks in Section 2.2.

Despite so much common ground, there is not a large and mature body of work in making cluster management a standalone generic building block as compared to the storage layer. That said, there is a burgeoning body of work in this area (as seen in YARN [3] and other systems discussed in Section 6).

There are a number of possible reasons. In the evolution of a DDS, the storage layer provides the primary functionality, and the capabilities required to manage and operate the cluster become apparent incrementally as the system is deployed. The most expedient way to add those capabilities is to incorporate them into the DDS components. It is not obvious in the early stages that generic cluster management functionality will be required, and as those capabilities are added, the system becomes more complex. We have found that it is possible to separate the cluster management tasks, and build multiple systems using a generic, standalone cluster manager.

While defining and building a generic cluster management system is a daunting challenge, the benefits are immense for a company like LinkedIn. We are building out many DDSs. If each of these can rely on a single cluster manager, we gain a lot of benefit from simpler and faster development time, as well as operational simplicity.

Helix  This paper presents Helix, a generic cluster management system we have developed at LinkedIn. Helix drastically simplifies cluster management by providing a simple abstraction for describing DDSs and exposing to DDSs a set of pluggable interfaces for declaring their correct behavior.

Key among the pluggable interfaces is an augmented finite state machine, which the DDS uses to encode its valid settings, i.e., its possible states and transitions, and associated constraints. The DDS may also specify, using an optimization module, goals for how its resources are best distributed over a cluster, i.e., how to place resources and how to throttle cluster transitions. Helix uses these to guide the actions it executes on the various components of the DDS. This includes monitoring system state for correctness and performance, and executing system transitions as needed. Helix automatically ensures the DDS is always valid, both in steady state and while executing transitions, and chooses optimal strategies during state changes, given the defined constraints. This approach lets DDS designers focus on modeling system behavior, rather than the complexity of choreographing that behavior.

In this paper we make the following contributions:
• A model for generic cluster management, including both defining the common components of a DDS and the functionality a cluster manager must provide (Section 2.2).
• The design and implementation of Helix. This includes its pluggable interfaces and examples of how we use it at LinkedIn to support a diverse set of production systems. We initially targeted a diverse set of systems: Espresso, a serving store; Databus, a change capture and transport system; and a Search-as-a-service system. (Sections 3 and 4).
• An experimental study of Helix. We show that Helix is highly responsive to key events like server failure. We also show how, through generic cluster management, we make it possible to debug DDS correctness while remaining agnostic to the DDS's actual purpose. (Section 5).
We review related work in Section 6. We conclude and discuss future work in Section 7.

2. MOTIVATION

At LinkedIn we have built and deployed a number of DDSs. These DDSs share many common attributes in how they are deployed, monitored, expanded, etc. However, these systems have differences in the specific optimization goals and constraints governing those common attributes. The goal of building a general cluster manager is very appealing for LinkedIn and the community. Providing a common way for a system to declare its own model, without worrying about how to force the system to conform to that model, makes creating that model easy and uniform. Moreover, it simplifies operational support, since there is only one cluster manager executing cluster changes across different systems, and we can concentrate on that manager's correctness.

Our goal in this work is to formally identify the common characteristics of DDSs and then build a cluster manager that both handles those requirements for the DDS and exposes interfaces so the DDS can declaratively describe its own system-specific model of behavior. The cluster manager then enforces the system-specific model of behavior.

2.1 Use Cases

Before delving into cluster manager requirements, it helps to review a sample of the DDSs that we target. We describe three LinkedIn systems. We have purposefully chosen them for their diversity, to ensure the commonalities we draw from them are truly found in most or all DDSs. These use cases are good examples, and are also some of the earliest use cases at LinkedIn that motivated our cluster management work.

Espresso  Espresso is a distributed, timeline consistent, scalable document store that supports local secondary indexing and local transactions. Espresso runs on a number of storage node servers that store and index data and answer queries. Espresso databases are horizontally partitioned into a number of partitions, with each partition having a certain number of replicas distributed across the storage nodes.
Espresso designates one replica of each partition as master and the rest as slaves; only one master may exist for each partition at any time. Espresso enforces timeline consistency, where only the master of a partition can accept writes to its records, and all slaves receive and apply the same writes through a replication stream. For load balancing, both master and slave partitions are assigned evenly across all storage nodes. For fault tolerance, Espresso adds the constraint that no two replicas of the same partition may be located on the same node.

To maintain high availability, Espresso must ensure every partition has an assigned master in the cluster. When a storage node fails, all partitions mastered on that node must have their mastership transferred to other nodes and, in particular, to nodes hosting slaves of those same partitions. When this happens, the load should be as evenly distributed as possible.

Espresso is elastic; we add storage nodes as the number/size of databases, or the request rate against them, increases. When nodes are added we migrate partitions from existing nodes to new ones. The migration must maintain balanced distribution in the cluster but also minimize unnecessary movement. Rebalancing partitions in Espresso requires copying and transferring significant amounts of data, and minimizing this expense is crucial. Even when minimized, we must throttle rebalancing so existing nodes are not overwhelmed.

Skewed workloads might cause a partition, and the storage node hosting it, to become a hot spot. Espresso must execute partition moves to alleviate any hot spots.

Search-as-a-service  LinkedIn's Search-as-a-service lets internal customers define custom indexes on a chosen dataset and then makes those indexes searchable via a service API. The index service runs on a cluster of machines. The index is broken into partitions and each partition has a configured number of replicas. Each cluster server runs an instance of the Sensei [8] system (an online index store) and hosts index partitions. Each new indexing service gets assigned to a set of servers, and the partition replicas must be evenly distributed across those servers.

Unlike Espresso, search indexes are not directly written to by external applications; instead they subscribe to external data streams such as Databus and Kafka [14] and take their writes from them. Therefore, there are no master-slave dynamics among the replicas of a partition; all replicas are simply offline or online.

When indexes are bootstrapped, the search service uses snapshots of the data source (rather than the streams) to create new index partitions. Bootstrapping is expensive, and the system limits the number of partitions that may concurrently bootstrap.

Databus  Databus is a change data capture (CDC) system that provides a common pipeline for transporting events from LinkedIn primary databases to caches within various applications. Databus deploys a cluster of relays that pull the change log from multiple databases and let consumers subscribe to the change log stream. Each Databus relay connects to one or more database servers and hosts a certain subset of databases (and partitions) from those database servers. Databus has the same concerns as Espresso and Search-as-a-service for assigning databases and partitions to relays.

Databus consumers have a cluster management problem as well. For a large partitioned database (e.g. Espresso), the change log is consumed by a bank of consumers. Each Databus partition is assigned to a consumer such that partitions are evenly distributed across consumers and each partition is assigned to exactly one consumer at a time. The set of consumers may grow over time, and consumers may leave the group due to planned or unplanned outages. In these cases, partitions must be reassigned, while maintaining balance and the single consumer-per-partition invariant.

2.2 Requirements

The above systems tackle very different use cases. As we discuss how they partition their workloads and balance them across servers, however, it is easy to see they have a number of common requirements, which we explicitly list here.
• Assignment of logical resources to physical servers: Our use cases all involve taking a system's set of logical resources and mapping them to physical servers. The logical entities can be database partitions as in Espresso, or a consumer as in the Databus consumption case. Note a logical entity may or may not have state associated with it, and a cluster manager must be aware of any cost associated with this state (e.g. movement cost).
• Fault detection and resource reassignment: All of our use case systems must handle cluster member failures by first detecting such failures, and second re-replicating and reassigning resources across the surviving members, all while satisfying the system's invariants and load balancing goals. For example, Espresso mandates a single master per partition, while Databus consumption mandates a consumer must exist for every database partition. When a server fails, the masters or consumers on that server must be reassigned.
• Elasticity: Similar to the failure detection and response requirement, systems must be able to incorporate new physical cluster entities by redistributing logical resources to those entities. For example, Espresso moves partitions to new storage nodes, and Databus moves database partitions to new consumers.
• Monitoring: Our use cases require we monitor systems to detect load imbalance, either because of skewed load against a system's logical partitions (e.g., an Espresso hot spot), or because a physical server becomes degraded and cannot handle its expected load (e.g., via disk failure). We must detect these conditions, e.g., by monitoring throughput or latency, and then invoke cluster transitions to respond.
Reflecting back on these requirements we observe a few key trends. They all involve encoding a system's optimal and minimally acceptable state, and having the ability to respond to changes in the system to maintain the desired state. In the subsequent sections we show how we incorporate these requirements into Helix.

3. DESIGN

This section discusses the key aspects of Helix's design by which it meets the requirements introduced in Section 2.2. Our framework layers system-specific behavior on top of generic cluster management. Helix handles the common management tasks while allowing systems to easily define and plug in system-specific logic.
In order to discuss distributed data systems in a general way we introduce some basic terminology:

3.1 DDS Terminology
• Node: A single machine.
• Cluster: A collection of nodes, usually within a single data center, that operate collectively and constitute the DDS.
• Resource: A logical entity defined by, and whose purpose is specific to, the DDS. Examples are a database, search index, or topic/queue in a pub-sub system.
• Partition: Resources are often too large, or must support too high a request rate, to maintain them in their entirety; instead they are broken into pieces. A partition is a subset of the resource. The manner in which the resource is broken up is system-specific; one common approach for a database is to horizontally partition it and assign records to partitions by hashing on their keys.
• Replica: For reliability and performance, DDSs usually maintain multiple copies of each partition, stored on different nodes. Copies of the same partition are known as replicas.
• State: The status of a partition replica in a DDS. A finite state machine defines all possible states of the system and the transitions between them. We also consider a partition's state to be the set of states of all of its replicas. For example, some DDSs allow one replica to be a master, which accepts reads and writes, while a slave accepts only reads.
• Transition: A DDS-defined action specified in a finite state machine that lets a replica move from one state to another.

This list gives us a generic set of concepts that govern most or all DDSs; all of the DDSs we have at LinkedIn follow this model. From the definitions, however, it is clear these concepts are quite different across DDSs. For example, a partition can be as diverse as a chunk of a database versus a set of pub/sub consumers. The goals for how these partitions should be distributed across nodes may be different. As a second example, some systems simply require a minimum number of healthy replicas, while others have more nuanced requirements, like master vs. slave.

We next introduce how Helix lets DDSs define their specific cluster manager requirements, and how Helix supports the requirements of any DDS without custom code.

3.2 Declarative System Behavior

The two key aspects of Helix that provide our desired "plug-and-play" capability are (1) an Augmented Finite State Machine (AFSM) that lets a system define the states and transitions of its replicas and constraints on its valid behavior for each partition; and (2) an optimization module through which systems provide optimization goals that impact performance, but not correctness. We now discuss these in sequence.

3.2.1 AFSM

We reason about DDS correctness at the partition level. A finite state machine has sufficient expressiveness for a system to describe, at a partition granularity, all its valid states and all legal state transitions. We have augmented it to also express constraints on both states and transitions. Declaring that a system must have exactly one master replica per partition is an example of a state constraint, while limiting the number of replicas that may concurrently bootstrap is an example of a transition constraint.

We introduce formal language for defining an AFSM. A DDS contains one or more resources. Each resource has a set of partitions P; each partition p_i ∈ P has a replica set R(p_i); R_σ(p_i) is the subset of R(p_i) in state σ, and R_τ(p_i) is the subset of R(p_i) undergoing transition τ. Note that for ease of use we differentiate states and transitions; internally, Helix treats both of these as states (changing from a state to a transition is instantaneous). N is the set of all nodes, and R(p_i, n_j) denotes the subset of R(p_i) replicas located on node n_j. We assume all nodes are identical, and defer handling heterogeneous node types to future work.

We now take a particular use case, Espresso as described in Section 2.2, and formally define its AFSM.

1. Replica states are {Master (M), Slave (S), Offline (O)}.

2. Legal state transitions are {O → S, S → M, M → S, S → O}.

3. ∀ p_i: |R_M(p_i)| ≤ 1 (a partition has at most one master).

4. ∀ p_i: |R_S(p_i)| ≥ 2 (a partition has at least two slaves).

5. ∀ p_i, ∀ n_j: |R(p_i, n_j)| ≤ 1 (at most one replica per partition on each node).

3.2.2 Optimization Module

While the AFSM lets DDSs declare their correct behavior at a partition level, the optimization module lets DDSs list optimizations at a variety of granularities: partition, node, resource, and cluster. These optimizations can be broken into two types: transition goals and placement goals. Note that the optimization goals are just that: Helix tries to achieve them, but not at the cost of cluster correctness, and correctness does not depend on them.

When Helix must invoke multiple replica transitions, it must often choose an ordering for those transitions, both to maintain DDS correctness during the transitions and in case of throttling. Transition goals let the DDS tell Helix how to prioritize those transitions. We denote by Π a transition preference list that ranks state transitions in order from highest to lowest priority.

Helix has many choices for how to place replicas on nodes. Placement goals let the DDS tell Helix how this should be done. The DDS can use these goals, for example, to achieve load balancing. The function σ(n_j) returns the number of replicas in state σ (across all partitions) on node n_j, τ(n_j) returns the number of replicas in transition τ on node n_j, and τ(C) returns the number of replicas in transition τ cluster-wide.

Espresso's optimizations are as follows:

1. ∀ p_i: |R_M(p_i)| = 1 (a partition has one master).

2. ∀ p_i: |R_S(p_i)| = 2 (a partition has two slaves).

3. Π = {S → M}, {O → S}, {M → S, S → O}.

4. minimize(max_{n_j ∈ N} M(n_j)).

5. minimize(max_{n_j ∈ N} S(n_j)).

6. max_{n_j ∈ N} (O → S)(n_j) ≤ 3.

7. (O → S)(C) ≤ 10.
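To make the notation concrete, the state constraints and placement goals above reduce to simple counts over a partition's replica map. The following is a minimal illustrative sketch in Java (our own example with hypothetical names, not Helix's implementation):

import java.util.Collections;
import java.util.Map;

// Illustrative check of Espresso's AFSM state constraints for one
// partition. replicaStates maps a node id to the state ("MASTER",
// "SLAVE", "OFFLINE") of that partition's replica on the node.
class ConstraintCheck {
  static boolean isPartitionValid(Map<String, String> replicaStates) {
    long masters = replicaStates.values().stream()
        .filter(s -> s.equals("MASTER")).count();
    long slaves = replicaStates.values().stream()
        .filter(s -> s.equals("SLAVE")).count();
    // Constraint 3: at most one master; constraint 4: at least two slaves.
    // Constraint 5 (one replica per node) holds by construction of the map.
    return masters <= 1 && slaves >= 2;
  }

  // Placement goal 4: the largest per-node master count, which Helix
  // tries to minimize when computing a target state.
  static int maxMastersPerNode(Map<String, Integer> mastersPerNode) {
    return mastersPerNode.isEmpty()
        ? 0 : Collections.max(mastersPerNode.values());
  }
}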
Algorithm 1 Helix execution algorithm
 1: repeat
 2:   validTrans = ∅
 3:   inflightTrans = ∅
 4:   for each partition p_i do
 5:     Read currentState
 6:     Compute targetState
 7:     Read p_i pendingTrans
 8:     inflightTrans.add(pendingTrans)
 9:     requiredTrans = computeTrans(currentState, targetState, pendingTrans)
10:     validTrans.add(getValidTransSet(requiredTrans))
11:   end for
12:   newTrans = throttleTrans(inflightTrans, validTrans)
13: until newTrans == ∅

Figure 1: Search-as-a-service state machine.


Goals 1 and 2 describe Espresso's preferences for numbers of masters and slaves. These are nearly identical to constraints 3 and 4 in Espresso's AFSM. Whereas Espresso can legally have 0 or 1 masters per partition, it prefers to have 1 so that the partition is available; whereas Espresso must have at least 2 slaves for reliability, it prefers to have exactly 2 and no more. Goal 3 tells Helix to prioritize slave-to-master transitions above all others, followed by offline-to-slave. The transitions that move replicas offline are lowest. This prioritization is reasonable for Espresso, which cannot make partitions available for writes without having a master, and has degraded fault tolerance when it lacks slaves. The order also minimizes downtime when new nodes are added, since slaves are then created on the new nodes before they are removed from existing nodes. Goals 4 and 5 encode Espresso's load balancing goals, which are to evenly distribute masters across all cluster nodes and to evenly distribute slaves across all cluster nodes. Goals 6 and 7 encode throttling goals, allowing no more than 3 concurrent offline-to-slave transitions per node and no more than 10 concurrent offline-to-slave transitions per cluster.

Figure 1 shows the detailed state machine for Search-as-a-service, another Helix-managed DDS. As with Espresso, the state model and constraints are configured in Helix. Search-as-a-service has a large number of replicas, and one of its main optimization goals is to throttle the maximum number of offline→bootstrap transitions.

3.3 Helix Execution

Thus far we have described how DDSs declare their behavior within Helix. We now describe how Helix invokes that behavior on behalf of the DDS at run-time. Helix execution is a matter of continually monitoring DDS state and, as necessary, ordering transitions on the DDS. There are a variety of changes that can occur within a system, both planned and unplanned: bootstrapping the DDS, adding or losing nodes, adding or deleting resources, adding or deleting partitions, among others (we discuss detecting the unplanned changes in Section 4).

The most crucial feature of the Helix transition algorithm is that it is identical across all of these changes and across all DDSs! Otherwise, we would face a great deal of complexity trying to manage so many combinations of changes and DDSs, and would lose much of Helix's generality. The execution algorithm appears in Algorithm 1 and we now step through it.

In lines 2-3 we initialize two transition sets, validTrans and inflightTrans, whose purposes we describe shortly. The first step, in lines 5-6, is, on a partition-by-partition basis, to read the DDS's current state and compute a target state, where the target state is a distribution of a partition's replicas over cluster nodes that respects the DDS's constraints and optimization goals. Most of the time, the current state will match the target state, and there is nothing to do; they disagree only when the cluster changes (e.g. nodes lost, partitions added, etc.).

By default we use the RUSH [13] algorithm to produce the target state, though with enhancements to ensure we meet the state machine constraints and our optimization goals. Default RUSH relies on random hashing of a huge number of partitions to meet load balancing goals. Since DDSs can have as few as tens of partitions per node, to avoid load skew and so better meet optimization goals, we additionally assign each node a budget that limits the number of partitions it may host. Helix makes it easy to plug in other algorithms; we discuss this more in Section 3.3.1.

Given a partition's target state, lines 7-9 read all pending transitions for the partition and then compute the necessary additional replica transitions. Given the current state and the state model, producing the set of all necessary transitions is straightforward and we omit the details.

The next part of the algorithm computes a set of valid transitions for the partition, taking into account the already pending transitions and those that still must occur. The main objective of computing the set of valid transitions is to maximize the transitions that can be done in parallel without violating the state constraints. Suppose we have T possible transitions for a partition. Note that the maximum possible value of T is the number of replicas. If Helix issues all these transitions in parallel, they can be executed in any order. To ensure correctness in the DDS, Helix would need to evaluate the system state for all T! possible orders in which the transitions are executed. The cost of evaluating the state correctness for T! permutations is exponential in the number of replicas.
Helix avoids the exponential time complexity by utilizing the fact that system constraints are expressed in terms of state, which means Helix only needs the count for each state at every intermediate stage. In other words, from Helix's point of view, (Node1:Master, Node2:Slave) is no different from (Node2:Master, Node1:Slave), since the count of masters and slaves is 1 each in both cases.

Our solution is to produce valid transition sets; for a given partition, a transition set is valid if, for every possible ordering of its transitions, the partition remains valid. Line 10 calls a method getValidTransSet that initializes a transition set to include all currently pending transitions for a partition and greedily adds other required transitions, as long as the transition set remains valid, until no more transitions can be added. It considers the transitions in priority order, according to the optimization goals given by the DDS.

Note we compute transition sets partition-by-partition. Since the AFSM is per-partition, we can safely execute a valid transition set for each partition without making any partition invalid. Thus, we have two potential dimensions for parallelism: by building transition sets per partition, and across partitions.

By line 12 we have a set of valid transitions to run across all partitions. We do not simply execute them all, but instead now take into account the DDS's throttling optimization goals. The throttleTrans method takes the set of all in-flight transitions (from prior rounds) and then selects as many additional transitions as possible without violating constraints. Notice that any non-scheduled valid transitions are essentially forgotten and are considered anew in later rounds.

Finally, not shown in the algorithm is what happens when transitions complete. We are notified by callback and remove those transitions from their partitions' lists of pending transitions.

The execution algorithm has two key greedy steps, producing valid transitions and choosing transitions for execution, rather than deriving both in a single optimization. While we could combine the two to produce a provably optimal solution, for the time being we find the current approach produces a high level of parallelism, with the highest priority transitions scheduled early.

3.3.1 Helix Modes of Execution

By default Helix places replicas using modified RUSH, as described in Section 3.3. While this approach makes things very simple for some applications, it may not be powerful enough for all applications, such as those that want to customize the placement of a single resource's partitions or even control the placement of multiple resources' partitions.

Helix supports 3 different execution modes, which allow applications to explicitly control the placement and state of their replicas.

The default execution mode is AUTO, in which Helix decides both the placement and state of each replica. This option is useful for applications where creation of a replica is not expensive. A typical example is evenly distributing a group of tasks among the currently alive processes. For example, if there are 60 tasks and 4 nodes, Helix assigns 15 tasks to each node. When one node fails Helix redistributes its 15 tasks to the remaining 3 nodes. Similarly, if a node is added, Helix re-allocates 3 tasks from each of the 4 nodes to the 5th node. Helix does the coordination of handing off a task from one node to another and ensures that a task is not performed by two nodes at any given time. The RUSH algorithm allows us to balance the tasks without having to reshuffle all task assignments. Databus consumer grouping uses this mode of execution.

The second Helix mode is SEMI-AUTO. The DDS decides the replica placement while Helix still chooses the state of those replicas. This is used where creation of an additional replica is expensive, as is typical in DDSs that have a large amount of data associated with each replica. The assumption here is that when a node fails, instead of remapping replica placement among the remaining nodes, only the states of the replicas change. For example, in Espresso, when a master fails, Helix promotes a slave to master instead of creating a new replica. This ensures mastership transfer is fast, without negatively impacting availability of the DDS. Helix provides a way for the application to reuse the RUSH algorithm and ensure that when a node fails, masterships are transferred evenly among the remaining nodes. Espresso uses this mode of execution. As a second example, HBase can use semi-auto placement to co-locate its region servers with the HDFS blocks containing those regions' data.

Helix offers a third mode called CUSTOM, in which DDSs completely control the placement and state of each replica. In this case Helix does the coordination to move the DDS from its current state to the final state expressed by the DDS. Helix provides a special interface with which the application provides such custom functionality when there are changes in the cluster. This functionality can reside on any of the nodes, and Helix ensures that it is executed on only one node, elected as a leader in the DDS. This mode is useful for applications that want to coordinate across multiple resources or apply additional logic to decide the final state of the replicas. It is important to note that the DDS need only express the final placement and state of replicas; Helix still computes the transitions needed and executes them such that constraints are not violated. This allows the application to still use the other features of Helix like throttling, the pluggable FSM, etc. Search-as-a-service uses this mode of execution. Having this feature also allows Espresso to be configured differently based on the replication channel. At LinkedIn we have deployed Espresso in production using native MySQL replication. One of the requirements for MySQL replication is that the replicas of all resources hosted on any node be in the same state (Master or Slave). With CUSTOM execution Helix lets Espresso control the state of all replicas across multiple resources so the state changes atomically.

Applications may want this varying degree of control either because they are very specialized or because they are wary of handing over all control to Helix. We do not want to turn away such applications, and so permit customized placement in Helix while allowing them to benefit from all of Helix's other features, rather than force such applications to build their own cluster management from scratch.

3.3.2 Execution Example

We consider the case when a node is added to an already existing cluster to illustrate the execution sequence in Helix. Suppose we start with an Espresso cluster with 3 nodes n_0 . . . n_2, 12 partitions p_0 . . . p_11, and a replication level of 3; there are 12 total replicas per node. We then add a node n_3. Intuitively, we want to rebalance the cluster such that every node is left with 9 replicas. In Helix execution we first compute a
target state for each p_i. For 9 of the 12 partitions the target state differs from the current state. In particular, 3 partitions get mastered on the new node and 6 partitions get slaved on the new node. We compute the required transitions for each partition. Suppose p_10 will have its master replica moved from n_1 to n_3. n_1 must execute t_1 = (M → S) for p_10, while n_3 must execute t_2 = (O → S) and t_3 = (S → M). We cannot, however, execute these transitions all in parallel, since some orderings make the system invalid. In order to avoid p_10 being mastered twice, t_3 must execute only after t_1 completes. Helix automatically enforces this, since it issues a transition if and only if it does not violate any of the state constraints. There are two possible valid groupings to reach this target state. One is to execute t_1 and t_2 in parallel and then execute t_3, while the other is to execute t_2, t_1, and t_3 sequentially. Helix chooses the first approach.

Across all partitions, we produce a set of 18 valid transitions that we can execute in parallel; however, since this Espresso cluster prohibits more than 10 from running in parallel, we produce a transition set of size 10, and save all remaining transitions (delayed due to validity or throttling) to later rounds. It is important to note that Helix does not wait for all 10 transitions to complete before issuing the remaining transitions. Instead, as soon as the first transition is completed, it reruns the execution algorithm and tries to issue additional transitions.
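The ordering rule in this example can be read as a small count-based gate: before issuing a transition, check that no interleaving of the in-flight transitions can push a state count past its constraint. A minimal sketch in Java (our own illustration with hypothetical names, not Helix's code):

import java.util.Collection;

// Illustrative gate for the example above: t_3 = (S -> M) on n_3 is
// held back while p_10's replica on n_1 is still a master, or its
// M -> S transition is still in flight, because issuing both in
// parallel admits an ordering that ends with two masters.
class TransitionGate {
  enum ReplicaState { MASTER, SLAVE, OFFLINE, M_TO_S, S_TO_M, O_TO_S }

  static boolean canIssueSlaveToMaster(Collection<ReplicaState> replicas) {
    long possibleMasters = replicas.stream()
        .filter(s -> s == ReplicaState.MASTER
                  || s == ReplicaState.M_TO_S   // may still be master
                  || s == ReplicaState.S_TO_M)  // already becoming master
        .count();
    // The AFSM allows at most one master per partition, so a new
    // S -> M is issued only if no replica can be a master under any
    // ordering of the in-flight transitions.
    return possibleMasters == 0;
  }
}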

3.4 Monitoring and Alerts


Helix provides functionality for monitoring cluster health, both for the sake of alerting human operators and to inform Helix-directed transitions. For example, operators and Helix may want to monitor request throughputs and latencies, at a number of granularities: per-server, per-partition, per-customer, per-database, etc. These metrics help systems detect whether they are meeting or missing latency SLAs, detect imbalanced load among servers (and trigger corrective action), and even detect failing components.

Ultimately, the DDS, rather than Helix, should choose what to monitor. Helix provides a generic framework for monitoring and alerting. The DDS submits the statistics and alert expressions it wants monitored to Helix, and then at run-time emits statistics to Helix that match those expressions. Helix is oblivious to the semantic meaning of the statistics, yet receives, stores, and aggregates them, and fires any alerts that trigger. In this paper we do not fully specify the framework, but instead give a few examples of its expressiveness and how it is used.

Helix stats follow the format (aggregateType)(statName). For example, a system might create

window(5)(db*.partition*.reqCount)

We use aggregation types to specify how stats should be maintained over time. In this case, the aggregation type is a window of the last 5 reported values. This stat uses wildcards to tell Helix to track the request count for every partition of every db. We also provide the aggregate types accumulate, which sums all values reported over time, and decay, which maintains a decaying sum.

Helix can maintain stats for their own sake; they are also a building block for alerts. An example alert is:

decay(0.5)(dbFoo.partition10.reqCount) > 100

Helix fires an alert if the time-decaying sum of request counts on dbFoo's partition 10 exceeds 100. Similarly, we can instantiate multiple alerts at once using wildcards:

decay(0.5)(dbFoo.partition*.reqCount) > 100

The aggregator outputs a column of request counts and an alert fires for each value above 100.

Alerts support simple stat enumeration and aggregation using pipeline operators. Each operator expects tuples with 1 or more input columns and outputs tuples with 1 or more columns (the valid number of input columns is operator specific). An example aggregating alert is:

decay(0.5)(dbFoo.partition*.failureCount, dbFoo.partition*.reqCount)|SUMEACH|DIVIDE > 100

This alert sums failure counts across partitions, sums request counts across all partitions, and divides to generate a database-wide failure rate.

The sets of aggregator and operator types within Helix are themselves easy to expand. The only requirement in adding an aggregator is that pipeline operators and comparators (">" in the above example) must be able to interpret its output. Pipeline operators must specify their required number of input and output columns. This lets users apply arbitrary operator chains, since Helix can determine in advance whether those chains are valid. While we have implemented a number of operators, the most commonly used for simple aggregations are SUMEACH, which sums each input column to produce an equal number of singleton columns, and SUM, which does row-wise summing.
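To illustrate what the pipeline in the aggregating alert above computes, the following sketch applies the described SUMEACH and DIVIDE semantics to per-partition failure and request counts (our own illustration, not Helix's operator implementation):

// Illustration of the SUMEACH|DIVIDE pipeline from the alert above:
// sum each input column, divide the resulting singletons to get a
// database-wide failure rate, and compare against the threshold.
class AlertPipelineExample {
  static boolean fires(double[] failureCounts, double[] reqCounts,
                       double threshold) {
    double failureSum = 0, reqSum = 0;
    for (double f : failureCounts) failureSum += f;  // SUMEACH, column 1
    for (double r : reqCounts) reqSum += r;          // SUMEACH, column 2
    double failureRate = failureSum / reqSum;        // DIVIDE
    return failureRate > threshold;                  // comparator ">"
  }

  public static void main(String[] args) {
    // Hypothetical per-partition stats for dbFoo.
    double[] failures = {2, 0, 5, 1};
    double[] requests = {100, 80, 120, 90};
    System.out.println(fires(failures, requests, 100));  // prints false
  }
}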
4. HELIX ARCHITECTURE

Section 3 describes how Helix lets a DDS specify its model of operation and then executes against it. This section presents the Helix system architecture that implements these concepts.

4.1 Helix Roles

Helix implements three roles, and each component of a DDS takes on at least one of them. The controller is a pure Helix component. It hosts the state machine engine, runs the execution algorithm, and issues transitions against the DDS. It is the core component of Helix.

The DDS's nodes are participants; they run the Helix library. The library is responsible for invoking callbacks whenever the controller initiates a state transition on the participant. The DDS is responsible for implementing those callbacks. In this way, Helix remains oblivious to the semantic meaning of a state transition, or how to actually implement the transition. For example, in a LeaderStandby model, the DDS implements the methods OnBecomeLeaderFromStandby and OnBecomeStandbyFromLeader.

The spectator role is for DDS components that need to observe system state. Spectators get a notification anytime a transition occurs. A typical spectator is the router component that appears in data stores like Espresso. The routers must know how partitions are distributed in order to direct client requests and so must be up-to-date about any partition moves. The routers do not store any partitions themselves, however, and the controller never executes transitions against them.

The roles are at a logical level. A DDS may contain multiple instances of each role, and they can be run on separate physical components, run in different processes on the same component, or even combined into the same process.

Dividing responsibility among the components brings a number of key advantages. (A) All global state is managed by the controller, and so the DDS can focus on implementing only the local transition logic. (B) The participant need not actually know the state model the controller is using to manage its DDS, as long as all the transitions required by that state model are implemented. For example, we can move a DDS from a MasterSlave model to a read-only SlaveOnly model with no changes to participants. The participants will simply never be asked to execute slave→master transitions. (C) Having a central decision maker avoids the complexity of having multiple components come to consensus on their roles.

4.2 Connecting Components with Zookeeper

Given the three Helix components, we now describe their implementation and how they interact. Zookeeper [9] plays an integral role in this aspect of Helix. The controller needs to determine the current state of the DDS and detect changes to that state, e.g., node failures. We can either build this functionality into the controller directly or rely on an external component that itself persists state and notifies on changes. When the controller chooses state transitions to execute, it must reliably communicate these to participants. Once complete, the participants must communicate their status back to the controller. Again, we can either build a custom communication channel between the controller and participants, or rely on an external system. The controller cannot be a single point of failure; in particular, when the controller fails, we cannot afford to lose either controller functionality or the state it manages.

Helix relies on Zookeeper to meet all of these requirements. We utilize Zookeeper's group membership and change notification to detect DDS state changes. Zookeeper is designed to maintain system state, and is itself fault tolerant. By storing all the controller's state in Zookeeper, we make the controller itself stateless and therefore simple to replace on a failure.

We also leverage Zookeeper to construct the reliable communication channel between controller and participants. The channel is modeled as a queue in Zookeeper, and the controller and participants act as producers and consumers on this queue. Producers can send multiple messages through the queue and consumers can process the messages in parallel. This channel brings side operational benefits like the ability to cancel transitions and to send other command messages between nodes.

Figure 2: Helix Architecture

Figure 2 illustrates the Helix architecture and brings together the different components and how they are represented in Zookeeper. The diagram is from the controller's perspective. The AFSM is specified by each DDS, but is then itself stored in Zookeeper. We also maintain the current states of all partition replicas and the target states for partition replicas in Zookeeper. Recall that any differences between these states trigger the controller to invoke state transitions. These transitions are written to the messaging queue for execution by the participants. Finally, Helix maintains a list of participants and spectators in Zookeeper as ephemeral nodes that heartbeat with Zookeeper. If any of the nodes die, Zookeeper notifies the controller so it can take corrective action.

4.3 DDS Integration with Helix

This section describes how a DDS deploys with Helix. The DDS must provide 3 things.

Define Cluster  The DDS defines the physical cluster and its physical nodes. Helix provides an admin API, illustrated here:

helix-admin --addCluster EspressoCluster
helix-admin --addNode EspressoCluster <esp10:1234>

Define Resource, State Model and Constraints  Once the physical cluster is established, the next step is to logically add the DDS, including the state model that defines it and the resources it will supply. Again, Helix provides an admin API:

helix-admin --addStateModel MasterSlave states=<M,S,O>
    legal_transitions=<O-S, S-M, M-S, S-O>
    constraints="count(M)<=1 count(S)<=3"
helix-admin --addResource clusterName=EspressoCluster
    resourceName=EspressoDB numPartitions=8 replica=3
    stateModelName=MasterSlave

Given the above admin commands, Helix itself computes an initial target state for the resource:

{
  "id" : "EspressoDB",
  "simpleFields" : {
    "IDEAL_STATE_MODE" : "AUTO",
    "NUM_PARTITIONS" : "8",
    "REPLICAS" : "1",
    "STATE_MODEL_DEF_REF" : "MasterSlave"
  },
  "mapFields" : {
    "EspressoDB_0" : { "node0" : "MASTER", "node1" : "SLAVE" },
    "EspressoDB_1" : { "node0" : "MASTER", "node1" : "SLAVE" },
    "EspressoDB_2" : { "node0" : "SLAVE", "node1" : "MASTER" },
    "EspressoDB_3" : { "node0" : "SLAVE", "node1" : "MASTER" }
  }
}
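The controller's job is then to reconcile the current states it observes in Zookeeper with a target mapping like the one above. As a rough per-partition illustration (hypothetical types and names, not Helix's code):

import java.util.HashMap;
import java.util.Map;

// Compare a partition's current replica states against its target
// states and emit the (node -> {fromState, toState}) transitions the
// controller would need to issue, subject to the AFSM constraints and
// throttling described in Section 3.3 (omitted here).
class TargetStateDiff {
  static Map<String, String[]> requiredTransitions(
      Map<String, String> current, Map<String, String> target) {
    Map<String, String[]> transitions = new HashMap<>();
    for (Map.Entry<String, String> e : target.entrySet()) {
      String node = e.getKey();
      String from = current.getOrDefault(node, "OFFLINE");
      String to = e.getValue();
      if (!from.equals(to)) {
        transitions.put(node, new String[] { from, to });
      }
    }
    return transitions;
  }
}

// For example, if EspressoDB_0 is currently {node0=SLAVE, node1=SLAVE}
// and the target is {node0=MASTER, node1=SLAVE}, the only required
// transition is node0: SLAVE -> MASTER.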
"node1" : "MASTER" }, auto replica placement offers, and so we added semi-auto
"EspressoDB_3" : { "node0" : "SLAVE", and custom modes. We also added a zone concept to support
"node1" : "MASTER" }, grouping a set of nodes together. This is useful, for example,
}
}
in rack-aware replica placement strategies.
The first Helix version did not include constraints and
Implement Callback Handlers The final task for a DDS goals on transitions, but instead satisfied this requirement
is to implement logic for each state transition encoded in by overloading state constraints and goals. This complicated
their state machine. We give a partial example of handler the algorithms for computing and throttling transitions. We
prototypes for Espresso’s use of MasterSlave. added support for throttling transitions at different cluster
granularities to remove this complexity.
EspressoStateModel extends StateModel DDS debugging and correctness checking One of the
{
void offlineToSlave(Message task,
most interesting benefits of Helix falls out of its ability to
NotificationContext context) manage a cluster without specific knowledge of the DDS’s
{ functionality by allowing the DDS to express them in the
// DDS Logic for state transition form of AFSM and constraints and optimization goals. We
} have built a debugging tool capable to analyzing a DDS’s
void slaveToMaster(Message task, correctness. Testing distributed systems is extremely dif-
NotificationContext context)
{
ficult and laborious, and it cannot be done with simple
//DDS logic for state transition unit tests. Generic debugging tools that function in the
} distributed setting are especially valuable.
} Our tool uses the ”instrument, simulate, analyze” method-
ology. This consists of
4.4 Scaling Helix into a Service • Instrument Insert probes that provide information about
The risks of having a single controller as we have described the state of the system over time. We use Zookeeper logs
so far is that it can become a bottleneck or a single point as ”probes.”
of failure (even if it is easy to replace). We now discuss our
• Simulate We use robots to create cluster error, killing
approach for distributing the controller that actually lets
components and injecting faults, both on the DDS and
Helix provide cluster management as a service.
Helix sides.
To avoid making the controller a single point of failure,
we deploy multiple controllers. A cluster, however, should • Analyze We parse the logs into a relational format, load
be managed by only one controller at a time. This itself them into a database as a set of tables and then write
can easily be expressed using a LeaderStandby state model invariants on the DDS’s constraints. Violation of the
with the constraint that every cluster must have exactly one constraints signals a possible bug.
controller managing it! Thus we set up multiple controllers Every DDS that uses Helix automatically benefits from
as a participants of a supercluster comprised of the different this tool and it has been crucial in improving performance
clusters themselves as the resource. One of the controllers and debugging production issues. One of the most difficult
manages this supercluster. If that controller fails, another parts of debugging a DDS is understanding the sequence of
controller gets selected as the Leader for the supercluster. events that leads to failure. Helix logging plus the debugging
This guarantees that each cluster has exactly one controller. tool make it easy to collect system information and recon-
The supercluster drives home Helix’s ability to manage struct the exact sequence of events that lead to the failure.
any DDS, in this case itself. Helix itself becomes a scalable As a concrete example, the experiment on planned down-
DDS that in turns manages multiple DDSs. In practice we time we describe in Section 5.2.4 uses this tool to measure
typically deploy a set of 3 controllers capable of managing total downtime and partition availability during software
over 50 clusters. upgrades. This lets us experiment with different upgrade
policies and ensure they are performant and valid.
5. EVALUATION
This section describes our experience running Helix in pro- 5.2 Experiments
duction at LinkedIn and then presents experiments demon- Our experiments draw from a combination of production
strating Helix’s functionality and production. and experimental environments, using the DDS Espresso.
The clusters are built from production-class commodity
5.1 Helix at LinkedIn servers. The details of the production clusters are confi-
From the beginning we built Helix for general purpose use, dential, but we present details on our experimental cluster
targeting Espresso, Search-as-a-service and Databus. As throughout our experiment discussion.
mentioned earlier, we chose these systems to ensure Helix be-
came a truly general cluster manager and not, say, a cluster 5.2.1 Bootstrap
manager for distributed databases. The systems were them- Our first experiment examines Helix’s performance when
selves under development and in need of cluster management bootstrapping an Espresso cluster, (i.e., adding the initial
functionality, and so were likewise attracted to Helix. At the nodes and databases). We note that while for some DDSs
time of this writing, all three of Espresso, Search-as-a-service bootstrapping is a rare operation, for others, like Databus
and Databus run in production supported by Helix. consumers, it happens frequently, and fast bootstrap times
With every Helix-DDS integration we found and repaired are crucial.
gaps in the Helix design, while maintaining its generality. We vary the number of allowed parallel transitions, add
For example, Search-as-a-service needed more control than nodes and an Espresso database, and then measure the time
Figure 3: Espresso transition parallelism vs. bootstrap time (bootstrap time in seconds vs. allowed parallelism, for 1024- and 2400-partition databases).

Figure 4: Espresso transition parallelism vs. failover time (failover time in seconds vs. allowed parallelism, for 150, 750, and 3600 partitions).
Figure 3 plots the result and shows that bootstrap time decreases with greater transition parallelism, though the benefits diminish once parallelism is in the tens of transitions.
The total bootstrap time is directly proportional to the number of transitions required and inversely proportional to the allowed parallelism. The number of required transitions is a function of the number of partitions, the number of replicas per partition, and the transitions drawn from Espresso's state machine. For example, an Espresso database with p partitions and r replicas requires pr offline→slave transitions and p slave→master transitions. Espresso sets allowed parallelism based on the number of cluster nodes and the number of concurrent transitions a node can handle. The final factor in deriving bootstrap time is the time required to execute each individual transition. These times are DDS dependent. For example, when an Espresso partition enters the slave state Espresso creates a MySQL database and all required tables for it; this takes tens of milliseconds. On bootstrap, this transition takes approximately 100 ms.
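As a rough illustration of how these factors combine (a back-of-envelope estimate on assumed numbers, not a measurement from this experiment), the transition work alone is approximately

    T_bootstrap ≈ (p·r + p) × t_transition / k

where t_transition is the average per-transition time and k is the allowed parallelism. For a hypothetical 1024-partition, 3-replica database with t_transition ≈ 100 ms and k = 30, this gives (4096 × 0.1 s) / 30 ≈ 14 s, before accounting for the Helix-side Zookeeper overhead discussed next.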
Although it is obvious that improving the parallelism in transitions will improve the time to bootstrap, the overhead added by Helix is proportional to the number of transitions that can be executed in parallel. Through experiments, we found the majority of this time is spent updating cluster metadata in Zookeeper. Helix employs many techniques to minimize this time, like asynchronous writes to Zookeeper, restricting writes to a Zookeeper node to a single process to avoid conflicts, and a group commit protocol to allow multiple updates in a single Zookeeper write.
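To make the group commit idea concrete, below is a minimal sketch of how pending updates to a single cluster-metadata record can be coalesced so that many logical updates cost one physical write. This is our illustration of the technique, not Helix's actual implementation; the MetadataStore interface is hypothetical and stands in for the Zookeeper client.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.UnaryOperator;

// Hypothetical store interface standing in for the Zookeeper client.
interface MetadataStore {
  String read(String path);
  void write(String path, String value);
}

// Minimal group-commit sketch (our illustration, not Helix's code):
// callers enqueue updates to one record; the single writer thread that
// owns the record drains the queue and applies every pending update in
// one read-modify-write, so N logical updates cost one physical write.
class GroupCommitQueue {
  private final ConcurrentLinkedQueue<UnaryOperator<String>> pending = new ConcurrentLinkedQueue<>();
  private final MetadataStore store;
  private final String path;

  GroupCommitQueue(MetadataStore store, String path) {
    this.store = store;
    this.path = path;
  }

  void submit(UnaryOperator<String> update) {
    pending.add(update);
  }

  // Called periodically by the single writer thread that owns this path.
  void flush() {
    List<UnaryOperator<String>> batch = new ArrayList<>();
    UnaryOperator<String> next;
    while ((next = pending.poll()) != null) {
      batch.add(next);
    }
    if (batch.isEmpty()) {
      return;
    }
    String value = store.read(path);
    for (UnaryOperator<String> update : batch) {
      value = update.apply(value);
    }
    store.write(path, value); // one write covers the whole batch
  }
}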
The graphs depict that even as we increase the number of partitions, by increasing parallelism, Helix is able to close the gap in bootstrap time. The gap is larger at low parallelism because the benefits from techniques like group commit kick in when parallelism is high. We also see that increasing parallelism eventually does not further lower bootstrap time.
5.2.2 Failure Detection
One of Helix's key requirements is failure detection, and its performance in this area must match what a DDS might achieve for itself with a custom solution. In Espresso, losing a node means all partitions mastered on that node become unavailable for write until Helix detects the failure and promotes slave replicas on other nodes to become masters. We run an experiment with an Espresso cluster of size 6, randomly kill a single node, and measure write unavailability. We vary the number of Espresso partitions and the maximum cluster-wide allowed parallel transitions. Figure 4 plots the result and shows most importantly that we achieve low unavailability in the 100s of ms, and that this time decreases as we allow for more parallel transitions. It is interesting to note that earlier versions of Helix had recovery times of multiple seconds. The reason is that when so many partitions transitioned from slave to master at once, Helix actually bottlenecked trying to record these transitions in Zookeeper. We ultimately solved that bottleneck using techniques like group commit and asynchronous reads/writes; the key point is we solved this problem once within Helix for every DDS and masked internal Zookeeper details from them.
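The slave-to-master promotion that Helix drives here is ultimately executed by callbacks the DDS registers for each partition. The following is an illustrative sketch of such a per-partition state model; the class and method names are ours, not the exact Helix API.

// Illustrative per-partition state model for a MasterSlave-style DDS.
// The class and method names are ours, not the exact Helix API; they
// show the kind of callbacks Helix invokes when it orders a transition.
class MasterSlavePartitionModel {
  private final String partitionId;

  MasterSlavePartitionModel(String partitionId) {
    this.partitionId = partitionId;
  }

  // OFFLINE -> SLAVE: create local storage and start catching up.
  void onBecomeSlaveFromOffline() {
    System.out.println("creating local store for " + partitionId);
  }

  // SLAVE -> MASTER: this is the promotion Helix performs on failover.
  void onBecomeMasterFromSlave() {
    System.out.println("accepting writes for " + partitionId);
  }

  // MASTER -> SLAVE: demote and resume replicating from the new master.
  void onBecomeSlaveFromMaster() {
    System.out.println("demoting " + partitionId + " to slave");
  }
}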
5.2.3 Elastic Cluster Expansion

[Figure 5: Elastic Cluster Expansion. Plot omitted; average partitions per node, split into master and slave, for the initial cluster (N=20) and after rebalances to N=25 and N=30.]

Another of Helix's key requirements is its ability to incorporate new nodes as clusters expand. Recall Algorithm 3.3 handles moving partitions from existing to new nodes by computing target states. We want to ensure rebalancing moves as few replicas as possible; in particular, in Espresso, we want to ensure that the number of replicas moved is minimized, since these involve transitions that copy data from one node to another, which is expensive. We run an experiment that starts with a cluster with 20 nodes, 4096 partitions and 3 replicas per partition. We then add 5 new nodes, Helix recalculates the placement, and we find that 19.6% of partitions are moved. The optimal percentage moved for full balance is clearly 20%; with this experiment's server and partition counts, achieving 20% movement requires moving partition fractions and so we instead achieve near balance. Finally, we add an additional 5 nodes. After Helix recalculates the placement we see that 16% of partitions are moved, close to the optimal solution. Figure 5 shows we are minimal in the number of partitions moved to ensure even load distribution. While RUSH is very good at accomplishing this type of balancing, RUSH does not inherently support the concept of different replica states and is by default designed for handling numbers of partitions that are orders of magnitude greater than the number of nodes. As discussed in Section 3 we modified it to handle different states and possibly much smaller numbers of partitions and still achieve close to minimal movement.
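To spell out the arithmetic behind these targets (an illustrative calculation consistent with the numbers above): the cluster holds 4096 × 3 = 12288 replicas, so after growing from 20 to 25 nodes an even spread places 12288 / 25 ≈ 491.5 replicas per node, and the 5 new nodes should together hold 5/25 = 20% of all replicas. A rebalance that only moves replicas onto the new nodes therefore moves 20% in the ideal fractional case; since replicas cannot be split across nodes, the integral assignment moves slightly less, matching the observed 19.6%. Growing from 25 to 30 nodes similarly targets 5/30 ≈ 16.7%, close to the observed 16%.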
larity, Helix helps DDSs avoid availability and performance
5.2.4 Planned upgrade problems, and manage planned downtime duration.
One function that Helix manages for DDSs is planned
downtime, which helps for a variety of administrative tasks, 5.2.5 Alerts
such as server upgrades. In Espresso bringing down a node Recall from Section 3 that Helix provides DDSs a frame-
involves moving every partition on it to the offline state and work for defining monitored statistics and alerts. Here we
transitioning partitions on other servers to compensate. In give some examples of alerts Espresso runs in production.
fact, there is a direct tradeoff between total time to com- Espresso monitors request latency to ensure it is meeting
plete planned downtime (e.g., to upgrade all nodes) versus its SLAs or take corrective action otherwise. Specifically it
accumulated unavailability over that time. registers the alert
Since Helix provides fault tolerance, a DDS system can decay(1)(node*.latency99th) > 100ms
upgrade its software by upgrading one node at a time. The Since the alert is wildcarded on node name, we separately
goal of this exercise is to choose a reasonable tradeoff be- monitor 99th percentile latency on each node in the Espresso
tween minimizing total upgrade time and minimizing un- cluster. The decaying sum setting of 1 means we moni-
availability during the upgrade. One simple approach is to tor the latest reported value from each node, and ignore
upgrade one node at a time. Though this approach is com- older values. Figure 8 shows a production graph displaying
mon, it may in reality reduce the availability of the system. the current monitored value for a particular Espresso node.
In this experiment we showcase the impact of concurrent The x and y axes are wall time and request latency, respec-
partition migration on a single node. We fix the total num- tively. A similar graph (not shown here) indicates whether
ber of migrations at 1000 and control the batch size of par- the alert is actually firing and, when fired, emails notifica-
titions that may be concurrently migrated. tion to Espresso operators.
What we see in this experiment is when 100 percent par- A second alert monitors the cluster-wide wide request er-
titions are migrated concurrently, the variation in downtime ror count:
per partition is very high compared to the variation when decay(1)(node*.errorcount)|SUMEACH > 50ms
25 percent of partitions are migrated. The unavailability is The alert is again wildcarded on node name, but now com-
measured as the sum total of unavailability for each parti- putes the sum of errors over all nodes. Figure 9 shows the
tion. The ideal percentage of partitions to choose for up- production graph and, again, there is an unshown corre-
grade is directly proportional to the maximum transitions sponding graph displaying the alert status.
that can concurrently occur on a single node.
Figure 6 shows that by choosing a lower concurrency level,
each migration is faster but the overall time taken to upgrade 6. RELATED WORK
a node is larger. The problems that systems dealing with cluster manage-
Figure 7 shows a different perspective by plotting the av- ment attempt to solve fall in 3 main categories: (a) resource
erage per partition unavailability. It also shows that the management, (b) enforcing correct system behavior in the
variance grows quite high as parallelism increases. While presence of faults and other changes, and (c) monitoring.
lower upgrade times are appealing, higher error counts dur- There are generic systems that solve resource manage-
ing upgrades are not. ment and monitoring, but not fault-tolerance, in a generic
The ability to balance downtime and unavailability is im- manner. Other DDSs support fault-tolerance and enforce
portant for DDSs so the upgrade process becomes smooth system behavior correctness in a way that is specific to that
and predictable. Beyond what is illustrated in the scope of particular DDS.
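The throttling described here can be sketched as a simple per-node cap on in-flight transitions. This is our illustration of the idea only; in Helix the cap is expressed declaratively as a constraint rather than written as application code.

import java.util.concurrent.Semaphore;

// Minimal throttling sketch: at most maxConcurrent partition transitions
// may be in flight on a node, so an upgrade drains partitions in small
// batches instead of taking them all offline at once.
class TransitionThrottle {
  private final Semaphore slots;

  TransitionThrottle(int maxConcurrent) {
    this.slots = new Semaphore(maxConcurrent);
  }

  void runTransition(String partitionId, Runnable transition) throws InterruptedException {
    slots.acquire();          // wait for a free transition slot
    try {
      System.out.println("migrating " + partitionId);
      transition.run();       // e.g., MASTER -> SLAVE -> OFFLINE for an upgrade
    } finally {
      slots.release();
    }
  }
}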
5.2.5 Alerts
Recall from Section 3 that Helix provides DDSs a framework for defining monitored statistics and alerts. Here we give some examples of alerts Espresso runs in production. Espresso monitors request latency to ensure it is meeting its SLAs, or to take corrective action otherwise. Specifically it registers the alert
decay(1)(node*.latency99th) > 100ms
Since the alert is wildcarded on node name, we separately monitor 99th percentile latency on each node in the Espresso cluster. The decaying sum setting of 1 means we monitor the latest reported value from each node, and ignore older values. Figure 8 shows a production graph displaying the current monitored value for a particular Espresso node. The x and y axes are wall time and request latency, respectively. A similar graph (not shown here) indicates whether the alert is actually firing and, when fired, emails a notification to Espresso operators.
A second alert monitors the cluster-wide request error count:
decay(1)(node*.errorcount)|SUMEACH > 50ms
The alert is again wildcarded on node name, but now computes the sum of errors over all nodes. Figure 9 shows the production graph and, again, there is an unshown corresponding graph displaying the alert status.
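To illustrate how such an expression might be evaluated (a sketch under our own assumptions about the statistics pipeline, not Helix's actual alert evaluator): with a decay weight of 1 the per-node value is simply the newest sample, and the SUMEACH aggregator sums the per-node values before the comparison.

import java.util.HashMap;
import java.util.Map;

// Sketch of evaluating an alert of the form
//   decay(w)(node*.stat) | SUMEACH > threshold
// With weight w = 1 the decayed value is just the latest sample from each
// node; smaller weights would blend in older samples. This is our
// illustration of the semantics described above, not Helix's evaluator.
class AlertEvaluatorSketch {
  private final double weight;                        // decay parameter
  private final Map<String, Double> decayed = new HashMap<>();

  AlertEvaluatorSketch(double weight) {
    this.weight = weight;
  }

  // Record a new sample reported by a node matching the wildcard.
  void report(String node, double sample) {
    double previous = decayed.getOrDefault(node, sample);
    decayed.put(node, weight * sample + (1.0 - weight) * previous);
  }

  // Per-node check (no aggregator): does any single node exceed the threshold?
  boolean firesOnAnyNode(double threshold) {
    return decayed.values().stream().anyMatch(v -> v > threshold);
  }

  // SUMEACH aggregator: sum the per-node values, then compare.
  boolean firesOnClusterSum(double threshold) {
    double sum = decayed.values().stream().mapToDouble(Double::doubleValue).sum();
    return sum > threshold;
  }
}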
[Figure 8: Alert reporting 99th percentile latency on a single Espresso node. Screenshot omitted.]
[Figure 9: Alert reporting the cluster-wide Espresso error count. Screenshot omitted.]

6. RELATED WORK
The problems that systems dealing with cluster management attempt to solve fall in 3 main categories: (a) resource management, (b) enforcing correct system behavior in the presence of faults and other changes, and (c) monitoring.
There are generic systems that solve resource management and monitoring, but not fault-tolerance, in a generic manner. Other DDSs support fault-tolerance and enforce system behavior correctness in a way that is specific to that particular DDS.

6.1 Generic Cluster Management Systems
Systems such as YARN [3] and Mesos [5] implement resource management and some monitoring capabilities in a generic way. Applications using these systems are responsible for implementing the specific logic to react to changes in the cluster and enforce correct system behavior. We detail Mesos as an example.
Mesos has 3 components: slave daemons that run on each cluster node, a master daemon that manages the slave daemons, and applications that run tasks on the slaves. The master daemon is responsible for allocating resources (cpu, ram, disk) across the different applications. It also provides a pluggable policy to allow the addition of new allocation modules. An application running on Mesos consists of a scheduler and an executor. The scheduler registers with the master daemon to request resources. The executor process runs on the slave nodes to run the application tasks. The master daemon determines how many resources to offer to each application. The application scheduler is responsible for deciding which offered resources to use. Mesos then launches the tasks on the corresponding slaves. Unlike Helix, Mesos does not provide a declarative way for applications to define their behavior and constraints on it and have it enforced automatically, but it provides resource allocation functionality that is complementary to Helix.

6.2 Custom Cluster Management
As mentioned before, distributed systems like PNUTS [11], HBase [4], HDFS [2], and MongoDB [7] implement cluster management in those specific systems. As an example, we examine MongoDB's cluster management.
MongoDB provides the concept of document collections, which are sharded into multiple chunks and assigned to servers in a cluster. MongoDB uses range partitioning and splits chunks when chunks grow too large. When the load of any node gets too large, the cluster must be rebalanced by adding more nodes to the cluster, and some chunks move to the new nodes.
Like any other distributed system, MongoDB must provide failover capability and ensure that a logical shard is always online. To do this, MongoDB assigns n servers (typically 2-3) to a shard. These servers are part of a replica set. Replica sets are a form of asynchronous master-slave replication and consist of a primary node and slave nodes that replicate from the primary. When the primary fails, one of the slaves is elected as the primary. This approach to fault tolerance is similar to the MasterSlave state model Espresso uses; Espresso gets it for free from Helix, while MongoDB had to build it.
Other distributed systems implement very similar cluster management capabilities, but each system reimplements it for itself, and not in a way that can be easily reused by other systems.

6.3 Domain Specific Distributed Systems
Hadoop/MapReduce [2, 12] provides a programming model and system for processing large data sets. It has been wildly successful, largely because it lets programmers focus on their jobs' logic, while masking the details of the distributed setting on which their jobs run. Each MR job consists of multiple map and reduce tasks, and the processing slots get assigned to these tasks. Hadoop takes care of monitoring the tasks and restarting them in case of failures. Hadoop, then, very successfully solves the problems we outline for cluster management, but only in the MR context. Helix aims to bring this ease of programming to DDS development.
Zookeeper [9] is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. It is used by many distributed systems to help implement cluster management functionality, and we of course use it heavily as a building block in Helix. It is important to note that Zookeeper alone does not solve the cluster management problems. For example, it provides functionality to notify a cluster manager when a node has died, but it does not plan and execute a response.

6.4 Distributed Resource Management
A number of older systems pioneered solutions in distributed resource management. Amoeba [?], Globus [?], GLUnix [?] and Legion [?] all manage large numbers of servers and present them as a single, shared resource to users. The challenges they tackle most relevant to Helix include placing resources, scheduling jobs, monitoring jobs for failures, and reliably disseminating messages among nodes. Much of the scheduling work relates closely to how we distribute resource partitions among nodes, and how we hope to make this distribution more dynamic in the future, where Helix will move partitions aggressively in response to load imbalance. On the other hand, we address failure monitoring in Helix with Zookeeper, which itself relates closely to these systems. In general, it is easy to see the lineage of these systems reflected in Helix.
7. CONCLUSION
At LinkedIn we have built a sophisticated infrastructure stack of DDSs, including offline data storage, data transport, data serving and search systems. This experience has put us in a great position to observe the complex, but common, cluster management tasks that pervade all of our DDSs. This motivated us to build Helix. We designed Helix with a diverse set of DDSs in mind to ensure its generality.
Helix lets DDSs declare their behavior through a set of pluggable interfaces. Chief among these interfaces are a state machine that lets the DDS declare the possible states of their partitions and the transitions between them, and constraints that must be met for each partition. In this way, DDS designers concentrate on the logic of the system, while the Helix execution algorithm carries out transitions in the distributed setting.
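As a concrete illustration of such a declaration (a sketch using names of our own choosing, not the exact Helix interface), a MasterSlave-style model can be captured as plain data.

import java.util.List;
import java.util.Map;

// Illustrative declaration of a MasterSlave-style state model as data
// (names, structure, and the example replica counts are ours, not the
// exact Helix interface). Helix reads a declaration like this and drives
// every partition toward it, so the DDS never orchestrates transitions.
class MasterSlaveDefinition {
  // Per-partition upper bound on replicas in each state (example values).
  static final Map<String, Integer> STATE_COUNTS =
      Map.of("MASTER", 1, "SLAVE", 2, "OFFLINE", Integer.MAX_VALUE);

  // Legal transitions, as (from, to) pairs.
  static final List<List<String>> TRANSITIONS = List.of(
      List.of("OFFLINE", "SLAVE"),
      List.of("SLAVE", "MASTER"),
      List.of("MASTER", "SLAVE"),
      List.of("SLAVE", "OFFLINE"));
}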
We have had a lot of success with Helix at LinkedIn. By providing the performance and functionality a DDS would normally target for itself, we have indeed offloaded cluster manager work from a number of DDSs. Helix even lets these DDSs make what would normally be drastic changes to the way they are managed with just minor changes to the Helix state model.
We have a number of future directions for Helix. One of them is to simply increase its adoption, a goal we expect our open-source release to accelerate. A second goal is to manage more complex types of clusters. The first challenge is to handle heterogeneous node types; we plan to approach this with the notion of node groups. We cluster nodes by capabilities (cpu, disk capacity, etc.) and weight more performant groups to host larger numbers of partitions. The second challenge is to manage clusters over increasingly complex network topologies, including those that span multiple data centers.
A third goal is to push more load balancing responsibility into Helix. Helix's alerting framework lets it observe imbalance, and we must continue to extend the Helix execution algorithm to respond to imbalance, yet remain completely generic across DDSs.
Another feature on the roadmap is to manage clusters that span across two or more geographical locations. Related to this is the requirement to manage up to a billion resources. This enables the DDSs to perform selective data placement and replication, handle data center failures, and route requests based on geographical affinity.

8. ACKNOWLEDGMENTS
In addition to the Helix team, many other members of the LinkedIn Data Infrastructure team helped significantly in the development of Helix. The initial work on Helix came out of the Espresso project. The Espresso team and, in particular, Tom Quiggle, Shirshanka Das, Lin Qiao, Swaroop Jagadish and Aditya Auradkar worked closely with us defining requirements and shaping the design. Chavdar Botev, Phanindra Ganti and Boris Shkolnik from the Databus team and Rahul Aggarwal, Alejandro Perez, Diego Buthay, Lawrence Tim, Santiago Perez-Gonzalez from the Search team were early adopters and helped solidify the early versions of Helix. We also thank Kevin Krawez, Zachary White, Todd Hendricks and David DeMaagd from Operations. David Zhang, Cuong Tran and Wai Ip were instrumental in stress testing Helix and improving the quality and performance.

9. REFERENCES
[1] Apache Cassandra. [Link]
[2] Apache Hadoop. [Link]
[3] Apache Hadoop NextGen MapReduce (YARN). [Link]
[4] Apache HBase. [Link]
[5] Apache Mesos. [Link]
[6] Hedwig. [Link]
[7] MongoDB. [Link]
[8] SenseiDB. [Link]
[9] Zookeeper. [Link]
[10] F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[11] B. F. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. In VLDB, 2008.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[13] R. Honicky and E. Miller. Replication under scalable hashing: A family of algorithms for scalable decentralized data distribution. In IPDPS, 2004.
[14] LinkedIn Data Infrastructure Team. Data infrastructure at LinkedIn. In ICDE, 2012.
[15] J. Shute et al. F1: The fault-tolerant distributed RDBMS supporting Google's ad business. In SIGMOD, 2012.
[16] M. Zaharia et al. The datacenter needs an operating system. In HotCloud, 2011.