CS 3700
Networks and Distributed Systems
Distributed Consensus and Fault Tolerance
(or, why can't we all just get along?)
Black Box Online Services
Black Box Service
Storing and retrieving data from online services is commonplace
We tend to treat these services as black boxes
Data goes in, we assume outputs are correct
We have no idea how the service is implemented
Black Box Online Services
debit_transaction(-$75)
OK
get_recent_transactions()
[, -$75, ]
Black Box Online Services
add_item_to_cart(Cheerios)
OK
get_cart()
[Lucky Charms, Cheerios]
Black Box Online Services
post_update(I LOLed)
OK
get_newsfeed()
[, {txt: I LOLed, likes: 87}]
Peeling Back the Curtain
Black Box Service
How are large services implemented?
Different types of services may have different requirements
Leads to different design decisions
Centralization
[Diagram: Bob sends debit_transaction(-$75) to a single server and gets OK; get_account_balance() returns $225; the server's record for Bob goes from $300 to $225]
Advantages of centralization
Easy to set up and deploy
Consistency is guaranteed (assuming correct software implementation)
Shortcomings
No load balancing
Single point of failure
Sharding
[Diagram: accounts are split across two shards, <A-M> and <N-Z>; Bob's debit_account(-$75) and get_account_balance() both go to the <A-M> shard, which replies OK and $225 as its record for Bob goes from $300 to $225]
Advantages of sharding
Better load balancing
If done intelligently, may allow incremental scalability
Shortcomings
Failures are still devastating
Replication
[Diagram: Bob's debit_account(-$75) is applied at three <A-M> replicas, which reach 100% agreement before replying OK; each replica's record for Bob goes from $300 to $225, and get_account_balance() returns $225]
Advantages of replication
Better load balancing of reads (potentially)
Resilience against failure; high availability (with some caveats)
Shortcomings
How do we maintain consistency?
Consistency Failures
[Diagram: replication failure scenarios, where a lost request, a missing ACK, a timeout, or too few reachable replicas leave the replicas disagreeing on Bob's balance ($300 vs. $225), with no agreement reached]
Asynchronous networks are problematic
The leader cannot disambiguate cases where requests and responses are lost
Byzantine Failures
[Diagram: one replica reports Bob's balance as $1000 instead of $300, so no agreement is reached]
In some cases, replicas may be buggy or malicious
When discussing Distributed Systems, failures due to
malice are known as Byzantine Failures
Name comes from the Byzantine generals problem
More on this later
Problem and Definitions
Build a distributed system that meets the following
goals:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures
Challenges:
Many, many different failure modes
Theory tells us that these goals are impossible to achieve
(more on this later)
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Forcing Consistency
[Diagram: debit_account(-$75) is applied at every replica ($300 to $225) and returns OK; a later debit_account(-$50) cannot be applied at every replica ($225 vs. $175), so the client gets an Error instead of an inconsistent result]
One approach to building distributed systems is to force them to be consistent
Guarantee that all replicas receive an update
Or none of them do
If consistency is guaranteed, then reaching consensus is trivial
Distributed Commit Problem
Application that performs operations on multiple replicas
or databases
We want to guarantee that all replicas get updated, or none do
Distributed commit problem:
1. Operation is committed when all participants can perform the
action
2. Once a commit decision is reached, all participants must
perform the action
These two steps give rise to the Two Phase Commit protocol
Motivating Transactions
[Diagram: transfer_money(Alice, Bob, $100) issues debit_account(Alice, -$100) and debit_account(Bob, $100); each call may independently return OK or an Error (Alice: $600 to $500, Bob: $300 to $400)]
System becomes inconsistent if any individual action
fails
Simple Transactions
transfer_money(Alice, Bob, $100)
begin_transaction()
debit_account(Alice, -$100)
debit_account(Bob, $100)
At this point, if there haven't been any errors, we say the transaction is committed
end_transaction()
[Diagram: once the transaction commits, Alice goes from $600 to $500, Bob goes from $300 to $400, and the client gets OK]
Actions inside a transaction behave as a single action
Simple Transactions
transfer_money(Alice, Bob, $100)
begin_transaction()
debit_account(Alice, -$100)
debit_account(Bob, $100)
Error
[Diagram: the debit to Bob fails, so the transaction aborts; Alice's partial debit ($600 to $500) is rolled back, leaving Alice at $600 and Bob at $300]
If any individual action fails, the whole transaction fails
Failed transactions have no side effects
Incomplete results during transactions are hidden
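As a rough illustration of this all-or-nothing behavior, here is a minimal Python sketch; AccountStore, transfer_money, and the exception type are hypothetical names, not an API from the slides.

```python
# Minimal sketch (hypothetical names): an in-memory account store with
# all-or-nothing transactions, mirroring the behavior described above.

class TransactionAborted(Exception):
    pass

class AccountStore:
    def __init__(self, balances):
        self.balances = dict(balances)      # committed state

    def transaction(self):
        return _Transaction(self)

class _Transaction:
    def __init__(self, store):
        self.store = store

    def __enter__(self):
        # Work on a private copy; partial results stay hidden (Isolation)
        self.pending = dict(self.store.balances)
        return self

    def debit_account(self, name, amount):
        if self.pending[name] + amount < 0:
            raise TransactionAborted(f"insufficient funds for {name}")
        self.pending[name] += amount

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Commit: all changes become visible at once (Atomicity)
            self.store.balances = self.pending
        # On any error: discard pending changes, leaving no side effects
        return False                         # re-raise the exception, if any

def transfer_money(store, src, dst, amount):
    with store.transaction() as tx:
        tx.debit_account(src, -amount)
        tx.debit_account(dst, amount)

store = AccountStore({"Alice": 600, "Bob": 300})
transfer_money(store, "Alice", "Bob", 100)
print(store.balances)                        # {'Alice': 500, 'Bob': 400}
```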
ACID Properties
Traditional transactional databases support the
following:
1. Atomicity: all or none; if transaction fails then no changes
are applied to the database
2. Consistency: there are no violations of database integrity
3. Isolation: partial results from incomplete transactions are
hidden
4. Durability: the effects of committed transactions are
permanent
Two Phase Commits (2PC)
There are well-known techniques for implementing transactions in centralized databases
E.g. journaling (append-only logs)
Out of scope for this class (take a database class, or CS 5600)
Two Phase Commit (2PC) is a protocol for implementing
transactions in a distributed setting
Protocol operates in rounds
Assume we have a leader or coordinator that manages transactions
Each replica states that it is ready to commit
Leader decides the outcome and instructs replicas to commit
or abort
Assume no byzantine faults (i.e. nobody is malicious)
2PC Example
[Diagram: Leader, Replica 1, Replica 2, Replica 3; each replica's stored value goes from x to y]
Begin by distributing the update: "txid = 678; value = y" (the txid is a logical clock)
Wait to receive "ready txid = 678" (ready to commit) from all replicas
Tell replicas to commit: "commit txid = 678"
Replicas reply "committed txid = 678"
At this point, all replicas are guaranteed to be up-to-date
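A minimal sketch of the coordinator side of this message flow, in Python; the leader.send()/leader.recv() helpers and the message formats are hypothetical stand-ins for the network, not a production protocol.

```python
# Sketch of the two-phase commit flow from the example above.
# send() and recv() are hypothetical networking helpers.

def two_phase_commit(leader, replicas, txid, value):
    # Phase 1: distribute the update and collect "ready" votes
    for r in replicas:
        leader.send(r, {"type": "write", "txid": txid, "value": value})

    ready = []
    for r in replicas:
        reply = leader.recv(r, timeout=1.0)      # None on timeout or error
        if reply and reply["type"] == "ready" and reply["txid"] == txid:
            ready.append(r)

    # Phase 2: commit only if *every* replica is ready, otherwise abort
    decision = "commit" if len(ready) == len(replicas) else "abort"
    for r in replicas:
        leader.send(r, {"type": decision, "txid": txid})

    # Wait for acknowledgements; a real leader would persist its decision
    # and keep retrying until every replica confirms
    for r in replicas:
        leader.recv(r, timeout=1.0)
    return decision
```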
Failure Modes
Replica Failure
Before or during the initial promise phase
Before or during the commit
Leader Failure
Before receiving all promises
Before or during sending commits
Before receiving all committed messages
Replica Failure (1)
[Diagram: Leader, Replica 1, Replica 2, Replica 3; Replica 3 fails]
The leader distributes "txid = 678; value = y", but only some replicas reply "ready txid = 678"
Error: not all replicas are ready; the same thing happens if a write or a ready is dropped, a replica times out, or a replica returns an error
The leader sends "abort txid = 678" and the replicas reply "aborted txid = 678"
Replica Failure (2)
[Diagram: all replicas reply "ready txid = 678", but Replica 3 fails before acknowledging the commit]
The leader sends "commit txid = 678"; Replicas 1 and 2 reply "committed txid = 678"
This is a known inconsistent state: the leader must keep retrying "commit txid = 678" until all commits succeed
Replica Failure (2)
[Diagram: Replica 3 reboots and sends "stat txid = 678" to the leader]
Replicas attempt to resume unfinished transactions when they reboot
The leader resends "commit txid = 678" and Replica 3 replies "committed txid = 678"
Finally, the system is consistent and may proceed
Leader Failure
What happens if the leader crashes?
Leader must constantly be writing its state to permanent
storage
It must pick up where it left off once it reboots
If there are unconfirmed transactions
Send new write messages, wait for ready to commit replies
If there are uncommitted transactions
Send new commit messages, wait for committed replies
Replicas may see duplicate messages during this
process
Thus, it's important that every transaction have a unique txid
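A sketch of how a replica might use the txid to deduplicate retried commits from a recovering leader; the Replica class and its fields are hypothetical.

```python
# Replica-side deduplication sketch: because a recovering leader may resend
# "commit", applying a commit must be idempotent, keyed by txid.

class Replica:
    def __init__(self):
        self.pending = {}      # txid -> value, written but not yet committed
        self.committed = {}    # txid -> value, already applied
        self.state = None

    def handle_commit(self, txid):
        if txid in self.committed:
            # Duplicate from a recovering leader: just re-acknowledge
            return {"type": "committed", "txid": txid}
        value = self.pending.pop(txid)
        self.state = value
        self.committed[txid] = value   # remembered until garbage collected
        return {"type": "committed", "txid": txid}
```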
Allowing Progress
Key problem: what if the leader crashes and never
recovers?
By default, replicas block until contacted by the leader
Can the system make progress?
Yes, under limited circumstances
After sending a ready to commit message, each replica
starts a timer
The first replica whose timer expires elects itself as the new
leader
Query the other replicas for their status
Send commits to all replicas if they are all ready
However, this only works if all the replicas are alive and
reachable
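A sketch of this limited recovery path, reusing the hypothetical send()/recv() helpers from the 2PC sketch above; a replica whose timer expires after sending "ready" queries the others and commits only if everyone is reachable and ready.

```python
# Sketch of replica-driven recovery after a leader failure (hypothetical
# helpers): commit only if every other replica answers "ready".

def try_recover(me, others, txid):
    statuses = []
    for r in others:
        me.send(r, {"type": "stat", "txid": txid})
        statuses.append(me.recv(r, timeout=1.0))       # None if unreachable
    if all(s and s["type"] == "ready" and s["txid"] == txid for s in statuses):
        for r in others:
            me.send(r, {"type": "commit", "txid": txid})
        return "commit"
    # If any replica is unreachable, it may already have committed or
    # aborted with the old leader, so the new leader must block
    return "blocked"
```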
New Leader
[Diagram: the Leader fails after collecting "ready txid = 678" from Replicas 1, 2, and 3]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678" and the other replicas reply "ready txid = 678"
Replica 2 sends "commit txid = 678" and the others reply "committed txid = 678"
The system is consistent again
Deadlock
[Diagram: the Leader and Replica 3 fail after the ready phase]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678"; Replica 1 replies "ready txid = 678", but Replica 3 never responds
Replica 2 cannot proceed, but it cannot abort either
Garbage Collection
2PC is somewhat of a misnomer: there is actually a third
phase
Garbage collection
Replicas must retain records of past transactions, just in
case the leader fails
For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed
Replicas must remember that this past transaction was
already committed, since committing a second time may lead
to inconsistencies
In practice, leader periodically tells replicas to garbage
collect
Transactions <= some txid in the past may be deleted
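A minimal sketch of this garbage collection step, assuming a hypothetical per-replica transaction log keyed by txid and a leader-announced watermark.

```python
# Sketch: drop remembered transactions at or below the leader's watermark.

def garbage_collect(transaction_log, watermark_txid):
    """Delete records of transactions with txid <= watermark_txid."""
    for txid in [t for t in transaction_log if t <= watermark_txid]:
        del transaction_log[txid]

log = {676: "committed", 677: "committed", 678: "committed"}
garbage_collect(log, watermark_txid=677)
print(log)   # {678: 'committed'}
```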
2PC Summary
Message complexity: O(2n)
The good: guarantees consistency
The bad:
Write performance suffers if there are failures during the
commit phase
Does not scale gracefully (possible, but difficult to do)
A pure 2PC system blocks all writes if the leader fails
Smarter 2PC systems still block all writes if the leader + 1 replica fail
2PC sacrifices availability in favor of consistency
Can 2PC be Fixed?
The issue with 2PC is its reliance on the centralized leader
Only the leader knows if a transaction is 100% ready to
commit or not
Thus, if the leader + 1 replica fail, recovery is impossible
Potential solution: Three Phase Commit
Add an additional round of communication
Tell all replicas to prepare to commit, before actually committing
State of the system can always be deduced by a subset
of alive replicas that can communicate with each other
unless there are partitions (more on this later)
3PC Example
[Diagram: Leader, Replica 1, Replica 2, Replica 3; each replica's stored value goes from x to y]
Begin by distributing the update: "txid = 678; value = y"
Wait to receive "ready txid = 678" (ready to commit) from all replicas
Tell all replicas that everyone is ready to commit: "prepare txid = 678"; replicas reply "prepared txid = 678"
Tell replicas to commit: "commit txid = 678"; replicas reply "committed txid = 678"
At this point, all replicas are guaranteed to be up-to-date
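A sketch of how the extra prepare round changes the coordinator loop relative to the 2PC sketch above; the helper names and message formats remain hypothetical.

```python
# Sketch of the extra "prepare" round that 3PC adds on top of 2PC,
# reusing the hypothetical leader.send()/recv() helpers from earlier.

def three_phase_commit(leader, replicas, txid, value):
    write = {"type": "write", "txid": txid, "value": value}
    # Phase 1: distribute the update, collect "ready" from every replica
    if not broadcast_and_wait(leader, replicas, write, "ready"):
        broadcast_and_wait(leader, replicas, {"type": "abort", "txid": txid}, "aborted")
        return "abort"
    # Phase 2: tell everyone that everyone is ready ("prepare")
    if not broadcast_and_wait(leader, replicas, {"type": "prepare", "txid": txid}, "prepared"):
        broadcast_and_wait(leader, replicas, {"type": "abort", "txid": txid}, "aborted")
        return "abort"
    # Phase 3: commit; any surviving replica now knows all were prepared
    broadcast_and_wait(leader, replicas, {"type": "commit", "txid": txid}, "committed")
    return "commit"

def broadcast_and_wait(leader, replicas, msg, expected):
    """Send msg to every replica; True iff all reply with the expected type."""
    for r in replicas:
        leader.send(r, msg)
    replies = [leader.recv(r, timeout=1.0) for r in replicas]
    return all(rep and rep["type"] == expected for rep in replies)
```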
Leader Failures
[Diagram: the Leader distributes "txid = 678; value = y", collects "ready txid = 678" from all replicas, and then fails]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678" and the other replicas reply "ready txid = 678"
Since no replica has been told to prepare, Replica 3 cannot be in the committed state, so it is okay to abort
Replica 2 sends "abort txid = 678" and the others reply "aborted txid = 678"
The system is consistent again
Leader Failures
[Diagram: the Leader sends "prepare txid = 678", collects "prepared txid = 678" from all replicas, and then fails]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678" and the other replicas reply "prepared txid = 678"
All replicas must have been ready to commit, so Replica 2 sends "commit txid = 678" and the others reply "committed txid = 678"
The system is consistent again
Oh Great, I Fixed Everything!
Wrong
3PC is not robust against network partitions
What is a network partition?
A split in the network, such that full n-to-n connectivity is
broken
i.e. not all servers can contact each other
Partitions split the network into one or more disjoint
subnetworks
How can a network partition occur?
A switch or a router may fail, or it may receive an incorrect
routing rule
A cable connecting two racks of servers may develop a fault
Network partitions are very real; they happen all the time
Partitioning
[Diagram: after the update "txid = 678; value = y" is distributed and "ready txid = 678" is collected, the network partitions into two subnets]
The leader assumes Replicas 2 and 3 have failed and moves on: it sends "prepare txid = 678", receives "prepared txid = 678" from Replica 1, and commits ("commit txid = 678" / "committed txid = 678")
Meanwhile, leader recovery is initiated in the other subnet, which aborts the transaction
The system is inconsistent
3PC Summary
Adds an additional phase vs. 2PC
Message complexity: O(3n)
Really four phases with garbage collection
The good: allows the system to make progress under
more failure conditions
The bad:
Extra round of communication makes 3PC even slower than
2PC
Does not work if the network partitions
2PC will simply deadlock if there is a partition, rather than become
inconsistent
In practice, nobody uses 3PC
The additional complexity and performance penalty just isn't worth it
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
A Moment of Reflection
Goals, revisited:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures
Achieving these goals may be harder than we thought :(
Huge number of failure modes
Network partitions are difficult to cope with
We haven't even considered byzantine failures
What Can Theory Tell Us?
Let's assume the network is synchronous and reliable
The algorithm can be divided into discrete rounds
If a message from r is not received in a round, then r must be faulty
Since we're assuming synchrony, packets cannot be delayed arbitrarily
During each round, r may send m <= n messages
n is the total number of replicas
A replica might crash before sending all n messages
If we are willing to tolerate f total failures (f < n), how
many rounds of communication do we need to
guarantee consensus?
Consensus in a Synchronous System
Initialization:
All replicas choose a value 0 or 1 (can generalize to more
values if you want)
Properties:
Agreement: all non-faulty processes ultimately choose the
same value
Either 0 or 1 in this case
Validity: if a replica decides on a value, then at least one
replica must have started with that value
This prevents the trivial solution of all replicas always choosing 0,
which is technically perfect consensus but is practically useless
Termination: the system must converge in finite time
Algorithm Sketch
Each replica maintains a map M of all known values
Initially, the map only contains the replica's own value
e.g. M = {replica1: 0}
Each round, broadcast M to all other replicas
On receipt, construct the union of the received M and the local M
Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas
Example with three non-faulty replicas (1, 3, and 5)
M = {replica1: 0, replica3: 1, replica5: 0}
The final value is the minimum of the values in M
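A toy simulation of this flooding algorithm under the synchronous, reliable-network assumptions; the function name and the crash model are illustrative only.

```python
# Toy simulation of the flooding consensus sketched above: run f + 1 rounds,
# union the maps, then decide min() of the known values.
# Simplification: a replica that crashes in a round still gets its broadcast
# out first; the true worst case allows partial sends within a round.

def synchronous_consensus(initial_values, f, crash_in_round=None):
    crash_in_round = crash_in_round or {}            # replica -> crash round
    known = {r: {r: v} for r, v in initial_values.items()}
    alive = set(initial_values)

    for rnd in range(1, f + 2):                      # f + 1 rounds
        messages = {r: dict(known[r]) for r in alive}     # this round's broadcasts
        alive -= {r for r, c in crash_in_round.items() if c == rnd}
        for receiver in alive:
            for M in messages.values():              # synchronous: all delivered
                known[receiver].update(M)

    # Every surviving replica ends with the same map, so the same decision
    return {r: min(known[r].values()) for r in alive}

values = {"r1": 0, "r2": 1, "r3": 0, "r4": 1, "r5": 0}
print(synchronous_consensus(values, f=2, crash_in_round={"r2": 1, "r4": 2}))
# -> {'r1': 0, 'r3': 0, 'r5': 0}
```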
Bounding Convergence Time
How many rounds will it take if we are willing to tolerate f
failures?
f + 1 rounds
Key insight: all replicas must be sure that all replicas that
did not crash have the same information (so they can
make the same decision)
Proof sketch, assuming f = 2
Worst case scenario is that replicas crash during rounds 1 and 2
During round 1, replica x crashes
All other replicas don't know if x is alive or dead
During round 2, replica y crashes
Clear that x is not alive, but unknown if y is alive or dead
During round 3, no more replicas may crash
All replicas are guaranteed to receive updated info from all other replicas
A More Realistic Model
The previous result is interesting, but unrealistic
We assume that the network is synchronous and reliable
Of course, neither of these things is true in reality
What if the network is asynchronous and reliable?
Replicas may take an arbitrarily long time to respond to
messages
Let's also assume that all faults are crash faults
i.e. if a replica has a problem it crashes and never wakes up
No byzantine faults
The FLP Result
There is no asynchronous algorithm that achieves
consensus on a 1-bit value in the presence of crash
faults. The result is true even if no crash actually occurs!
This is known as the FLP result
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson,
1985
Extremely powerful result because:
If you can't agree on 1 bit, generalizing to larger values isn't going to help you
If you can't converge with crash faults, no way you can converge with byzantine faults
If you can't converge on a reliable network, no way you can on an unreliable one
FLP Proof Sketch
In an asynchronous system, a replica x cannot tell
whether a non-responsive replica y has crashed or it is
just slow
What can x do?
If it waits, it will block since it might never receive the
message from y
If it decides, it may find out later that y made a different
decision
Proof constructs a scenario where each attempt to
decide is overruled by a delayed, asynchronous
message
Thus, the system oscillates between 0 and 1 and never converges
Impact of FLP
FLP proves that any fault-tolerant distributed algorithm
attempting to reach consensus has runs that never
terminate
These runs are extremely unlikely (probability zero)
Yet they imply that we can't find a totally correct solution
And so consensus is impossible (in the sense that it is not always possible)
So what can we do?
Use randomization, probabilistic guarantees (gossip protocols)
Avoid consensus, use quorum systems (Paxos or Raft)
In other words, trade off consistency in favor of availability
Consistency vs. Availability
FLP states that perfect consistency is impossible
Practically, we can get close to perfect consistency, but at
significant cost
e.g. using 3PC
Availability begins to suffer dramatically under failure conditions
Is there a way to formalize the tradeoff between
consistency and availability?
Eric Brewer's CAP Theorem
CAP theorem for distributed data replication
Consistency: updates to data are applied to all or none
Availability: must be able to access all data
Network Partition Tolerance: failures can partition network into subtrees
The Brewer Theorem
No system can simultaneously achieve C and A and P
Typical interpretation: C, A, and P: choose 2
In practice, all networks may partition, thus you must choose P
So a better interpretation might be C or A: choose 1
Never formally proved or published
Yet widely accepted as a rule of thumb
CAP Examples
[Diagram: two replicated key-value stores, each holding (key, 1), handle a Write of (key, 2) followed by a Read during a network partition]
A+P (Availability): the write is accepted, but the Replicate message is lost in the partition; the client can always read, so the Read may return the stale (key, 1); impact of partitions: not consistent
C+P (Consistency): reads must always return accurate results, so during the partition the request returns "Error: Service Unavailable"; impact of partitions: no availability
C or A: Choose 1
Taken to the extreme, CAP suggests a binary division in
distributed systems
Your system is consistent or available
In practice, it's more like a spectrum of possibilities
[Diagram: a spectrum ranging from Perfect Consistency to Always Available]
Perfect consistency: e.g. financial information must be correct
In between: attempt to balance correctness with availability
Always available: e.g. serve content to all visitors, regardless of consistency
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Strong Consistency, Revisited
2PC and 3PC achieve strong consistency, but they have
significant shortcomings
2PC cannot make progress in the face of leader + 1 replica
failures
3PC loses consistency guarantees in the face of network
partitions
Where do we go from here?
Observation: 2PC and 3PC attempt to reach 100%
agreement
What if 51% of the replicas agree?
Quorum Systems
In law, a quorum is the minimum number of members of
a deliberative body necessary to conduct the business
of that group
When quorum is not met, a legislative body cannot hold a
vote, and cannot change the status quo
E.g. Imagine if only 1 senator showed up to vote in the Senate
Distributed Systems can also use a quorum approach to
consensus
Essentially a voting approach
If a quorum of replicas agree (out of N), then an update can be committed
Advantages of Quorums
Availability: quorum systems are more resilient in the
face of failures
Quorum systems can be designed to tolerate both benign and
byzantine failures
Efficiency: can significantly reduce communication
complexity
Does not require all servers in order to perform an operation
Only requires a subset of them for each operation
High-Level Quorum Example
[Diagram: three replicas, each storing Bob's balance tagged with a logical timestamp (ts 1: $300, ts 2: $400, ts 3: $375)]
Write: the new value ($375 at ts 3) is committed at a quorum of replicas; a straggler may still hold an older value (ts 1: $300)
Read: the client queries a quorum of replicas and takes the value with the highest timestamp
Challenges:
1. Ensuring that at least a quorum of replicas commit each update
2. Ensuring that updates have the correct logical ordering
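A minimal sketch of timestamp-tagged quorum reads and writes in the spirit of this example; replicas are modeled as in-memory dicts rather than networked servers, and the quorum choice is simulated with random sampling.

```python
# Sketch: write to a (majority) quorum and read from a quorum, taking the
# value with the highest timestamp. Any read quorum overlaps any write
# quorum, so the read always sees the latest committed write.

import random

def quorum(n):
    return n // 2 + 1                        # simple majority

def quorum_write(replicas, key, value, ts):
    # Model only the quorum of replicas that acknowledged this write
    targets = random.sample(replicas, quorum(len(replicas)))
    for rep in targets:
        old_ts, _ = rep.get(key, (0, None))
        if ts > old_ts:                      # never go backwards in logical time
            rep[key] = (ts, value)

def quorum_read(replicas, key):
    targets = random.sample(replicas, quorum(len(replicas)))
    answers = [rep.get(key, (0, None)) for rep in targets]
    return max(answers)[1]                   # highest timestamp wins

replicas = [dict(), dict(), dict()]
quorum_write(replicas, "Bob", 300, ts=1)
quorum_write(replicas, "Bob", 375, ts=3)
print(quorum_read(replicas, "Bob"))          # 375
```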
Paxos
Replication protocol that ensures a global ordering of
updates
All writes into the system are ordered in logical time
Replicas agree on the order of committed writes
Uses a quorum approach to consensus
One replica is always chosen as the leader
The protocol moves forward as long as replicas agree
The Paxos protocol is actually a theoretical proof
We'll be discussing a concrete implementation of the protocol:
Paxos for System Builders, Jonathan Kirsch and Yair Amir. [Link]jak/docs/paxos_for_system_builders.pdf
History of Paxos
Developed by Turing award winner Leslie Lamport
First published as a tech report in 1989
Journal refused to publish it, nobody understood the protocol
Formally published in 1998
Again, nobody understands it
Leslie Lamport publishes Paxos Made Simple in 2001
People start to get the protocol
Reaches widespread fame in 2006-2007
Used by Google in their Chubby distributed mutex system
Zookeeper is the open-source version of Chubby
Paxos at a High-Level
1. Replicas elect a leader and agree on the view number
The view is a logical clock that divides time into epochs
During each view, there is a single leader
2. The leader collects promises from the replicas
Replicas promise to only accept proposals from the current or
future views
Prevents replicas from going back in time
The leader learns about proposed updates from the previous view that haven't yet been accepted
3. The leader proposes updates and replicas accept them
Start by completing unfinished updates from the previous view
Then move on to new writes from clients
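A rough sketch of the leader's side of steps 2 and 3 (view election is assumed to have already produced `view`); the net.send()/net.recv() helpers and message formats are hypothetical, and many details (reconciliation, retransmission, client handling) are compressed away.

```python
# Sketch of one view of Paxos-style leadership: collect promises from a
# quorum, re-propose leftover updates from earlier views, then propose new
# client writes, committing each once a quorum accepts it.

def lead_view(net, replicas, view, clock, client_writes):
    quorum = len(replicas) // 2 + 1

    # Step 2: collect promises; learn about unfinished updates
    for r in replicas:
        net.send(r, {"type": "prepare", "view": view, "clock": clock})
    promises = [net.recv(r, timeout=1.0) for r in replicas]
    promises = [p for p in promises if p and p.get("view") == view]
    if len(promises) < quorum:
        return clock                          # cannot lead without a quorum

    # Any update accepted in an earlier view must be re-proposed first
    leftovers = [u for p in promises for u in p.get("uncommitted", [])]

    # Step 3: propose updates; commit once a quorum accepts
    for update in leftovers + client_writes:
        clock += 1
        for r in replicas:
            net.send(r, {"type": "propose", "view": view,
                         "clock": clock, "update": update})
        accepts = [net.recv(r, timeout=1.0) for r in replicas]
        if sum(1 for a in accepts if a and a.get("type") == "accept") >= quorum:
            for r in replicas:
                net.send(r, {"type": "commit", "view": view, "clock": clock})
    return clock
```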
View Selection
[Diagram: Replicas 0-4 broadcasting their view numbers]
All replicas have a view number
The goal is to have all replicas agree on the view
Broadcast the view to all other replicas
If a replica receives broadcasts with the same view, assume the view is correct
If a replica receives a broadcast with a larger view, adopt it and rebroadcast
The leader is the replica with ID = view % N
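A tiny sketch of the view-adoption rule and the leader = view % N mapping; the function and its parameters are hypothetical.

```python
# Sketch: adopt the largest view heard, rebroadcast it, and derive the leader.

def on_view_broadcast(my_view, received_view, n_replicas, rebroadcast):
    if received_view > my_view:
        my_view = received_view
        rebroadcast(my_view)              # help the other replicas converge upward
    leader_id = my_view % n_replicas      # leader is the replica with ID = view % N
    return my_view, leader_id
```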
Prepare/Promise
[Diagram: Replicas 0-4, all at clock 13]
The leader must ensure that a quorum of replicas exists
The leader sends "prepare view=5 clock=13"
Replicas promise to not accept any messages with view < v and reply "promise view=5 clock=13"
Replicas won't elect a new leader until the current one fails
Commit/Accept
[Diagram: Replicas 0-4 at clock 13; a client sends a write to the leader]
All client requests are serialized through the leader
The new value is proposed with "accept clock=14"; replicas write it to temporary storage
After receiving accept messages, increment the clock and commit the new value ("commit clock=14"); the replicas move to clock 14 and the client gets OK
Paxos Review
Three phases: elect leader, collect promises,
commit/accept
Message complexity: roughly O(n^2 + n)
However, more like O(n^2) in steady state (repeated commit/accept)
Two logical clocks:
1. The view increments each time the leader changes
Replicas promise not to accept updates from prior views
2. The clock increments after each update/write
Maintains the global ordering of updates
Replicas set a timeout every time they hear from the
leader
Increment the view and elect a new leader if the timeout expires
Failure Modes
1. What happens if a commit fails?
2. What happens during a partition?
3. What happens after the leader fails?
Bad Commit
[Diagram: Replicas 0-4 at clock 13; only some replicas accept the update at clock 14]
What happens if a quorum does not accept a commit?
The leader must retry until a quorum is reached, or broadcast an abort
Replicas that fall behind can reconcile by downloading the updates they missed
Partitions (1)
[Diagram: Replicas 0-4 split by a network partition]
What happens during a partition?
The group with a quorum (if one exists) keeps making progress
This may require a new leader election
A group with fewer than a quorum of replicas cannot accept updates
Once the partition is fixed, either:
Hold a new leader election and move forward
Or, reconcile the out-of-date replicas with the up-to-date ones
Partitions (2)
[Diagram: Replicas 0-4 split by a network partition; the side with a quorum elects a new leader]
The group with a quorum (if one exists) keeps making progress; this may require a new leader election
A group with fewer than a quorum of replicas cannot accept updates
What happens when the view = 0 group attempts to rejoin?
Promises for view = 1 prevent the old leader from interfering with the new quorum
Leader Failure (1)
[Diagram: the old leader sends "commit clock=14" to Replica 3 only, then fails; the other replicas remain at clock 13]
What happens if there is an uncommitted update with no quorum?
Increment the view and elect a new leader
The new leader sends "prepare clock=13" and collects "promise clock=13"; it is unaware of the uncommitted update
The new leader announces a new update with clock=14, which Replica 3 rejects
Replica 3 is desynchronized and must reconcile with another replica
Leader Failure (2)
[Diagram: the old leader sends "commit clock=14" to one replica, then fails; the others remain at clock 13]
What happens if there is an uncommitted update with no quorum?
Increment the view and elect a new leader
The new leader sends "prepare clock=13" and collects "promise clock=13"; this time it is aware of the uncommitted update
The new leader must recommit the original clock=14 update
Leader Failure (3)
[Diagram: the old leader sends "commit clock=14" to a quorum of replicas, then fails]
What happens if there is an uncommitted update with a quorum?
Increment the view and elect a new leader; send prepares, collect promises
Recall that the leader must collect promises from a quorum, and any two quorums intersect
So by definition, the new leader must become aware of the uncommitted update
The leader must recommit the original clock=14 update
The Devil in the Details
Clearly, Paxos is complicated
Things we haven't covered:
Reconciliation: how to bring a replica up-to-date
Managing the queue of updates from clients
Updates may be sent to any replica
Replicas are responsible for responding to clients who contact them
Replicas may need to re-forward updates if the leader changes
Garbage collection
Replicas need to remember the exact history of updates, in case the
leader changes
Periodically, the lists need to be garbage collected
Odds and Ends
Byzantine Generals
Gossip
Byzantine Generals Problem
Name coined by Leslie Lamport
Several Byzantine Generals are
laying siege to an enemy city
They have to agree on a
common strategy: attack or
retreat
They can only communicate by
messenger
Some generals may be traitors
(their identity is unknown)
Do you see the connection with the consensus problem?
[Diagram: mapping the Byzantine generals setting onto a distributed system]
Goals
1. All loyal lieutenants obey the same order
2. If the commanding general is loyal, then every loyal
lieutenant obeys the order he sends
Can the problem be solved?
Yes, iff there are at least 3m+1 generals in the presence of m traitors
E.g. if there are 3 generals, even 1 traitor makes the problem unsolvable
Bazillion variations on the basic problem
What if messages are cryptographically signed (i.e. they are unforgeable)?
What if communication is not g x g (i.e. some pairs of generals cannot communicate)?
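A small worked example of the 3m+1 bound quoted above.

```python
# From the 3m+1 bound: with n generals, at most floor((n-1)/3) traitors
# can be tolerated.

def max_traitors_tolerated(n_generals: int) -> int:
    return (n_generals - 1) // 3

print(max_traitors_tolerated(3))   # 0: with 3 generals, even 1 traitor is fatal
print(max_traitors_tolerated(4))   # 1: 4 = 3*1 + 1 generals tolerate 1 traitor
```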
Alternatives to Quorums
Quorums favor consistency over availability
If no quorum exists, then the system stops accepting writes
Significant overhead maintaining consistent replicated state
What if eventual consistency is okay?
Favor availability over consistency
Results may be stale or incorrect sometimes (hopefully only in
rare cases)
Gossip protocols
Replicas periodically, randomly exchange state with each
other
No strong consistency guarantees but
Surprisingly fast and reliable convergence to up-to-date state
Requires vector clocks or better in order to causally order
events
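A sketch of one push-pull gossip round over timestamped key-value state; the names and the convergence check are illustrative, and convergence is probabilistic rather than guaranteed.

```python
# Anti-entropy gossip sketch: each replica holds key -> (timestamp, value)
# and periodically merges state with one random peer. This gives eventual,
# not strong, consistency.

import random

def merge(a, b):
    """Take the newer (higher-timestamp) entry for every key in b."""
    for key, (ts, value) in b.items():
        if key not in a or a[key][0] < ts:
            a[key] = (ts, value)

def gossip_round(replicas):
    for state in replicas:
        peer = random.choice([s for s in replicas if s is not state])
        merge(state, peer)      # pull the peer's newer entries
        merge(peer, state)      # and push ours back (push-pull gossip)

replicas = [dict() for _ in range(5)]
replicas[0]["Bob"] = (1, 300)          # an update lands at one replica
for _ in range(4):                     # a few rounds spread it around
    gossip_round(replicas)
print(all(state.get("Bob") == (1, 300) for state in replicas))  # very likely True
```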
Sources
1. Some slides courtesy of Cristina Nita-Rotaru ([Link])
2. The Part-Time Parliament, Leslie Lamport. http://[Link]/en-us/um/people/lamport/pubs/[Link]
3. Paxos Made Simple, Leslie Lamport. http://[Link]/en-us/um/people/lamport/pubs/[Link]
4. Paxos for System Builders, Jonathan Kirsch and Yair Amir. [Link]jak/docs/paxos_for_system_builders.pdf
5. The Chubby Lock Service for Loosely-Coupled Distributed Systems, Mike Burrows. http://[Link]/archive/[Link]
6. Paxos Made Live - An Engineering Perspective, Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone. [Link]