CS 3700
Networks and Distributed Systems
Distributed Consensus and Fault Tolerance
(or, why can't we all just get along?)
Black Box Online Services
Black Box Service
Storing and retrieving data from online services is commonplace
We tend to treat these services as black boxes
Data goes in, we assume outputs are correct
We have no idea how the service is implemented
Black Box Online Services
debit_transaction(-$75)
OK
get_recent_transactions()
[, -$75, ]
Black Box Online Services
add_item_to_cart(Cheerios)
OK
get_cart()
[Lucky Charms, Cheerios]
Black Box Online Services
post_update(I LOLed)
OK
get_newsfeed()
[, {txt: I LOLed, likes: 87}]
Peeling Back the Curtain
Black Box Service
How are large services implemented?
Different types of services may have different requirements
Leads to different design decisions
Centralization
[Diagram: Bob sends debit_transaction(-$75) to a single server and gets OK; get_account_balance() returns $225; the server's record for Bob goes from $300 to $225]
Advantages of centralization
Easy to set up and deploy
Consistency is guaranteed (assuming correct software implementation)
Shortcomings
No load balancing
Single point of failure
Sharding
[Diagram: accounts are split across two shards, <A-M> and <N-Z>; Bob's debit_account(-$75) and get_account_balance() both go to the <A-M> shard, which replies OK and $225 as its record for Bob goes from $300 to $225]
Advantages of sharding
Better load balancing
If done intelligently, may allow incremental scalability
Shortcomings
Failures are still devastating
Replication
[Diagram: Bob's debit_account(-$75) is applied at three <A-M> replicas, which reach 100% agreement before replying OK; each replica's record for Bob goes from $300 to $225, and get_account_balance() returns $225]
Advantages of replication
Better load balancing of reads (potentially)
Resilience against failure; high availability (with some caveats)
Shortcomings
How do we maintain consistency?
Consistency Failures
[Diagram: replication failure scenarios, where a lost request, a missing ACK, a timeout, or too few reachable replicas leave the replicas disagreeing on Bob's balance ($300 vs. $225), with no agreement reached]
Asynchronous networks are problematic
The leader cannot disambiguate cases where requests and responses are lost
Byzantine Failures
[Diagram: one replica reports Bob's balance as $1000 instead of $300, so no agreement is reached]
In some cases, replicas may be buggy or malicious
When discussing Distributed Systems, failures due to
malice are known as Byzantine Failures
Name comes from the Byzantine generals problem
More on this later
Problem and Definitions
Build a distributed system that meets the following
goals:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures
Challenges:
Many, many different failure modes
Theory tells us that these goals are impossible to achieve
(more on this later)
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Forcing Consistency
[Diagram: debit_account(-$75) is applied at every replica ($300 to $225) and returns OK; a later debit_account(-$50) cannot be applied at every replica ($225 vs. $175), so the client gets an Error instead of an inconsistent result]
One approach to building distributed systems is to force them to be consistent
Guarantee that all replicas receive an update
Or none of them do
If consistency is guaranteed, then reaching consensus is trivial
Distributed Commit Problem
Application that performs operations on multiple replicas
or databases
We want to guarantee that all replicas get updated, or none do
Distributed commit problem:
1. Operation is committed when all participants can perform the
action
2. Once a commit decision is reached, all participants must
perform the action
These two steps give rise to the Two Phase Commit protocol
Motivating Transactions
[Diagram: transfer_money(Alice, Bob, $100) issues debit_account(Alice, -$100) and debit_account(Bob, $100); each call may independently return OK or an Error (Alice: $600 to $500, Bob: $300 to $400)]
System becomes inconsistent if any individual action
fails
Simple Transactions
transfer_money(Alice, Bob, $100)
begin_transaction()
debit_account(Alice, -$100)
debit_account(Bob, $100)
At this point, if there haven't been any errors, we say the transaction is committed
end_transaction()
[Diagram: once the transaction commits, Alice goes from $600 to $500, Bob goes from $300 to $400, and the client gets OK]
Actions inside a transaction behave as a single action
Simple Transactions
transfer_money(Alice, Bob, $100)
begin_transaction()
debit_account(Alice, -$100)
debit_account(Bob, $100)
Error
[Diagram: the debit to Bob fails, so the transaction aborts; Alice's partial debit ($600 to $500) is rolled back, leaving Alice at $600 and Bob at $300]
If any individual action fails, the whole transaction fails
Failed transactions have no side effects
Incomplete results during transactions are hidden
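As a rough illustration of this all-or-nothing behavior, here is a minimal Python sketch; AccountStore, transfer_money, and the exception type are hypothetical names, not an API from the slides.

```python
# Minimal sketch (hypothetical names): an in-memory account store with
# all-or-nothing transactions, mirroring the behavior described above.

class TransactionAborted(Exception):
    pass

class AccountStore:
    def __init__(self, balances):
        self.balances = dict(balances)      # committed state

    def transaction(self):
        return _Transaction(self)

class _Transaction:
    def __init__(self, store):
        self.store = store

    def __enter__(self):
        # Work on a private copy; partial results stay hidden (Isolation)
        self.pending = dict(self.store.balances)
        return self

    def debit_account(self, name, amount):
        if self.pending[name] + amount < 0:
            raise TransactionAborted(f"insufficient funds for {name}")
        self.pending[name] += amount

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            # Commit: all changes become visible at once (Atomicity)
            self.store.balances = self.pending
        # On any error: discard pending changes, leaving no side effects
        return False                         # re-raise the exception, if any

def transfer_money(store, src, dst, amount):
    with store.transaction() as tx:
        tx.debit_account(src, -amount)
        tx.debit_account(dst, amount)

store = AccountStore({"Alice": 600, "Bob": 300})
transfer_money(store, "Alice", "Bob", 100)
print(store.balances)                        # {'Alice': 500, 'Bob': 400}
```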
ACID Properties
Traditional transactional databases support the
following:
1. Atomicity: all or none; if transaction fails then no changes
are applied to the database
2. Consistency: there are no violations of database integrity
3. Isolation: partial results from incomplete transactions are
hidden
4. Durability: the effects of committed transactions are
permanent
Two Phase Commits (2PC)
There are well-known techniques for implementing transactions in centralized databases
E.g. journaling (append-only logs)
Out of scope for this class (take a database class, or CS 5600)
Two Phase Commit (2PC) is a protocol for implementing
transactions in a distributed setting
Protocol operates in rounds
Assume we have a leader or coordinator that manages transactions
Each replica states that it is ready to commit
Leader decides the outcome and instructs replicas to commit
or abort
Assume no byzantine faults (i.e. nobody is malicious)
2PC Example
[Diagram: Leader, Replica 1, Replica 2, Replica 3; each replica's stored value goes from x to y]
Begin by distributing the update: "txid = 678; value = y" (the txid is a logical clock)
Wait to receive "ready txid = 678" (ready to commit) from all replicas
Tell replicas to commit: "commit txid = 678"
Replicas reply "committed txid = 678"
At this point, all replicas are guaranteed to be up-to-date
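A minimal sketch of the coordinator side of this message flow, in Python; the leader.send()/leader.recv() helpers and the message formats are hypothetical stand-ins for the network, not a production protocol.

```python
# Sketch of the two-phase commit flow from the example above.
# send() and recv() are hypothetical networking helpers.

def two_phase_commit(leader, replicas, txid, value):
    # Phase 1: distribute the update and collect "ready" votes
    for r in replicas:
        leader.send(r, {"type": "write", "txid": txid, "value": value})

    ready = []
    for r in replicas:
        reply = leader.recv(r, timeout=1.0)      # None on timeout or error
        if reply and reply["type"] == "ready" and reply["txid"] == txid:
            ready.append(r)

    # Phase 2: commit only if *every* replica is ready, otherwise abort
    decision = "commit" if len(ready) == len(replicas) else "abort"
    for r in replicas:
        leader.send(r, {"type": decision, "txid": txid})

    # Wait for acknowledgements; a real leader would persist its decision
    # and keep retrying until every replica confirms
    for r in replicas:
        leader.recv(r, timeout=1.0)
    return decision
```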
Failure Modes
Replica Failure
Before or during the initial promise phase
Before or during the commit
Leader Failure
Before receiving all promises
Before or during sending commits
Before receiving all committed messages
Replica Failure (1)
[Diagram: Leader, Replica 1, Replica 2, Replica 3; Replica 3 fails]
The leader distributes "txid = 678; value = y", but only some replicas reply "ready txid = 678"
Error: not all replicas are ready; the same thing happens if a write or a ready is dropped, a replica times out, or a replica returns an error
The leader sends "abort txid = 678" and the replicas reply "aborted txid = 678"
Replica Failure (2)
[Diagram: all replicas reply "ready txid = 678", but Replica 3 fails before acknowledging the commit]
The leader sends "commit txid = 678"; Replicas 1 and 2 reply "committed txid = 678"
This is a known inconsistent state: the leader must keep retrying "commit txid = 678" until all commits succeed
Replica Failure (2)
[Diagram: Replica 3 reboots and sends "stat txid = 678" to the leader]
Replicas attempt to resume unfinished transactions when they reboot
The leader resends "commit txid = 678" and Replica 3 replies "committed txid = 678"
Finally, the system is consistent and may proceed
Leader Failure
What happens if the leader crashes?
Leader must constantly be writing its state to permanent
storage
It must pick up where it left off once it reboots
If there are unconfirmed transactions
Send new write messages, wait for ready to commit replies
If there are uncommitted transactions
Send new commit messages, wait for committed replies
Replicas may see duplicate messages during this
process
Thus, it's important that every transaction have a unique txid
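A sketch of how a replica might use the txid to deduplicate retried commits from a recovering leader; the Replica class and its fields are hypothetical.

```python
# Replica-side deduplication sketch: because a recovering leader may resend
# "commit", applying a commit must be idempotent, keyed by txid.

class Replica:
    def __init__(self):
        self.pending = {}      # txid -> value, written but not yet committed
        self.committed = {}    # txid -> value, already applied
        self.state = None

    def handle_commit(self, txid):
        if txid in self.committed:
            # Duplicate from a recovering leader: just re-acknowledge
            return {"type": "committed", "txid": txid}
        value = self.pending.pop(txid)
        self.state = value
        self.committed[txid] = value   # remembered until garbage collected
        return {"type": "committed", "txid": txid}
```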
Allowing Progress
Key problem: what if the leader crashes and never
recovers?
By default, replicas block until contacted by the leader
Can the system make progress?
Yes, under limited circumstances
After sending a ready to commit message, each replica
starts a timer
The first replica whose timer expires elects itself as the new
leader
Query the other replicas for their status
Send commits to all replicas if they are all ready
However, this only works if all the replicas are alive and
reachable
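A sketch of this limited recovery path, reusing the hypothetical send()/recv() helpers from the 2PC sketch above; a replica whose timer expires after sending "ready" queries the others and commits only if everyone is reachable and ready.

```python
# Sketch of replica-driven recovery after a leader failure (hypothetical
# helpers): commit only if every other replica answers "ready".

def try_recover(me, others, txid):
    statuses = []
    for r in others:
        me.send(r, {"type": "stat", "txid": txid})
        statuses.append(me.recv(r, timeout=1.0))       # None if unreachable
    if all(s and s["type"] == "ready" and s["txid"] == txid for s in statuses):
        for r in others:
            me.send(r, {"type": "commit", "txid": txid})
        return "commit"
    # If any replica is unreachable, it may already have committed or
    # aborted with the old leader, so the new leader must block
    return "blocked"
```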
New Leader
[Diagram: the Leader fails after collecting "ready txid = 678" from Replicas 1, 2, and 3]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678" and the other replicas reply "ready txid = 678"
Replica 2 sends "commit txid = 678" and the others reply "committed txid = 678"
The system is consistent again
Deadlock
[Diagram: the Leader and Replica 3 fail after the ready phase]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678"; Replica 1 replies "ready txid = 678", but Replica 3 never responds
Replica 2 cannot proceed, but it cannot abort either
Garbage Collection
2PC is somewhat of a misnomer: there is actually a third
phase
Garbage collection
Replicas must retain records of past transactions, just in
case the leader fails
For example, suppose the leader crashes, reboots, and attempts to commit a transaction that has already been committed
Replicas must remember that this past transaction was
already committed, since committing a second time may lead
to inconsistencies
In practice, leader periodically tells replicas to garbage
collect
Transactions <= some txid in the past may be deleted
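A minimal sketch of this garbage collection step, assuming a hypothetical per-replica transaction log keyed by txid and a leader-announced watermark.

```python
# Sketch: drop remembered transactions at or below the leader's watermark.

def garbage_collect(transaction_log, watermark_txid):
    """Delete records of transactions with txid <= watermark_txid."""
    for txid in [t for t in transaction_log if t <= watermark_txid]:
        del transaction_log[txid]

log = {676: "committed", 677: "committed", 678: "committed"}
garbage_collect(log, watermark_txid=677)
print(log)   # {678: 'committed'}
```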
2PC Summary
Message complexity: O(2n)
The good: guarantees consistency
The bad:
Write performance suffers if there are failures during the
commit phase
Does not scale gracefully (possible, but difficult to do)
A pure 2PC system blocks all writes if the leader fails
Smarter 2PC systems still block all writes if the leader + 1 replica fail
2PC sacrifices availability in favor of consistency
Can 2PC be Fixed?
The issue with 2PC is its reliance on the centralized leader
Only the leader knows if a transaction is 100% ready to
commit or not
Thus, if the leader + 1 replica fail, recovery is impossible
Potential solution: Three Phase Commit
Add an additional round of communication
Tell all replicas to prepare to commit, before actually committing
State of the system can always be deduced by a subset
of alive replicas that can communicate with each other
unless there are partitions (more on this later)
3PC Example
[Diagram: Leader, Replica 1, Replica 2, Replica 3; each replica's stored value goes from x to y]
Begin by distributing the update: "txid = 678; value = y"
Wait to receive "ready txid = 678" (ready to commit) from all replicas
Tell all replicas that everyone is ready to commit: "prepare txid = 678"; replicas reply "prepared txid = 678"
Tell replicas to commit: "commit txid = 678"; replicas reply "committed txid = 678"
At this point, all replicas are guaranteed to be up-to-date
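A sketch of how the extra prepare round changes the coordinator loop relative to the 2PC sketch above; the helper names and message formats remain hypothetical.

```python
# Sketch of the extra "prepare" round that 3PC adds on top of 2PC,
# reusing the hypothetical leader.send()/recv() helpers from earlier.

def three_phase_commit(leader, replicas, txid, value):
    write = {"type": "write", "txid": txid, "value": value}
    # Phase 1: distribute the update, collect "ready" from every replica
    if not broadcast_and_wait(leader, replicas, write, "ready"):
        broadcast_and_wait(leader, replicas, {"type": "abort", "txid": txid}, "aborted")
        return "abort"
    # Phase 2: tell everyone that everyone is ready ("prepare")
    if not broadcast_and_wait(leader, replicas, {"type": "prepare", "txid": txid}, "prepared"):
        broadcast_and_wait(leader, replicas, {"type": "abort", "txid": txid}, "aborted")
        return "abort"
    # Phase 3: commit; any surviving replica now knows all were prepared
    broadcast_and_wait(leader, replicas, {"type": "commit", "txid": txid}, "committed")
    return "commit"

def broadcast_and_wait(leader, replicas, msg, expected):
    """Send msg to every replica; True iff all reply with the expected type."""
    for r in replicas:
        leader.send(r, msg)
    replies = [leader.recv(r, timeout=1.0) for r in replicas]
    return all(rep and rep["type"] == expected for rep in replies)
```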
Leader Failures
[Diagram: the Leader distributes "txid = 678; value = y", collects "ready txid = 678" from all replicas, and then fails]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678" and the other replicas reply "ready txid = 678"
Since no replica has been told to prepare, Replica 3 cannot be in the committed state, so it is okay to abort
Replica 2 sends "abort txid = 678" and the others reply "aborted txid = 678"
The system is consistent again
Leader Failures
[Diagram: the Leader sends "prepare txid = 678", collects "prepared txid = 678" from all replicas, and then fails]
Replica 2's timeout expires and it begins the recovery procedure
Replica 2 sends "stat txid = 678" and the other replicas reply "prepared txid = 678"
All replicas must have been ready to commit, so Replica 2 sends "commit txid = 678" and the others reply "committed txid = 678"
The system is consistent again
Oh Great, I Fixed Everything!
Wrong
3PC is not robust against network partitions
What is a network partition?
A split in the network, such that full n-to-n connectivity is
broken
i.e. not all servers can contact each other
Partitions split the network into one or more disjoint
subnetworks
How can a network partition occur?
A switch or a router may fail, or it may receive an incorrect
routing rule
A cable connecting two racks of servers may develop a fault
Network partitions are very real; they happen all the time
Partitioning
[Diagram: after the update "txid = 678; value = y" is distributed and "ready txid = 678" is collected, the network partitions into two subnets]
The leader assumes Replicas 2 and 3 have failed and moves on: it sends "prepare txid = 678", receives "prepared txid = 678" from Replica 1, and commits ("commit txid = 678" / "committed txid = 678")
Meanwhile, leader recovery is initiated in the other subnet, which aborts the transaction
The system is inconsistent
3PC Summary
Adds an additional phase vs. 2PC
Message complexity: O(3n)
Really four phases with garbage collection
The good: allows the system to make progress under
more failure conditions
The bad:
Extra round of communication makes 3PC even slower than
2PC
Does not work if the network partitions
2PC will simply deadlock if there is a partition, rather than become
inconsistent
In practice, nobody uses 3PC
The additional complexity and performance penalty just isn't worth it
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
A Moment of Reflection
Goals, revisited:
The system should be able to reach consensus
Consensus [n]: general agreement
The system should be consistent
Data should be correct; no integrity violations
The system should be highly available
Data should be accessible even in the face of arbitrary failures
Achieving these goals may be harder than we thought :(
Huge number of failure modes
Network partitions are difficult to cope with
We haven't even considered byzantine failures
What Can Theory Tell Us?
Let's assume the network is synchronous and reliable
The algorithm can be divided into discrete rounds
If a message from r is not received in a round, then r must be faulty
Since we're assuming synchrony, packets cannot be delayed arbitrarily
During each round, r may send m <= n messages
n is the total number of replicas
A replica might crash before sending all n messages
If we are willing to tolerate f total failures (f < n), how
many rounds of communication do we need to
guarantee consensus?
Consensus in a Synchronous System
Initialization:
All replicas choose a value 0 or 1 (can generalize to more
values if you want)
Properties:
Agreement: all non-faulty processes ultimately choose the
same value
Either 0 or 1 in this case
Validity: if a replica decides on a value, then at least one
replica must have started with that value
This prevents the trivial solution of all replicas always choosing 0,
which is technically perfect consensus but is practically useless
Termination: the system must converge in finite time
Algorithm Sketch
Each replica maintains a map M of all known values
Initially, the map only contains the replica's own value
e.g. M = {replica1: 0}
Each round, broadcast M to all other replicas
On receipt, construct the union of the received M and the local M
Algorithm terminates when all non-faulty replicas have the values from all other non-faulty replicas
Example with three non-faulty replicas (1, 3, and 5)
M = {replica1: 0, replica3: 1, replica5: 0}
The final value is the minimum of the values in M
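A toy simulation of this flooding algorithm under the synchronous, reliable-network assumptions; the function name and the crash model are illustrative only.

```python
# Toy simulation of the flooding consensus sketched above: run f + 1 rounds,
# union the maps, then decide min() of the known values.
# Simplification: a replica that crashes in a round still gets its broadcast
# out first; the true worst case allows partial sends within a round.

def synchronous_consensus(initial_values, f, crash_in_round=None):
    crash_in_round = crash_in_round or {}            # replica -> crash round
    known = {r: {r: v} for r, v in initial_values.items()}
    alive = set(initial_values)

    for rnd in range(1, f + 2):                      # f + 1 rounds
        messages = {r: dict(known[r]) for r in alive}     # this round's broadcasts
        alive -= {r for r, c in crash_in_round.items() if c == rnd}
        for receiver in alive:
            for M in messages.values():              # synchronous: all delivered
                known[receiver].update(M)

    # Every surviving replica ends with the same map, so the same decision
    return {r: min(known[r].values()) for r in alive}

values = {"r1": 0, "r2": 1, "r3": 0, "r4": 1, "r5": 0}
print(synchronous_consensus(values, f=2, crash_in_round={"r2": 1, "r4": 2}))
# -> {'r1': 0, 'r3': 0, 'r5': 0}
```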
Bounding Convergence Time
How many rounds will it take if we are willing to tolerate f
failures?
f + 1 rounds
Key insight: all replicas must be sure that all replicas that
did not crash have the same information (so they can
make the same decision)
Proof sketch, assuming f = 2
Worst case scenario is that replicas crash during rounds 1 and 2
During round 1, replica x crashes
All other replicas don't know if x is alive or dead
During round 2, replica y crashes
Clear that x is not alive, but unknown if y is alive or dead
During round 3, no more replicas may crash
All replicas are guaranteed to receive updated info from all other replicas
A More Realistic Model
The previous result is interesting, but unrealistic
We assume that the network is synchronous and reliable
Of course, neither of these things is true in reality
What if the network is asynchronous and reliable?
Replicas may take an arbitrarily long time to respond to
messages
Let's also assume that all faults are crash faults
i.e. if a replica has a problem it crashes and never wakes up
No byzantine faults
The FLP Result
There is no asynchronous algorithm that achieves
consensus on a 1-bit value in the presence of crash
faults. The result is true even if no crash actually occurs!
This is known as the FLP result
Michael J. Fischer, Nancy A. Lynch, and Michael S. Paterson,
1985
Extremely powerful result because:
If you can't agree on 1 bit, generalizing to larger values isn't going to help you
If you can't converge with crash faults, no way you can converge with byzantine faults
If you can't converge on a reliable network, no way you can on an unreliable one
FLP Proof Sketch
In an asynchronous system, a replica x cannot tell
whether a non-responsive replica y has crashed or it is
just slow
What can x do?
If it waits, it will block since it might never receive the
message from y
If it decides, it may find out later that y made a different
decision
Proof constructs a scenario where each attempt to
decide is overruled by a delayed, asynchronous
message
Thus, the system oscillates between 0 and 1 and never converges
Impact of FLP
FLP proves that any fault-tolerant distributed algorithm
attempting to reach consensus has runs that never
terminate
These runs are extremely unlikely (probability zero)
Yet they imply that we can't find a totally correct solution
And so consensus is impossible (in the sense that it is not always possible)
So what can we do?
Use randomization, probabilistic guarantees (gossip protocols)
Avoid consensus, use quorum systems (Paxos or Raft)
In other words, trade off consistency in favor of availability
Consistency vs. Availability
FLP states that perfect consistency is impossible
Practically, we can get close to perfect consistency, but at
significant cost
e.g. using 3PC
Availability begins to suffer dramatically under failure conditions
Is there a way to formalize the tradeoff between
consistency and availability?
Eric Brewer's CAP Theorem
CAP theorem for distributed data replication
Consistency: updates to data are applied to all or none
Availability: must be able to access all data
Network Partition Tolerance: failures can partition network into subtrees
The Brewer Theorem
No system can simultaneously achieve C and A and P
Typical interpretation: C, A, and P: choose 2
In practice, all networks may partition, thus you must choose P
So a better interpretation might be C or A: choose 1
Never formally proved or published
Yet widely accepted as a rule of thumb
CAP Examples
[Diagram: two replicated key-value stores, each holding (key, 1), handle a Write of (key, 2) followed by a Read during a network partition]
A+P (Availability): the write is accepted, but the Replicate message is lost in the partition; the client can always read, so the Read may return the stale (key, 1); impact of partitions: not consistent
C+P (Consistency): reads must always return accurate results, so during the partition the request returns "Error: Service Unavailable"; impact of partitions: no availability
C or A: Choose 1
Taken to the extreme, CAP suggests a binary division in
distributed systems
Your system is consistent or available
In practice, it's more like a spectrum of possibilities
[Diagram: a spectrum ranging from Perfect Consistency to Always Available]
Perfect consistency: e.g. financial information must be correct
In between: attempt to balance correctness with availability
Always available: e.g. serve content to all visitors, regardless of consistency
Distributed Commits (2PC and 3PC)
Theory (FLP and CAP)
Quorums (Paxos)
Strong Consistency, Revisited
2PC and 3PC achieve strong consistency, but they have
significant shortcomings
2PC cannot make progress in the face of leader + 1 replica
failures
3PC loses consistency guarantees in the face of network
partitions
Where do we go from here?
Observation: 2PC and 3PC attempt to reach 100%
agreement
What if 51% of the replicas agree?
Quorum Systems
In law, a quorum is the minimum number of members of
a deliberative body necessary to conduct the business
of that group
When quorum is not met, a legislative body cannot hold a
vote, and cannot change the status quo
E.g. Imagine if only 1 senator showed up to vote in the Senate
Distributed Systems can also use a quorum approach to
consensus
Essentially a voting approach
If a quorum of replicas agree (out of N), then an update can be committed
Advantages of Quorums
Availability: quorum systems are more resilient in the
face of failures
Quorum systems can be designed to tolerate both benign and
byzantine failures
Efficiency: can significantly reduce communication
complexity
Does not require all servers in order to perform an operation
Only requires a subset of them for each operation
High-Level Quorum Example
[Diagram: three replicas, each storing Bob's balance tagged with a logical timestamp (ts 1: $300, ts 2: $400, ts 3: $375)]
Write: the new value ($375 at ts 3) is committed at a quorum of replicas; a straggler may still hold an older value (ts 1: $300)
Read: the client queries a quorum of replicas and takes the value with the highest timestamp
Challenges:
1. Ensuring that at least a quorum of replicas commit each update
2. Ensuring that updates have the correct logical ordering
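A minimal sketch of timestamp-tagged quorum reads and writes in the spirit of this example; replicas are modeled as in-memory dicts rather than networked servers, and the quorum choice is simulated with random sampling.

```python
# Sketch: write to a (majority) quorum and read from a quorum, taking the
# value with the highest timestamp. Any read quorum overlaps any write
# quorum, so the read always sees the latest committed write.

import random

def quorum(n):
    return n // 2 + 1                        # simple majority

def quorum_write(replicas, key, value, ts):
    # Model only the quorum of replicas that acknowledged this write
    targets = random.sample(replicas, quorum(len(replicas)))
    for rep in targets:
        old_ts, _ = rep.get(key, (0, None))
        if ts > old_ts:                      # never go backwards in logical time
            rep[key] = (ts, value)

def quorum_read(replicas, key):
    targets = random.sample(replicas, quorum(len(replicas)))
    answers = [rep.get(key, (0, None)) for rep in targets]
    return max(answers)[1]                   # highest timestamp wins

replicas = [dict(), dict(), dict()]
quorum_write(replicas, "Bob", 300, ts=1)
quorum_write(replicas, "Bob", 375, ts=3)
print(quorum_read(replicas, "Bob"))          # 375
```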
Paxos
Replication protocol that ensures a global ordering of
updates
All writes into the system are ordered in logical time
Replicas agree on the order of committed writes
Uses a quorum approach to consensus
One replica is always chosen as the leader
The protocol moves forward as long as replicas agree
The Paxos protocol is actually a theoretical proof
We'll be discussing a concrete implementation of the protocol:
Paxos for System Builders, Jonathan Kirsch and Yair Amir. [Link]jak/docs/paxos_for_system_builders.pdf
History of Paxos
Developed by Turing award winner Leslie Lamport
First published as a tech report in 1989
Journal refused to publish it, nobody understood the protocol
Formally published in 1998
Again, nobody understands it
Leslie Lamport publishes Paxos Made Simple in 2001
People start to get the protocol
Reaches widespread fame in 2006-2007
Used by Google in their Chubby distributed mutex system
Zookeeper is the open-source version of Chubby
Paxos at a High-Level
1. Replicas elect a leader and agree on the view number
The view is a logical clock that divides time into epochs
During each view, there is a single leader
2. The leader collects promises from the replicas
Replicas promise to only accept proposals from the current or
future views
Prevents replicas from going back in time
The leader learns about proposed updates from the previous view that haven't yet been accepted
3. The leader proposes updates and replicas accept them
Start by completing unfinished updates from the previous view
Then move on to new writes from clients
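A rough sketch of the leader's side of steps 2 and 3 (view election is assumed to have already produced `view`); the net.send()/net.recv() helpers and message formats are hypothetical, and many details (reconciliation, retransmission, client handling) are compressed away.

```python
# Sketch of one view of Paxos-style leadership: collect promises from a
# quorum, re-propose leftover updates from earlier views, then propose new
# client writes, committing each once a quorum accepts it.

def lead_view(net, replicas, view, clock, client_writes):
    quorum = len(replicas) // 2 + 1

    # Step 2: collect promises; learn about unfinished updates
    for r in replicas:
        net.send(r, {"type": "prepare", "view": view, "clock": clock})
    promises = [net.recv(r, timeout=1.0) for r in replicas]
    promises = [p for p in promises if p and p.get("view") == view]
    if len(promises) < quorum:
        return clock                          # cannot lead without a quorum

    # Any update accepted in an earlier view must be re-proposed first
    leftovers = [u for p in promises for u in p.get("uncommitted", [])]

    # Step 3: propose updates; commit once a quorum accepts
    for update in leftovers + client_writes:
        clock += 1
        for r in replicas:
            net.send(r, {"type": "propose", "view": view,
                         "clock": clock, "update": update})
        accepts = [net.recv(r, timeout=1.0) for r in replicas]
        if sum(1 for a in accepts if a and a.get("type") == "accept") >= quorum:
            for r in replicas:
                net.send(r, {"type": "commit", "view": view, "clock": clock})
    return clock
```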
View Selection
[Diagram: Replicas 0-4 broadcasting their view numbers]
All replicas have a view number
The goal is to have all replicas agree on the view
Broadcast the view to all other replicas
If a replica receives broadcasts with the same view, assume the view is correct
If a replica receives a broadcast with a larger view, adopt it and rebroadcast
The leader is the replica with ID = view % N
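A tiny sketch of the view-adoption rule and the leader = view % N mapping; the function and its parameters are hypothetical.

```python
# Sketch: adopt the largest view heard, rebroadcast it, and derive the leader.

def on_view_broadcast(my_view, received_view, n_replicas, rebroadcast):
    if received_view > my_view:
        my_view = received_view
        rebroadcast(my_view)              # help the other replicas converge upward
    leader_id = my_view % n_replicas      # leader is the replica with ID = view % N
    return my_view, leader_id
```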
Prepare/Promise
[Diagram: Replicas 0-4, all at clock 13]
The leader must ensure that a quorum of replicas exists
The leader sends "prepare view=5 clock=13"
Replicas promise to not accept any messages with view < v and reply "promise view=5 clock=13"
Replicas won't elect a new leader until the current one fails
Commit/Accept
[Diagram: Replicas 0-4 at clock 13; a client sends a write to the leader]
All client requests are serialized through the leader
The new value is proposed with "accept clock=14"; replicas write it to temporary storage
After receiving accept messages, increment the clock and commit the new value ("commit clock=14"); the replicas move to clock 14 and the client gets OK
Paxos Review
Three phases: elect leader, collect promises,
commit/accept
Message complexity: roughly O(n^2 + n)
However, more like O(n^2) in steady state (repeated commit/accept)
Two logical clocks:
1. The view increments each time the leader changes
Replicas promise not to accept updates from prior views
2. The clock increments after each update/write
Maintains the global ordering of updates
Replicas set a timeout every time they hear from the
leader
Increment the view and elect a new leader if the timeout expires
Failure Modes
1. What happens if a commit fails?
2. What happens during a partition?
3. What happens after the leader fails?
Bad Commit
[Diagram: Replicas 0-4 at clock 13; only some replicas accept the update at clock 14]
What happens if a quorum does not accept a commit?
The leader must retry until a quorum is reached, or broadcast an abort
Replicas that fall behind can reconcile by downloading the updates they missed
Partitions (1)
[Diagram: Replicas 0-4 split by a network partition]
What happens during a partition?
The group with a quorum (if one exists) keeps making progress
This may require a new leader election
A group with fewer than a quorum of replicas cannot accept updates
Once the partition is fixed, either:
Hold a new leader election and move forward
Or, reconcile the out-of-date replicas with the up-to-date ones
Partitions (2)
[Diagram: Replicas 0-4 split by a network partition; the side with a quorum elects a new leader]
The group with a quorum (if one exists) keeps making progress; this may require a new leader election
A group with fewer than a quorum of replicas cannot accept updates
What happens when the view = 0 group attempts to rejoin?
Promises for view = 1 prevent the old leader from interfering with the new quorum
Leader Failure (1)
[Diagram: the old leader sends "commit clock=14" to Replica 3 only, then fails; the other replicas remain at clock 13]
What happens if there is an uncommitted update with no quorum?
Increment the view and elect a new leader
The new leader sends "prepare clock=13" and collects "promise clock=13"; it is unaware of the uncommitted update
The new leader announces a new update with clock=14, which Replica 3 rejects
Replica 3 is desynchronized and must reconcile with another replica
Leader Failure (2)
[Diagram: the old leader sends "commit clock=14" to one replica, then fails; the others remain at clock 13]
What happens if there is an uncommitted update with no quorum?
Increment the view and elect a new leader
The new leader sends "prepare clock=13" and collects "promise clock=13"; this time it is aware of the uncommitted update
The new leader must recommit the original clock=14 update
Leader Failure (3)
[Diagram: the old leader sends "commit clock=14" to a quorum of replicas, then fails]
What happens if there is an uncommitted update with a quorum?
Increment the view and elect a new leader; send prepares, collect promises
Recall that the leader must collect promises from a quorum, and any two quorums intersect
So by definition, the new leader must become aware of the uncommitted update
The leader must recommit the original clock=14 update
The Devil in the Details
Clearly, Paxos is complicated
Things we haven't covered:
Reconciliation: how to bring a replica up-to-date
Managing the queue of updates from clients
Updates may be sent to any replica
Replicas are responsible for responding to clients who contact them
Replicas may need to re-forward updates if the leader changes
Garbage collection
Replicas need to remember the exact history of updates, in case the
leader changes
Periodically, the lists need to be garbage collected
Odds and Ends
Byzantine Generals
Gossip
Byzantine Generals Problem
Name coined by Leslie Lamport
Several Byzantine Generals are
laying siege to an enemy city
They have to agree on a
common strategy: attack or
retreat
They can only communicate by
messenger
Some generals may be traitors
(their identity is unknown)
Do you see the connection with the consensus problem?
[Diagram: mapping the Byzantine generals setting onto a distributed system]
Goals
1. All loyal lieutenants obey the same order
2. If the commanding general is loyal, then every loyal
lieutenant obeys the order he sends
Can the problem be solved?
Yes, iff there are at least 3m+1 generals in the presence of m traitors
E.g. if there are 3 generals, even 1 traitor makes the problem unsolvable
Bazillion variations on the basic problem
What if messages are cryptographically signed (i.e. they are unforgeable)?
What if communication is not g x g (i.e. some pairs of generals cannot communicate)?
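A small worked example of the 3m+1 bound quoted above.

```python
# From the 3m+1 bound: with n generals, at most floor((n-1)/3) traitors
# can be tolerated.

def max_traitors_tolerated(n_generals: int) -> int:
    return (n_generals - 1) // 3

print(max_traitors_tolerated(3))   # 0: with 3 generals, even 1 traitor is fatal
print(max_traitors_tolerated(4))   # 1: 4 = 3*1 + 1 generals tolerate 1 traitor
```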
Alternatives to Quorums
Quorums favor consistency over availability
If no quorum exists, then the system stops accepting writes
Significant overhead maintaining consistent replicated state
What if eventual consistency is okay?
Favor availability over consistency
Results may be stale or incorrect sometimes (hopefully only in
rare cases)
Gossip protocols
Replicas periodically, randomly exchange state with each
other
No strong consistency guarantees but
Surprisingly fast and reliable convergence to up-to-date state
Requires vector clocks or better in order to causally order
events
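A sketch of one push-pull gossip round over timestamped key-value state; the names and the convergence check are illustrative, and convergence is probabilistic rather than guaranteed.

```python
# Anti-entropy gossip sketch: each replica holds key -> (timestamp, value)
# and periodically merges state with one random peer. This gives eventual,
# not strong, consistency.

import random

def merge(a, b):
    """Take the newer (higher-timestamp) entry for every key in b."""
    for key, (ts, value) in b.items():
        if key not in a or a[key][0] < ts:
            a[key] = (ts, value)

def gossip_round(replicas):
    for state in replicas:
        peer = random.choice([s for s in replicas if s is not state])
        merge(state, peer)      # pull the peer's newer entries
        merge(peer, state)      # and push ours back (push-pull gossip)

replicas = [dict() for _ in range(5)]
replicas[0]["Bob"] = (1, 300)          # an update lands at one replica
for _ in range(4):                     # a few rounds spread it around
    gossip_round(replicas)
print(all(state.get("Bob") == (1, 300) for state in replicas))  # very likely True
```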
Sources
1. Some slides courtesy of Cristina Nita-Rotaru ([Link])
2. The Part-Time Parliament, Leslie Lamport. http://[Link]/en-us/um/people/lamport/pubs/[Link]
3. Paxos Made Simple, Leslie Lamport. http://[Link]/en-us/um/people/lamport/pubs/[Link]
4. Paxos for System Builders, Jonathan Kirsch and Yair Amir. [Link]jak/docs/paxos_for_system_builders.pdf
5. The Chubby Lock Service for Loosely-Coupled Distributed Systems, Mike Burrows. http://[Link]/archive/[Link]
6. Paxos Made Live - An Engineering Perspective, Tushar Deepak Chandra, Robert Griesemer, Joshua Redstone. [Link]