6.888: Lecture 3
Data Center Congestion Control
Mohammad Alizadeh
Spring 2016
Transport inside the DC
[Figure: servers connect through the DC fabric to the INTERNET. Internet paths: 100 Kbps–100 Mbps links, ~100 ms latency. Inside the DC: 10–40 Gbps links, ~10–100 μs latency.]
Interconnect for distributed compute workloads
[Figure: the fabric interconnects servers running web apps, cache, database, map-reduce, HPC, and monitoring.]
What’s Different About DC Transport?
Network characteristics
– Very high link speeds (Gb/s); very low latency (microseconds)
Application characteristics
– Large-scale distributed computation
Challenging traffic patterns
– Diverse mix of mice & elephants
– Incast
Cheap switches
– Single-chip shared-memory devices; shallow buffers
Data Center Workloads
Mice & Elephants
– Short messages (e.g., query, coordination): need low latency
– Large flows (e.g., data update, backup): need high throughput
Incast
• Synchronized fan-in congestion
[Figure: Workers 1–4 send responses to an Aggregator at the same time; Worker 4 suffers a TCP timeout, with RTOmin = 300 ms]
Vasudevan et al. (SIGCOMM ’09)
Incast in Bing
[Figure: MLA query completion time (ms) over the course of a morning]
– Requests are jittered over a 10 ms window; jittering trades off the median for the high percentiles
– Jittering was switched off around 8:30 am
DC Transport Requirements
1. Low Latency
– Short messages, queries
2. High Throughput
– Continuous data updates, backups
3. High Burst Tolerance
– Incast
The challenge is to achieve these together
High Throughput vs. Low Latency
Baseline fabric latency (propagation + switching): 10 microseconds
High throughput requires buffering for rate mismatches
… but this adds significant queuing latency
Data Center TCP
TCP in the Data Center
TCP [Jacobson et al. ’88] is widely used in the data center
– More than 99% of the traffic
Operators work around TCP problems
‒ Ad-hoc, inefficient, often expensive solutions
‒ TCP is deeply ingrained in applications
Practical deployment is hard – keep it simple!
Review: The TCP Algorithm
[Figure: two senders share a bottleneck switch toward a receiver; window size (rate) follows a sawtooth over time]
– Additive Increase: W → W+1 per round-trip time
– Multiplicative Decrease: W → W/2 per drop or ECN mark (1 bit)
– ECN = Explicit Congestion Notification
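For reference, a minimal sketch of the AIMD update just reviewed; this is illustrative Python, not any real stack’s code, and the function and variable names are invented.

```python
# Minimal AIMD sketch of the TCP window update described above.
# cwnd is in packets; names are illustrative.

def tcp_update_window(cwnd: float, ecn_marked: bool, packet_lost: bool) -> float:
    """Apply one round-trip's worth of TCP congestion avoidance."""
    if packet_lost or ecn_marked:
        # Multiplicative decrease: halve the window on any congestion signal.
        return max(1.0, cwnd / 2)
    # Additive increase: grow by one packet per RTT.
    return cwnd + 1.0
```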
TCP Buffer Requirement
Bandwidth-delay product rule of thumb:
– A single flow needs C×RTT buffers for 100% Throughput.
[Figure: throughput vs. buffer size B — with B < C×RTT the link is underutilized during window cuts; with B ≥ C×RTT throughput stays at 100%]
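A quick back-of-the-envelope using the link speed and fabric latency quoted earlier (10 Gbps, ~100 μs) shows how large this rule-of-thumb buffer is in packets:

```latex
C \times RTT = 10\,\mathrm{Gb/s} \times 100\,\mu\mathrm{s}
             = 10^{10}\,\tfrac{\mathrm{bit}}{\mathrm{s}} \times 10^{-4}\,\mathrm{s}
             = 10^{6}\,\mathrm{bit}
             = 125\,\mathrm{KB} \;\approx\; 83 \text{ full-size (1500 B) packets}
```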
Reducing Buffer Requirements
Appenzeller et al. (SIGCOMM ‘04):
– With a large number N of desynchronized flows, a buffer of C×RTT/√N is enough
[Figure: with many flows the individual window sawtooths average out, so a small buffer still sustains 100% throughput]
Reducing Buffer Requirements
Appenzeller et al. (SIGCOMM ‘04):
– With a large number N of flows, a buffer of C×RTT/√N is enough
Can’t rely on stat-mux benefit in the DC.
– Measurements show typically only 1-2 large flows at each server
Key Observation:
Low variance in sending rate → small buffers suffice
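To see why the stat-mux argument does not carry over, plug numbers into the Appenzeller rule above (C×RTT = 125 KB uses the earlier 10 Gbps / 100 μs example; the 10,000-flow figure is a hypothetical core-router scale, not from the lecture):

```latex
B = \frac{C \times RTT}{\sqrt{N}}:\qquad
N = 10{,}000 \;\Rightarrow\; B \approx 1.25\,\mathrm{KB};\qquad
N = 2 \;\Rightarrow\; B \approx 88\,\mathrm{KB}\ \text{(most of } C \times RTT\text{)}.
```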
DCTCP: Main Idea
Extract multi-bit feedback from single-bit stream of ECN marks
– Reduce window size based on fraction of marked packets.
ECN marks      TCP                  DCTCP
1011110111     Cut window by 50%    Cut window by 40%
0000000001     Cut window by 50%    Cut window by 5%
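These percentages follow from the DCTCP cut rule introduced on the next slide, W ← (1 − α/2)·W, assuming α has converged to the marked fraction F:

```latex
F = \tfrac{8}{10}:\; \text{cut} = \tfrac{\alpha}{2} = 40\%,
\qquad
F = \tfrac{1}{10}:\; \text{cut} = \tfrac{\alpha}{2} = 5\%,
\qquad\text{vs. TCP's fixed } 50\% \text{ cut on any mark.}
```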
[Figure: window size (bytes) vs. time (sec) — TCP’s deep sawtooth vs. DCTCP’s much smaller oscillations]
DCTCP: Algorithm
Switch side:
– Mark packets (set ECN) when instantaneous queue length > K
[Figure: queue of capacity B with marking threshold K — packets above K are marked, packets below are not]
Sender side:
– Maintain a running average of the fraction of packets marked (α). Each RTT:
  α ← (1 − g)·α + g·F,  where F = (# of marked ACKs) / (total # of ACKs)
– Adaptive window decrease:
  W ← (1 − α/2)·W
– Note: the window is cut by a factor between 1 (α ≈ 0) and 2 (α = 1)
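A compact sketch of the sender-side bookkeeping just described; class and variable names are illustrative and the constants are examples, not the Windows-stack implementation used in the evaluation.

```python
# Sketch of the DCTCP sender-side update from the slide's equations.
# Names and defaults are illustrative.

class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1 / 16):
        self.cwnd = cwnd      # congestion window, in packets
        self.g = g            # EWMA gain for alpha (small; 1/16 as an example)
        self.alpha = 0.0      # running estimate of the fraction of marked packets
        self.acked = 0        # ACKs seen in the current RTT
        self.marked = 0       # ECN-marked ACKs seen in the current RTT

    def on_ack(self, ecn_echo: bool) -> None:
        self.acked += 1
        if ecn_echo:
            self.marked += 1

    def on_rtt_end(self) -> None:
        """Once per RTT: update alpha and apply the adaptive window change."""
        F = self.marked / max(1, self.acked)            # fraction of marked ACKs
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if self.marked > 0:
            # Cut by a factor between 1 (alpha ~ 0) and 2 (alpha = 1).
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0                            # additive increase, as in TCP
        self.acked = self.marked = 0
```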
DCTCP vs TCP
Experiment: 2 flows (Win 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB
[Figure: queue length vs. time (seconds) for TCP and DCTCP with 2 flows — TCP drives the queue to several hundred KB, while DCTCP holds it near the 30 KB marking threshold]
– With DCTCP the buffer is mostly empty
– DCTCP mitigates incast by creating a large buffer headroom
Why it Works
1. Low Latency
Small buffer occupancies → low queuing delay
2. High Throughput
ECN averaging → smooth rate adjustments, low variance
3. High Burst Tolerance
Large buffer headroom → bursts fit
Aggressive marking → sources react before packets are
dropped
DCTCP Deployments
Discussion
What You Said?
Austin: “The paper's performance comparison to RED
seems arbitrary, perhaps RED had traction at the time?
Or just convenient as the switches were capable of
implementing it?”
Evaluation
Implemented in Windows stack.
Real hardware, 1Gbps and 10Gbps experiments
– 90 server testbed
– Broadcom Triumph 48 1G ports – 4MB shared memory
– Cisco Cat4948 48 1G ports – 16MB shared memory
– Broadcom Scorpion 24 10G ports – 4MB shared memory
Numerous micro-benchmarks
– Throughput and Queue Length
– Fairness and Convergence
– Multi-hop
– Incast
– Queue Buildup
– Static vs. Dynamic Buffer Mgmt
– Buffer Pressure
Bing cluster benchmark
Bing Benchmark (baseline)
[Figure: results for background flows and query flows]
Bing Benchmark (scaled 10x)
[Figure: completion time (ms) for query traffic (incast bursts) and for short messages (delay-sensitive)]
– Deep buffers fix incast, but increase latency
– DCTCP is good for both incast & latency
What You Said
Amy: “I find it unsatisfying that the details of many
congestion control protocols (such at these) are so
complicated! ... can we create a parameter-less
congestion control protocol that is similar in behavior to
DCTCP or TIMELY?”
Hongzi: “Is there a general guideline to tune the
parameters, like alpha, beta, delta, N, T_low, T_high, in
the system?”
A bit of Analysis
[Figure: switch queue of capacity B with marking threshold K]
How much buffering does DCTCP need for 100% throughput?
– Need to quantify the queue size oscillations (stability)
[Figure: window-size sawtooth over time, oscillating between (W*+1)(1−α/2) and W*+1; the packets sent in the last RTT of each period are the ones that get marked]
α = (# of packets sent in the last RTT of a period) / (# of packets sent in the whole period)
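Filling in the algebra behind this picture for a single flow (window grows by one packet per RTT, and only the last RTT of each period is marked) — a sketch following the paper’s steady-state approximation:

```latex
S(W_1, W_2) \approx \frac{W_2^2 - W_1^2}{2}
\quad\text{(packets sent while the window grows from } W_1 \text{ to } W_2\text{)}

\alpha \;=\; \frac{S\!\left(W^*,\, W^*+1\right)}
                  {S\!\left((W^*+1)(1-\alpha/2),\, W^*+1\right)}
\;\Longrightarrow\;
\alpha^2\!\left(1 - \frac{\alpha}{4}\right) = \frac{2}{W^*}
\;\Longrightarrow\;
\alpha \approx \sqrt{2/W^*} \quad (\text{small } \alpha).
```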
A bit of Analysis
[Figure: switch queue of capacity B with marking threshold K]
How small can queues be without loss of throughput?
– Need to quantify the queue size oscillations (stability)
– For DCTCP: K > (1/7) C×RTT suffices
– Compare TCP, which needs a buffer of about C×RTT
What assumptions does the model make?
What You Said
Anurag: “In both the papers, one of the difference I saw
from TCP was that these protocols don’t have the “slow
start” phase, where the rate grows exponentially
starting from 1 packet/RTT.”
Convergence Time
DCTCP takes at most ~40% more RTTs than TCP
– “Analysis of DCTCP: Stability, Convergence, and Fairness,” SIGMETRICS 2011
Intuition: DCTCP makes smaller adjustments than TCP, but makes
them much more frequently
[Figure: per-flow rate convergence over time, TCP vs. DCTCP]
TIMELY
Slides by Radhika Mittal (Berkeley)
Qualities of RTT
• Fine-grained and informative
• Quick response time
• No switch support needed
• End-to-end metric
• Works seamlessly with QoS
RTT correlates with queuing delay
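The reason the correlation holds: an RTT sample decomposes into terms that are fixed for a given path and segment size, plus the queuing term, so changes in RTT track changes in queuing delay.

```latex
RTT \;=\; \underbrace{t_{\text{propagation}} + t_{\text{serialization}}}_{\text{fixed for a given path / segment size}} \;+\; t_{\text{queuing}}
```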
What You Said
Ravi: “The first thing that struck me while reading these
papers was how different their approaches were. DCTCP even
states that delay-based protocols are "susceptible to noise in
the very low latency environment of data centers" and that
"the accurate measurement of such small increases in
queuing delay is a daunting task". Then, I noticed that there
is a 5 year gap between these two papers… “
Arman: “They had to resort to extraordinary measures to
ensure that the timestamps accurately reflect the time at
which a packet was put on wire…”
Accurate RTT Measurement
Hardware Assisted RTT Measurement
Hardware Timestamps
– mitigate noise in measurements
Hardware Acknowledgements
– avoid processing overhead
Hardware vs Software Timestamps
Kernel Timestamps introduce significant noise in RTT
measurements compared to HW Timestamps.
Impact of RTT Noise
Throughput degrades with increasing noise in RTT.
Precise RTT measurement is crucial.
TIMELY Framework
Overview
[Figure: data flows through the RTT Measurement Engine (using NIC timestamps), the Rate Computation Engine, and the Rate Pacing Engine, which emits paced data]
RTT Measurement Engine
[Figure: the sender transmits a segment at t_send; after the serialization delay, the segment experiences propagation & queuing delay to the receiver; a hardware ACK returns and completes at t_completion]
RTT = t_completion – t_send – serialization delay
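A sketch of the computation on this slide, using hardware timestamps and a hardware ACK; the parameter names (e.g. `segment_bytes`, `link_rate_bps`) are illustrative, not from the paper.

```python
# Sketch of TIMELY's RTT computation from the slide:
# RTT = t_completion - t_send - serialization delay of the segment.

def timely_rtt_ns(t_send_ns: int, t_completion_ns: int,
                  segment_bytes: int, link_rate_bps: float) -> float:
    """Return the propagation + queuing delay seen by one segment, in ns."""
    serialization_ns = segment_bytes * 8 / link_rate_bps * 1e9
    return (t_completion_ns - t_send_ns) - serialization_ns

# Example arithmetic: a 64 KB message on a 10 Gbps NIC adds ~52 us of
# serialization, which must be subtracted before the remainder is treated
# as a congestion signal.
```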
Algorithm Overview
Gradient-based increase / decrease
[Figure: RTT vs. time under the three cases]
– gradient = 0: RTT flat, queue steady
– gradient > 0: RTT rising, queue building up → decrease the rate
– gradient < 0: RTT falling, queue draining → increase the rate
Why a gradient? To navigate the throughput-latency tradeoff and ensure stability.
Why Does Gradient Help Stability?
– Feedback on the error only: e(t) = RTT(t) − RTT₀
– Feedback on the error and its derivative: e(t) + k·e′(t)
– Feeding back higher-order derivatives: observe not only the error, but the change in error – “anticipate” the future state
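One loose way to write this (a control-theory analogy, not an equation from the paper) is as a proportional-derivative update, where the derivative term reacts while the queue is still building:

```latex
\text{rate} \;\leftarrow\; \text{rate} \;-\; k_p\, e(t) \;-\; k_d\, \frac{de(t)}{dt},
\qquad e(t) = RTT(t) - RTT_0 .
```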
What You Said
Arman: “I also think that deducing the queue length
from the gradient model could lead to miscalculations.
For example, consider an Incast scenario, where many
senders transmit simultaneously through the same
path. Noting that every packet will see a long, yet
steady, RTT, they will compute a near-zero gradient and
hence the congestion will continue.”
Algorithm Overview
[Figure: three RTT regimes separated by thresholds Tlow and Thigh]
– RTT < Tlow: additive increase – better burst tolerance
– Tlow ≤ RTT ≤ Thigh: gradient-based increase / decrease – to navigate the throughput-latency tradeoff and ensure stability
– RTT > Thigh: multiplicative decrease – to keep tail latency within acceptable limits
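A condensed sketch of the rate computation described by these slides (additive increase below Tlow, multiplicative decrease above Thigh, gradient-based adjustment in between). The structure follows the slides; the constants and names are placeholders, not tuned values from the paper, and the paper’s hyper-active increase mode is omitted.

```python
# Sketch of a TIMELY-style rate computation engine (gradient-based AIMD).
# Constants are placeholders; the hyper-active increase (HAI) mode is omitted.

class TimelyRate:
    def __init__(self, rate_gbps: float, min_rtt_us: float,
                 t_low_us: float = 50.0, t_high_us: float = 500.0,
                 ewma_gain: float = 0.5, beta: float = 0.8,
                 delta_gbps: float = 0.01):
        self.rate = rate_gbps
        self.min_rtt = min_rtt_us        # used to normalize the gradient
        self.t_low, self.t_high = t_low_us, t_high_us
        self.ewma_gain, self.beta, self.delta = ewma_gain, beta, delta_gbps
        self.prev_rtt = None
        self.rtt_diff = 0.0              # EWMA of consecutive RTT differences

    def update(self, new_rtt_us: float) -> float:
        if self.prev_rtt is not None:
            diff = new_rtt_us - self.prev_rtt
            self.rtt_diff = (1 - self.ewma_gain) * self.rtt_diff + self.ewma_gain * diff
        self.prev_rtt = new_rtt_us
        gradient = self.rtt_diff / self.min_rtt        # normalized RTT gradient

        if new_rtt_us < self.t_low:                    # low RTT: probe for bandwidth
            self.rate += self.delta
        elif new_rtt_us > self.t_high:                 # tail-latency bound exceeded
            self.rate *= 1 - self.beta * (1 - self.t_high / new_rtt_us)
        elif gradient <= 0:                            # queue flat or draining
            self.rate += self.delta
        else:                                          # queue building: back off
            self.rate *= 1 - self.beta * gradient
        return self.rate
```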
Discussion
Implementation Set-up
TIMELY is implemented in the context of RDMA.
– RDMA write and read primitives used to invoke NIC
services.
Priority Flow Control is enabled in the network fabric.
– RDMA transport in the NIC is sensitive to packet drops.
– PFC sends out pause frames to ensure lossless network.
“Congestion Spreading” in Lossless Networks
[Figure: PFC PAUSE frames propagate hop by hop through the fabric, spreading congestion from one congested link to upstream switches and senders]
TIMELY vs PFC
[Results figures comparing TIMELY with PFC alone]
What You Said
Amy: “I was surprised to see that TIMELY performed so
much better than DCTCP. Did the lack of an OS-bypass
for DCTCP impact performance? I wish that the authors
had offered an explanation for this result.”
Next time: Load Balancing