6.888: Lecture 3
Data Center Congestion Control
Mohammad Alizadeh
Spring 2016
Transport inside the DC
[Figure: servers connect through the DC fabric to the INTERNET. Internet paths: 100 Kbps–100 Mbps links, ~100 ms latency. Inside the DC: 10–40 Gbps links, ~10–100 μs latency.]
Interconnect for distributed compute workloads
[Figure: the fabric interconnects servers running web apps, cache, database, map-reduce, HPC, and monitoring.]
What’s Different About DC Transport?
Network characteristics
– Very high link speeds (Gb/s); very low latency (microseconds)
Application characteristics
– Large-scale distributed computation
Challenging traffic patterns
– Diverse mix of mice & elephants
– Incast
Cheap switches
– Single-chip shared-memory devices; shallow buffers
Data Center Workloads
Mice & Elephants
– Short messages (e.g., query, coordination): need low latency
– Large flows (e.g., data update, backup): need high throughput
Incast
• Synchronized fan-in congestion
[Figure: Workers 1–4 send responses to an Aggregator at the same time; Worker 4 suffers a TCP timeout, with RTOmin = 300 ms]
Vasudevan et al. (SIGCOMM ’09)
Incast in Bing
[Figure: MLA query completion time (ms) over the course of a morning]
– Requests are jittered over a 10 ms window; jittering trades off the median for the high percentiles
– Jittering was switched off around 8:30 am
DC Transport Requirements
1. Low Latency
– Short messages, queries
2. High Throughput
– Continuous data updates, backups
3. High Burst Tolerance
– Incast
The challenge is to achieve these together
High Throughput vs. Low Latency
Baseline fabric latency (propagation + switching): 10 microseconds
High throughput requires buffering for rate mismatches
… but this adds significant queuing latency
Data Center TCP
TCP in the Data Center
TCP [Jacobson et al. ’88] is widely used in the data center
– More than 99% of the traffic
Operators work around TCP problems
‒ Ad-hoc, inefficient, often expensive solutions
‒ TCP is deeply ingrained in applications
Practical deployment is hard – keep it simple!
Review: The TCP Algorithm
[Figure: two senders share a bottleneck switch toward a receiver; window size (rate) follows a sawtooth over time]
– Additive Increase: W → W+1 per round-trip time
– Multiplicative Decrease: W → W/2 per drop or ECN mark (1 bit)
– ECN = Explicit Congestion Notification
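For reference, a minimal sketch of the AIMD update just reviewed; this is illustrative Python, not any real stack’s code, and the function and variable names are invented.

```python
# Minimal AIMD sketch of the TCP window update described above.
# cwnd is in packets; names are illustrative.

def tcp_update_window(cwnd: float, ecn_marked: bool, packet_lost: bool) -> float:
    """Apply one round-trip's worth of TCP congestion avoidance."""
    if packet_lost or ecn_marked:
        # Multiplicative decrease: halve the window on any congestion signal.
        return max(1.0, cwnd / 2)
    # Additive increase: grow by one packet per RTT.
    return cwnd + 1.0
```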
TCP Buffer Requirement
Bandwidth-delay product rule of thumb:
– A single flow needs C×RTT buffers for 100% Throughput.
[Figure: throughput vs. buffer size B — with B < C×RTT the link is underutilized during window cuts; with B ≥ C×RTT throughput stays at 100%]
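A quick back-of-the-envelope using the link speed and fabric latency quoted earlier (10 Gbps, ~100 μs) shows how large this rule-of-thumb buffer is in packets:

```latex
C \times RTT = 10\,\mathrm{Gb/s} \times 100\,\mu\mathrm{s}
             = 10^{10}\,\tfrac{\mathrm{bit}}{\mathrm{s}} \times 10^{-4}\,\mathrm{s}
             = 10^{6}\,\mathrm{bit}
             = 125\,\mathrm{KB} \;\approx\; 83 \text{ full-size (1500 B) packets}
```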
Reducing Buffer Requirements
Appenzeller et al. (SIGCOMM ‘04):
– With a large number N of desynchronized flows, a buffer of C×RTT/√N is enough
[Figure: with many flows the individual window sawtooths average out, so a small buffer still sustains 100% throughput]
Reducing Buffer Requirements
Appenzeller et al. (SIGCOMM ‘04):
– With a large number N of flows, a buffer of C×RTT/√N is enough
Can’t rely on stat-mux benefit in the DC.
– Measurements show typically only 1-2 large flows at each server
Key Observation:
Low variance in sending rate → small buffers suffice
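To see why the stat-mux argument does not carry over, plug numbers into the Appenzeller rule above (C×RTT = 125 KB uses the earlier 10 Gbps / 100 μs example; the 10,000-flow figure is a hypothetical core-router scale, not from the lecture):

```latex
B = \frac{C \times RTT}{\sqrt{N}}:\qquad
N = 10{,}000 \;\Rightarrow\; B \approx 1.25\,\mathrm{KB};\qquad
N = 2 \;\Rightarrow\; B \approx 88\,\mathrm{KB}\ \text{(most of } C \times RTT\text{)}.
```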
DCTCP: Main Idea
Extract multi-bit feedback from single-bit stream of ECN marks
– Reduce window size based on fraction of marked packets.
ECN marks      TCP                  DCTCP
1011110111     Cut window by 50%    Cut window by 40%
0000000001     Cut window by 50%    Cut window by 5%
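These percentages follow from the DCTCP cut rule introduced on the next slide, W ← (1 − α/2)·W, assuming α has converged to the marked fraction F:

```latex
F = \tfrac{8}{10}:\; \text{cut} = \tfrac{\alpha}{2} = 40\%,
\qquad
F = \tfrac{1}{10}:\; \text{cut} = \tfrac{\alpha}{2} = 5\%,
\qquad\text{vs. TCP's fixed } 50\% \text{ cut on any mark.}
```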
[Figure: window size (bytes) vs. time (sec) — TCP’s deep sawtooth vs. DCTCP’s much smaller oscillations]
DCTCP: Algorithm
Switch side:
– Mark packets (set ECN) when instantaneous queue length > K
[Figure: queue of capacity B with marking threshold K — packets above K are marked, packets below are not]
Sender side:
– Maintain a running average of the fraction of packets marked (α). Each RTT:
  α ← (1 − g)·α + g·F,  where F = (# of marked ACKs) / (total # of ACKs)
– Adaptive window decrease:
  W ← (1 − α/2)·W
– Note: the window is cut by a factor between 1 (α ≈ 0) and 2 (α = 1)
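A compact sketch of the sender-side bookkeeping just described; class and variable names are illustrative and the constants are examples, not the Windows-stack implementation used in the evaluation.

```python
# Sketch of the DCTCP sender-side update from the slide's equations.
# Names and defaults are illustrative.

class DctcpSender:
    def __init__(self, cwnd: float = 10.0, g: float = 1 / 16):
        self.cwnd = cwnd      # congestion window, in packets
        self.g = g            # EWMA gain for alpha (small; 1/16 as an example)
        self.alpha = 0.0      # running estimate of the fraction of marked packets
        self.acked = 0        # ACKs seen in the current RTT
        self.marked = 0       # ECN-marked ACKs seen in the current RTT

    def on_ack(self, ecn_echo: bool) -> None:
        self.acked += 1
        if ecn_echo:
            self.marked += 1

    def on_rtt_end(self) -> None:
        """Once per RTT: update alpha and apply the adaptive window change."""
        F = self.marked / max(1, self.acked)            # fraction of marked ACKs
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if self.marked > 0:
            # Cut by a factor between 1 (alpha ~ 0) and 2 (alpha = 1).
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            self.cwnd += 1.0                            # additive increase, as in TCP
        self.acked = self.marked = 0
```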
DCTCP vs TCP
Experiment: 2 flows (Win 7 stack), Broadcom 1 Gbps switch, ECN marking threshold = 30 KB
[Figure: queue length vs. time (seconds) for TCP and DCTCP with 2 flows — TCP drives the queue to several hundred KB, while DCTCP holds it near the 30 KB marking threshold]
– With DCTCP the buffer is mostly empty
– DCTCP mitigates incast by creating a large buffer headroom
Why it Works
1. Low Latency
Small buffer occupancies → low queuing delay
2. High Throughput
ECN averaging → smooth rate adjustments, low variance
3. High Burst Tolerance
Large buffer headroom → bursts fit
Aggressive marking → sources react before packets are
dropped
DCTCP Deployments
Discussion
What You Said?
Austin: “The paper's performance comparison to RED
seems arbitrary, perhaps RED had traction at the time?
Or just convenient as the switches were capable of
implementing it?”
Evaluation
Implemented in Windows stack.
Real hardware, 1Gbps and 10Gbps experiments
– 90 server testbed
– Broadcom Triumph 48 1G ports – 4MB shared memory
– Cisco Cat4948 48 1G ports – 16MB shared memory
– Broadcom Scorpion 24 10G ports – 4MB shared memory
Numerous micro-benchmarks
– Throughput and Queue Length
– Fairness and Convergence
– Multi-hop
– Incast
– Queue Buildup
– Static vs. Dynamic Buffer Mgmt
– Buffer Pressure
Bing cluster benchmark
Bing Benchmark (baseline)
[Figure: results for background flows and query flows]
Bing Benchmark (scaled 10x)
[Figure: completion time (ms) for query traffic (incast bursts) and for short messages (delay-sensitive)]
– Deep buffers fix incast, but increase latency
– DCTCP is good for both incast & latency
What You Said
Amy: “I find it unsatisfying that the details of many
congestion control protocols (such at these) are so
complicated! ... can we create a parameter-less
congestion control protocol that is similar in behavior to
DCTCP or TIMELY?”
Hongzi: “Is there a general guideline to tune the
parameters, like alpha, beta, delta, N, T_low, T_high, in
the system?”
A bit of Analysis
[Figure: switch queue of capacity B with marking threshold K]
How much buffering does DCTCP need for 100% throughput?
– Need to quantify the queue size oscillations (stability)
[Figure: window-size sawtooth over time, oscillating between (W*+1)(1−α/2) and W*+1; the packets sent in the last RTT of each period are the ones that get marked]
α = (# of packets sent in the last RTT of a period) / (# of packets sent in the whole period)
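Filling in the algebra behind this picture for a single flow (window grows by one packet per RTT, and only the last RTT of each period is marked) — a sketch following the paper’s steady-state approximation:

```latex
S(W_1, W_2) \approx \frac{W_2^2 - W_1^2}{2}
\quad\text{(packets sent while the window grows from } W_1 \text{ to } W_2\text{)}

\alpha \;=\; \frac{S\!\left(W^*,\, W^*+1\right)}
                  {S\!\left((W^*+1)(1-\alpha/2),\, W^*+1\right)}
\;\Longrightarrow\;
\alpha^2\!\left(1 - \frac{\alpha}{4}\right) = \frac{2}{W^*}
\;\Longrightarrow\;
\alpha \approx \sqrt{2/W^*} \quad (\text{small } \alpha).
```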
A bit of Analysis
[Figure: switch queue of capacity B with marking threshold K]
How small can queues be without loss of throughput?
– Need to quantify the queue size oscillations (stability)
– For DCTCP: K > (1/7) C×RTT suffices
– Compare TCP, which needs a buffer of about C×RTT
What assumptions does the model make?
What You Said
Anurag: “In both the papers, one of the difference I saw
from TCP was that these protocols don’t have the “slow
start” phase, where the rate grows exponentially
starting from 1 packet/RTT.”
Convergence Time
DCTCP takes at most ~40% more RTTs than TCP
– “Analysis of DCTCP: Stability, Convergence, and Fairness,” SIGMETRICS 2011
Intuition: DCTCP makes smaller adjustments than TCP, but makes
them much more frequently
[Figure: per-flow rate convergence over time, TCP vs. DCTCP]
TIMELY
Slides by Radhika Mittal (Berkeley)
Qualities of RTT
• Fine-grained and informative
• Quick response time
• No switch support needed
• End-to-end metric
• Works seamlessly with QoS
RTT correlates with queuing delay
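The reason the correlation holds: an RTT sample decomposes into terms that are fixed for a given path and segment size, plus the queuing term, so changes in RTT track changes in queuing delay.

```latex
RTT \;=\; \underbrace{t_{\text{propagation}} + t_{\text{serialization}}}_{\text{fixed for a given path / segment size}} \;+\; t_{\text{queuing}}
```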
What You Said
Ravi: “The first thing that struck me while reading these
papers was how different their approaches were. DCTCP even
states that delay-based protocols are "susceptible to noise in
the very low latency environment of data centers" and that
"the accurate measurement of such small increases in
queuing delay is a daunting task". Then, I noticed that there
is a 5 year gap between these two papers… “
Arman: “They had to resort to extraordinary measures to
ensure that the timestamps accurately reflect the time at
which a packet was put on wire…”
Accurate RTT Measurement
Hardware Assisted RTT Measurement
Hardware Timestamps
– mitigate noise in measurements
Hardware Acknowledgements
– avoid processing overhead
Hardware vs Software Timestamps
Kernel Timestamps introduce significant noise in RTT
measurements compared to HW Timestamps.
Impact of RTT Noise
Throughput degrades with increasing noise in RTT.
Precise RTT measurement is crucial.
TIMELY Framework
Overview
[Figure: data flows through the RTT Measurement Engine (using NIC timestamps), the Rate Computation Engine, and the Rate Pacing Engine, which emits paced data]
RTT Measurement Engine
[Figure: the sender transmits a segment at t_send; after the serialization delay, the segment experiences propagation & queuing delay to the receiver; a hardware ACK returns and completes at t_completion]
RTT = t_completion – t_send – serialization delay
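A sketch of the computation on this slide, using hardware timestamps and a hardware ACK; the parameter names (e.g. `segment_bytes`, `link_rate_bps`) are illustrative, not from the paper.

```python
# Sketch of TIMELY's RTT computation from the slide:
# RTT = t_completion - t_send - serialization delay of the segment.

def timely_rtt_ns(t_send_ns: int, t_completion_ns: int,
                  segment_bytes: int, link_rate_bps: float) -> float:
    """Return the propagation + queuing delay seen by one segment, in ns."""
    serialization_ns = segment_bytes * 8 / link_rate_bps * 1e9
    return (t_completion_ns - t_send_ns) - serialization_ns

# Example arithmetic: a 64 KB message on a 10 Gbps NIC adds ~52 us of
# serialization, which must be subtracted before the remainder is treated
# as a congestion signal.
```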
Algorithm Overview
Gradient-based increase / decrease
[Figure: RTT vs. time under the three cases]
– gradient = 0: RTT flat, queue steady
– gradient > 0: RTT rising, queue building up → decrease the rate
– gradient < 0: RTT falling, queue draining → increase the rate
Why a gradient? To navigate the throughput-latency tradeoff and ensure stability.
Why Does Gradient Help Stability?
– Feedback on the error only: e(t) = RTT(t) − RTT₀
– Feedback on the error and its derivative: e(t) + k·e′(t)
– Feeding back higher-order derivatives: observe not only the error, but the change in error – “anticipate” the future state
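One loose way to write this (a control-theory analogy, not an equation from the paper) is as a proportional-derivative update, where the derivative term reacts while the queue is still building:

```latex
\text{rate} \;\leftarrow\; \text{rate} \;-\; k_p\, e(t) \;-\; k_d\, \frac{de(t)}{dt},
\qquad e(t) = RTT(t) - RTT_0 .
```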
What You Said
Arman: “I also think that deducing the queue length
from the gradient model could lead to miscalculations.
For example, consider an Incast scenario, where many
senders transmit simultaneously through the same
path. Noting that every packet will see a long, yet
steady, RTT, they will compute a near-zero gradient and
hence the congestion will continue.”
Algorithm Overview
[Figure: three RTT regimes separated by thresholds Tlow and Thigh]
– RTT < Tlow: additive increase – better burst tolerance
– Tlow ≤ RTT ≤ Thigh: gradient-based increase / decrease – to navigate the throughput-latency tradeoff and ensure stability
– RTT > Thigh: multiplicative decrease – to keep tail latency within acceptable limits
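A condensed sketch of the rate computation described by these slides (additive increase below Tlow, multiplicative decrease above Thigh, gradient-based adjustment in between). The structure follows the slides; the constants and names are placeholders, not tuned values from the paper, and the paper’s hyper-active increase mode is omitted.

```python
# Sketch of a TIMELY-style rate computation engine (gradient-based AIMD).
# Constants are placeholders; the hyper-active increase (HAI) mode is omitted.

class TimelyRate:
    def __init__(self, rate_gbps: float, min_rtt_us: float,
                 t_low_us: float = 50.0, t_high_us: float = 500.0,
                 ewma_gain: float = 0.5, beta: float = 0.8,
                 delta_gbps: float = 0.01):
        self.rate = rate_gbps
        self.min_rtt = min_rtt_us        # used to normalize the gradient
        self.t_low, self.t_high = t_low_us, t_high_us
        self.ewma_gain, self.beta, self.delta = ewma_gain, beta, delta_gbps
        self.prev_rtt = None
        self.rtt_diff = 0.0              # EWMA of consecutive RTT differences

    def update(self, new_rtt_us: float) -> float:
        if self.prev_rtt is not None:
            diff = new_rtt_us - self.prev_rtt
            self.rtt_diff = (1 - self.ewma_gain) * self.rtt_diff + self.ewma_gain * diff
        self.prev_rtt = new_rtt_us
        gradient = self.rtt_diff / self.min_rtt        # normalized RTT gradient

        if new_rtt_us < self.t_low:                    # low RTT: probe for bandwidth
            self.rate += self.delta
        elif new_rtt_us > self.t_high:                 # tail-latency bound exceeded
            self.rate *= 1 - self.beta * (1 - self.t_high / new_rtt_us)
        elif gradient <= 0:                            # queue flat or draining
            self.rate += self.delta
        else:                                          # queue building: back off
            self.rate *= 1 - self.beta * gradient
        return self.rate
```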
Discussion
Implementation Set-up
TIMELY is implemented in the context of RDMA.
– RDMA write and read primitives used to invoke NIC
services.
Priority Flow Control is enabled in the network fabric.
– RDMA transport in the NIC is sensitive to packet drops.
– PFC sends out pause frames to ensure lossless network.
“Congestion Spreading” in Lossless Networks
[Figure: PFC PAUSE frames propagate hop by hop through the fabric, spreading congestion from one congested link to upstream switches and senders]
TIMELY vs PFC
[Results figures comparing TIMELY with PFC alone]
What You Said
Amy: “I was surprised to see that TIMELY performed so
much better than DCTCP. Did the lack of an OS-bypass
for DCTCP impact performance? I wish that the authors
had offered an explanation for this result.”
Next time: Load Balancing