UDP-based schemes for High Speed Networks
Presented By : Sumitha Bhandarkar
Presented On : 03.24.04
Agenda
• RBUDP
– E. He, J. Leigh, O. Yu, T. A. DeFanti, “Reliable Blast UDP : Predictable High Performance Bulk Data
Transfer”, IEEE Cluster Computing 2002, Chicago, Illinois, Sept 2002.
• Tsunami (No technical resources available)
– http://www.ncne.org/training/techs/2002/0728/presentations/200207-wallace1_files/v3_document.htm
• SABUL/UDT
– H. Sivakumar, R. L. Grossman, M. Mazzucco, Y. Pan, Q. Zhang, “Simple Available Bandwidth Utilization
Library for High-Speed Wide Area Networks”, to appear in Journal of Supercomputing, 2004.
– Y. Gu and R. Grossman, “UDT: An Application Level Transport Protocol for Grid Computing”, Second
International Workshop on Protocols for Fast Long-Distance Networks, February 2004 (PFLDnet 2004).
– Y. Gu and R. Grossman, “UDT: A Transport Protocol for Data Intensive Applications”, IETF DRAFT.
http://bebas.vlsm.org/v08/org/rfc-editor/internet-drafts/draft-gg-udt-00.txt
• GTP
– R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM
International Symposium on Cluster Computing and the Grid, April 2004. (CCGrid 2004)
2
TCP based Schemes
The problems
• Slow Startup
• Slow loss recovery
• RTT bias
• Burstiness caused by window control
• Large amount of “control traffic” due to per-packet ack
3
RBUDP
• Intended to be aggressive.
• Intended for high-bandwidth dedicated or QoS-enabled networks - not for deployment on the broader Internet.
• Uses UDP for data traffic and TCP for signaling traffic.
• Estimates available bandwidth on the network using Iperf/app_perf (NOTE: this requires user interaction, i.e., it is NOT automated …)
• Tries to send just below this rate in “blasts” to avoid losses (payload =
RTT * Estimated BW)
• If losses do occur within a “blast”, TCP is used to exchange loss reports
• Lost packets are recovered by retransmitting them in smaller “blasts” (a sender-loop sketch follows below)
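A minimal sketch of this sender-side loop (the helper names udp_send_paced and tcp_recv_loss_report are illustrative stand-ins, not the RBUDP library's API):

```cpp
// Sketch of the RBUDP sender loop: blast everything over UDP, then
// re-blast whatever the receiver's TCP loss report says is missing.
// The two helpers are stand-ins for real socket I/O.
#include <string>
#include <vector>

struct Packet { int seq; std::string data; };

void udp_send_paced(const Packet&, double /*rate_bps*/) { /* UDP sendto, paced */ }
std::vector<int> tcp_recv_loss_report() { return {}; /* read loss bitmap over TCP */ }

void rbudp_send(const std::vector<Packet>& payload, double est_bw_bps) {
    // First blast: the whole payload, paced just below the user-supplied
    // bandwidth estimate (from Iperf/app_perf).
    for (const Packet& p : payload) udp_send_paced(p, est_bw_bps);

    // Re-blast only the reported losses until the loss report is empty.
    for (;;) {
        std::vector<int> lost = tcp_recv_loss_report();
        if (lost.empty()) break;                       // receiver has everything
        for (int seq : lost) udp_send_paced(payload[seq], est_bw_bps);
    }
}
```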
4
RBUDP
E. He, J. Leigh, O. Yu, T. A. DeFanti, “Reliable Blast UDP : Predictable High Performance Bulk Data Transfer”, IEEE Cluster Computing 2002, Chicago, Illinois, Sept 2002.
5
RBUDP
Sample Results (with network bottleneck)
E. He, J. Leigh, O. Yu, T. A. DeFanti, “Reliable Blast UDP : Predictable High Performance Bulk Data Transfer”, IEEE Cluster Computing 2002, Chicago, Illinois, Sept 2002.
6
RBUDP
Sample Results (with receiver bottleneck)
E. He, J. Leigh, O. Yu, T. A. DeFanti, “Reliable Blast UDP : Predictable High Performance Bulk Data Transfer”, IEEE Cluster Computing 2002, Chicago, Illinois, Sept 2002.
7
RBUDP
Conclusions
Advantages
• Keeps the pipe as full as possible
• Avoids TCP’s per-packet ack interaction
• Paper provides an analytical model, so performance is “predictable”
Disadvantages
• Sending rate needs to be adjusted by the user (no means of automatically adjusting the sending rate in response to dynamic network conditions) - thus the solution is good ONLY in dedicated/QoS-supported networks.
• No flow control - a fast sender can flood a slow receiver. The offered solution is to use app_perf (a modified Iperf developed by the authors to take the receiver bottleneck into account) for bandwidth estimation.
8
Tsunami
• No technical papers. This information is from a presentation at the July 2002 NLANR/Internet2 Techs Workshop, available for download at http://www.indiana.edu/~anml/anmlresearch.html. The latest version is dated 12/09/02.
• Very simple and primitive scheme - NOT TCP-FRIENDLY
• Application level protocol - uses UDP for data and TCP for signaling
• Receiver keeps track of lost packets and requests retransmission
• So how is this different from RBUDP?
9
SABUL / UDT
• SABUL (Simple Available Bandwidth Utilization Library) uses UDP to
transfer data and TCP to transfer control information.
• UDT (UDP-based Data Transfer Protocol) uses UDP only for both data
and control information.
• UDT is the successor to SABUL.
• Both are application-level protocols, available as an open-source C++ library on Linux/BSD/Solaris and as NS-2 simulation modules.
10
SABUL / UDT
• Rate control: handles dynamic congestion - uses a constant rate control interval (called SYN, set to 0.01 seconds) to avoid RTT bias.
• Window-based flow control: used in slow start, to ensure that a fast sender does not swamp a slow receiver, and to limit unacknowledged packets.
• Selective positive acknowledgement (one per SYN) and immediate
negative acknowledgement.
• Uses both packet loss and packet delay for inferring congestion
• TCP-friendly - less aggressive than TCP in low-BDP networks; better than TCP in higher-BDP networks.
• PFLDnet 2004 claim: Orthogonal Design - the UDP-based framework can be used with any congestion control algorithm, and the UDT congestion control algorithm can be ported to any TCP implementation.
11
SABUL / UDT
Y. Gu and R. Grossman, “UDT: An Application Level Transport Protocol for Grid Computing”, PFLDnet2004.
12
SABUL / UDT
Rate Control (AIMD)
• Increase
– If the loss rate during the last SYN is less than a threshold (0.1%), the sending rate is increased.
– Old version (SABUL) :
– New version (UDT) :
– Estimated BW calculated using the packet-pair technique
– Every 16th data packet and its successor are sent back to back to form a packet pair
– Receiver uses a median filter on the interval between arrival times of each packet pair to estimate link capacity (see the sketch below)
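The SABUL/UDT increase formulas shown on the original slide are not reproduced here; the sketch below only illustrates the per-SYN check against the 0.1% threshold and the every-16th-packet pairing, with the increase amount left as an explicit placeholder:

```cpp
// Per-SYN increase check and packet-pair emission, as described above.
// rate_increase() is a placeholder for the SABUL/UDT formulas on the
// original slide, which are not reproduced here.
struct UdtSender {
    double pkts_per_syn = 1.0;   // current sending rate, packets per SYN

    double rate_increase() { return 0.0; /* placeholder; see the papers */ }

    // Called when the SYN (0.01 s) timer fires.
    void on_syn(double loss_rate_last_syn) {
        if (loss_rate_last_syn < 0.001)        // below the 0.1% threshold
            pkts_per_syn += rate_increase();   // otherwise leave the rate alone
    }

    // Every 16th data packet and its successor are sent back to back,
    // giving the receiver a packet pair for its median-filter estimate.
    static bool is_pair_start(long seq) { return seq % 16 == 0; }
};
```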
Y. Gu, X. Hong, M. Mazzucco and R. Grossman, “SABUL: A High Performance Data Transfer Protocol”, submitted for publication.
Y. Gu and R. Grossman, “UDT: An Application Level Transport Protocol for Grid Computing”, PFLDnet 2004.
13
SABUL / UDT
Rate Control (AIMD)
Y. Gu and R. Grossman, “UDT: An Application Level Transport Protocol for Grid Computing”, PFLDnet2004.
14
SABUL / UDT
Rate Control (AIMD)
• Decrease
– increase inter-packet time by 1/8 (or equivalently, decrease sending rate by 1/9) when one of these conditions holds:
– if the largest lost sequence number in the NAK is greater than the largest sequence number sent when the last decrease occurred
– if it is the 2^dec_count-th NAK since the last time the above condition was satisfied. dec_count is reset to 4 each time the first condition is satisfied, and incremented by 1 each time the second condition is satisfied.
– if a delay warning is received
– Loss information carried in NAKs is also compressed for losses of consecutive packets.
– No data is sent in the next SYN time after a decrease
– Delay warnings are generated by the receiver based on the observed RTT trend (a sketch of these decrease rules follows this slide)
15
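A sketch of these decrease rules; the exact way NAKs are counted between decreases is one reading of the bullets above, and the identifiers are illustrative:

```cpp
// Sketch of the NAK-driven decrease described above (illustrative names).
#include <cstdint>

struct UdtRateControl {
    double  inter_pkt_time_us = 10.0; // gap between data packets (SND period)
    int64_t last_dec_seq = -1;        // largest sent seq when we last decreased
    int     dec_count = 4;            // reset to 4 when condition 1 fires
    int     naks_since_reset = 0;
    bool    freeze_next_syn = false;

    void decrease() {
        inter_pkt_time_us *= 1.125;   // +1/8 on the gap == -1/9 on the rate
        freeze_next_syn = true;       // no data in the next SYN period
    }

    void on_nak(int64_t largest_lost_seq, int64_t largest_sent_seq) {
        if (largest_lost_seq > last_dec_seq) {
            // Condition 1: loss beyond the point of the last decrease.
            decrease();
            last_dec_seq = largest_sent_seq;
            dec_count = 4;
            naks_since_reset = 0;
        } else if (++naks_since_reset == (1 << dec_count)) {
            // Condition 2: this is the 2^dec_count-th NAK since the reset.
            decrease();
            ++dec_count;
        }
    }

    void on_delay_warning() { decrease(); }  // receiver saw a rising RTT trend
};
```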
SABUL / UDT
Rate Control (AIMD)
• Flow Control
– Receiver calculates the packet arrival rate (AS) using a median
filter and sends it back with the ACK
– On the sender side, if the AS value in the ACK is greater than 0, the window is updated from it (see the sketch after this list)
– During congestion, loss reports can be dropped or delayed. If the sender keeps sending new packets, it worsens the congestion. Flow control helps prevent this.
– Flow control is also used in the slow start phase
– starts with a flow window of 2
– similar to TCP
– used only at the beginning of a new session.
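A sketch of the sender-side window update. The slide does not show the formula; the EWMA weights and the AS*(RTT+SYN) target below are an assumption based on the UDT Internet draft listed on the agenda slide:

```cpp
// Sender-side flow window update. The 0.875/0.125 weights and the
// AS*(RTT+SYN) target are an assumption taken from the UDT draft.
struct UdtFlowControl {
    double window_pkts = 2.0;   // slow start begins with a flow window of 2

    // as_pkts_per_sec: receiver-reported arrival rate carried in the ACK;
    // rtt and syn are in seconds (SYN is the fixed 0.01 s period).
    void on_ack(double as_pkts_per_sec, double rtt, double syn) {
        if (as_pkts_per_sec > 0) {
            // EWMA toward the number of packets the receiver can absorb
            // in one RTT + SYN.
            window_pkts = window_pkts * 0.875
                        + as_pkts_per_sec * (rtt + syn) * 0.125;
        }
    }
};
```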
Y. Gu and R. Grossman, “UDT: An Application Level Transport Protocol for Grid Computing”, PFLDnet2004.
16
SABUL / UDT
Timers
• SYN timer - triggers the rate control event (fixed at 0.01 s); see the timer sketch below
• SND timer - schedules data packet sending (updated by the rate control scheme)
• ACK timer - triggers an ACK (same interval as SYN)
• NAK timer - used to trigger a NAK. Its interval is updated to the current RTT value each time the SYN timer expires.
• EXP timer - used to trigger data packet retransmission and maintain connection status. It is somewhat similar to the TCP RTO.
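A compact, illustrative view of how these timer intervals relate:

```cpp
// Compact, illustrative view of the five timers listed above.
struct UdtTimers {
    double syn = 0.010;   // fixed rate-control period (seconds)
    double snd = 0.0;     // inter-packet gap, set by the rate control scheme
    double ack = 0.010;   // ACK period, same as the SYN interval
    double nak = 0.0;     // NAK period, refreshed to the current RTT
    double exp = 0.0;     // retransmit/keepalive timeout, akin to TCP's RTO

    // Each time the SYN timer expires, the NAK interval is re-armed with
    // the latest RTT estimate.
    void on_syn_expired(double current_rtt_sec) { nak = current_rtt_sec; }
};
```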
17
SABUL / UDT
Simulation Results
(Plots for a 100 Mbps / 1 ms link and a 1 Gbps / 100 ms link)
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
18
SABUL / UDT
Simulation Results
(7 concurrent flows over a 100 Mbps bottleneck link)
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
19
SABUL / UDT
Simulation Results
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
20
SABUL / UDT
Simulation Results
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
21
SABUL / UDT
Real Implementation Results
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
22
SABUL / UDT
Real Implementation Results
(1 Gbps / 40 µs link)
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
23
SABUL / UDT
Real Implementation Results
(1 Gbps / 110 ms link)
I-TCP = TCP with concurrent UDT flows
S-TCP = TCP without concurrent UDT flows
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
24
SABUL / UDT
Real Implementation Results
Y. Gu and R. Grossman, “Using UDP for Reliable Data Transfer over High Bandwidth-Delay Product Networks”, submitted for publication.
25
SABUL / UDT
Conclusions
• From one of the SLAC talks [1] - “Looks good, BUT 4*CPU Utilization of TCP”
• Reordering robustness is worse than TCP - all out-of-order packets are treated as losses. The suggested solution is to delay NAK reports briefly.
• All losses are treated as congestion - bad performance at high link error rates. (Still better than TCP, though, since it does not respond to each and every loss event.)
• Router queue occupancy is kept smaller than with TCP, due to less burstiness.
• The increase algorithm relies on bandwidth estimation - may not be suitable for links with a large number of concurrent flows.
[1] http://www.slac.stanford.edu/grp/scs/net/talk03/pfld-feb04.ppt
26
GTP
Group Transport Protocol
• Motivated by the following observations about lambda grids
– Very high speed (1 Gig, 10 Gig, etc.) dedicated links connecting a small number of end points (e.g., 10^3, and not 10^8), possibly with long delays (e.g., 60 ms between experimental sites)
– Communication patterns are not necessarily just point-to-point; multipoint-to-point and multipoint-to-multipoint are very likely.
– The aggregate capacity of multiple connections could be far greater than the data handling speed of the end system, so end-point congestion is far more likely than network congestion
27
GTP
Overview
• Receiver-driven (dumb sender, very smart receiver)
• Request-response data transfer model (sketched below)
• Rate-based explicit flow control
• Receiver-centric max-min fair allocation across multiple flows (irrespective of individual RTTs)
• UDP for data, TCP for control connection.
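A rough sketch of the receiver-driven, request-response pattern (the control/data helpers are hypothetical stand-ins, not GTP's implementation):

```cpp
// Receiver-driven request-response transfer: the receiver decides how much
// each sender may transmit and asks for it over the TCP control channel;
// the data then arrives over UDP. Helpers are hypothetical stand-ins.
#include <vector>

struct GtpFlow { int id; double allocated_rate_bps; };

void tcp_send_request(int /*flow_id*/, int /*num_packets*/) { /* control msg */ }
int  udp_recv_data(int /*flow_id*/) { return 0; /* packets actually received */ }

void receiver_round(const std::vector<GtpFlow>& flows,
                    double interval_sec, double pkt_size_bits) {
    // Ask every sender for as much as its allocated rate permits this round.
    for (const GtpFlow& f : flows) {
        int n = static_cast<int>(f.allocated_rate_bps * interval_sec
                                 / pkt_size_bits);
        tcp_send_request(f.id, n);
    }
    // Collect the data; the per-flow statistics later feed the capacity
    // estimator and the max-min fairness scheduler.
    for (const GtpFlow& f : flows) {
        int received = udp_recv_data(f.id);
        (void)received;
    }
}
```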
28
GTP
Framework
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
29
GTP
Framework (cont.)
• Single Flow Controller (SFC): manages sending data packet requests, chooses/requests the sending rate, manages receiver buffer requirements
• Single Flow Monitor (SFM): measures flow statistics such as allocated rate, achieved rate, packet loss rate, RTT estimate, etc., which are used by both the SFC and the CE
• Capacity Estimator (CE): estimates the flow capacity for each individual flow based on statistics from the SFM
• Max-min Fairness Scheduler: estimates the max-min fair share for each individual flow
30
GTP
Flow Control and Rate Allocation
• Single Flow Controller (SFC):
– flow rate adjusted per RTT
– loss-proportional decrease and proportional increase for rate adaptation
• Capacity Estimator (CE):
– flow rate adjusted per centralized control interval (default 3*RTTmax)
– exponential increase and loss-proportional decrease (both loops are sketched below)
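An illustrative sketch of the two adaptation loops; the slide does not give the update rules, so the alpha/beta factors below are placeholders rather than GTP's actual formulas:

```cpp
// Illustrative sketch of the two adaptation loops. The alpha/beta factors
// are placeholders, not GTP's actual constants or formulas.
struct GtpFlowRate {
    double rate_bps;   // current allocated rate for this flow

    // Single Flow Controller: runs once per RTT.
    void sfc_update(double loss_rate, double alpha /*increase factor*/) {
        if (loss_rate > 0)
            rate_bps *= (1.0 - loss_rate);   // decrease in proportion to loss
        else
            rate_bps += alpha * rate_bps;    // proportional increase
    }

    // Capacity Estimator: runs once per control interval (default 3*RTTmax).
    void ce_update(double loss_rate, double beta /*growth factor > 1*/) {
        if (loss_rate > 0)
            rate_bps *= (1.0 - loss_rate);   // loss-proportional decrease
        else
            rate_bps *= beta;                // exponential increase
    }
};
```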
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
31
GTP
Flow Control and Rate Allocation (cont.)
• Target rate for each flow is
• Max-min Fairness Scheduler adjusts the target flow rates to ensure max-min fairness (a generic water-filling sketch follows)
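The scheduler's allocation can be pictured with the textbook max-min (water-filling) computation below; this is a generic sketch, not GTP's exact implementation:

```cpp
// Textbook max-min (water-filling) allocation of the receiver's capacity
// across flows; a generic sketch, not GTP's exact implementation.
#include <cstddef>
#include <vector>

std::vector<double> max_min_share(const std::vector<double>& demand,
                                  double capacity) {
    std::vector<double> alloc(demand.size(), 0.0);
    std::vector<std::size_t> active(demand.size());
    for (std::size_t i = 0; i < demand.size(); ++i) active[i] = i;

    while (!active.empty() && capacity > 0) {
        double fair = capacity / active.size();      // equal split of what's left
        std::vector<std::size_t> still_active;
        for (std::size_t i : active) {
            if (demand[i] <= fair) {
                alloc[i] = demand[i];                // satisfied flows keep their demand
                capacity -= demand[i];
            } else {
                still_active.push_back(i);           // the rest wait for the next round
            }
        }
        if (still_active.size() == active.size()) {  // nobody satisfied: split evenly
            for (std::size_t i : active) alloc[i] = fair;
            capacity = 0;
        }
        active.swap(still_active);
    }
    return alloc;
}
```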
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
32
GTP
Other Details
• The current implementation expects in-order delivery. It can be augmented in the future to handle out-of-order packets.
• TCP-friendliness is “tunable” by allocating a fixed share of the total bandwidth to TCP in the CE
• Currently, congestion detection is only loss-based. Future work will augment the algorithm to include delay-based congestion detection.
• Transition management ensures max-min fairness is maintained even
when flows join/leave dynamically.
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
33
GTP
Simulation Results
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
34
GTP
Simulation Results (Cont.)
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
35
GTP
Simulation Results (Cont.)
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
36
GTP
Emulation Results
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
37
GTP
Emulation Results (Cont.)
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
38
GTP
Real Implementation Results
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
39
GTP
Real Implementation Results (Cont.)
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
40
GTP
Real Implementation Results (Cont.)
R.X. Wu and A.A. Chien, “GTP: Group Transport Protocol for Lambda-Grids”, 4th IEEE/ACM International Symposium on Cluster Computing and the Grid, April 2004 (CCGrid 2004).
41
Questions ???
42
Extra Slides
43
Scatter/Gather DMA
• Optimization for improving network stack processing
• Under normal circumstances, data is copied between kernel and app memory
• This is required because the network device drivers read/write contiguous
memory locations, whereas applications use mapped virtual memory
• When the NIC drivers are capable of scatter/gather DMA, a scatter/gather list is maintained so that the NICs can read/write directly to the final memory location where the data is intended to go. The scatter/gather data structure makes the memory look contiguous to the NIC drivers (a syscall-level analogy is sketched below)
• All protocol processing is done by reference. Eliminating the memory copy has been shown to improve performance dramatically
• In practice, the process is a little more complicated. At the send side, copy-on-write should be enforced so that packets that have been sent but not yet acknowledged are not overwritten. At the receive side, page borders should be enforced …
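The same scatter/gather idea exists at the syscall level: a struct iovec list lets a single writev() call send from several non-contiguous buffers without copying them together first. This is only an analogy for the NIC-level DMA lists described above:

```cpp
// Syscall-level scatter/gather: writev() sends from several non-contiguous
// buffers in one call via an iovec list; NIC scatter/gather DMA applies the
// same idea one layer lower, in the driver's descriptor list.
#include <cstddef>
#include <sys/uio.h>   // struct iovec, writev()
#include <unistd.h>

ssize_t send_header_and_body(int fd, const char* hdr, size_t hdr_len,
                             const char* body, size_t body_len) {
    struct iovec iov[2];
    iov[0].iov_base = const_cast<char*>(hdr);    // protocol header buffer
    iov[0].iov_len  = hdr_len;
    iov[1].iov_base = const_cast<char*>(body);   // application payload buffer
    iov[1].iov_len  = body_len;
    return writev(fd, iov, 2);   // kernel gathers both without a user-space copy
}
```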
44
Packet Pair BW Estimation
• Two packets of same size (L) are transmitted back to back
• The bottleneck link capacity (C) is smaller than the capacity of all the other links (by definition)
• Packets face “transmission delay” at the bottleneck link
• As a result, they arrive at the receiver with a larger inter-packet delay than when they were sent
• This delay can be used to compute the bottleneck link capacity (sketched below)
• (Makes many assumptions; also works only with FIFO queuing)
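A sketch of the computation: with packet size L bits and receiver-side gap Δt seconds between the pair, the bottleneck capacity is C = L / Δt; UDT applies a median filter over many such pairs:

```cpp
// Packet-pair capacity estimate: C = L / gap, filtered over many pairs.
#include <algorithm>
#include <vector>

// pkt_size_bits: L; pair_gaps_sec: observed inter-arrival gaps (non-empty).
double packet_pair_capacity(double pkt_size_bits,
                            std::vector<double> pair_gaps_sec) {
    // Median filter over the observed gaps, as UDT's receiver does over
    // its periodic packet pairs.
    std::nth_element(pair_gaps_sec.begin(),
                     pair_gaps_sec.begin() + pair_gaps_sec.size() / 2,
                     pair_gaps_sec.end());
    double median_gap = pair_gaps_sec[pair_gaps_sec.size() / 2];
    return pkt_size_bits / median_gap;   // bits per second
}
```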
45