NetProfiler: Profiling Wide-Area Networks Using Peer Cooperation
Venkata N. Padmanabhan (Microsoft Research), Sriram Ramabhadran* (UC San Diego), Jitendra Padhye (Microsoft Research)

* The author was an intern at Microsoft Research during part of this work.
Abstract— Our work is motivated by two observations about the state of networks today. Operators have little visibility into the end users' network experience, while end users have little information or recourse when they encounter problems. We propose a system called NetProfiler, in which end hosts share network performance information with other hosts over a peer-to-peer network. The aggregated information from multiple hosts allows NetProfiler to profile the wide-area network, i.e., monitor end-to-end performance, and detect and diagnose problems from the perspective of end hosts. We define a set of attribute hierarchies associated with end hosts and their network connectivity. Information on the network performance and failures experienced by end hosts is then aggregated along these hierarchies, to identify patterns (e.g., shared attributes) that might be indicative of the source of the problem. In some cases, such sharing of information can also enable end hosts to resolve problems by themselves. The results from a 4-week-long Internet experiment indicate the promise of this approach.

I. INTRODUCTION

Our work is motivated by two observations about the state of networks today. First, operators have little direct visibility into the end users' network experience. Monitoring of network routers and links, while important, does not translate into direct knowledge of the end-to-end health of the network. This is because any single operator usually controls only a few of the components along an end-to-end path. On the other hand, although end users have direct visibility into their own network performance, they have little other information or recourse when they encounter problems. They do not know the cause of the problem or whether it is affecting other users as well.

To address these problems, we propose a system called NetProfiler, in which end hosts monitor the network performance and then share the information with other end hosts over a peer-to-peer network. End hosts, or "clients", are in the ideal position to do monitoring since they are typically the initiators of end-to-end transactions and have full visibility into the success or failure of the transactions. By examining the correlations, or the lack thereof, across observations made by different clients, NetProfiler can detect network anomalies and localize their likely cause. Besides anomaly detection and diagnosis, this system allows users (and also ISPs) to learn about the network performance experienced by other hosts. The following scenarios illustrate the use of NetProfiler:

• A user who is unable to access a web site can find out whether the problem is specific to his/her host or ISP, or whether it is a server problem. In the latter case, the user's client may be able to automatically discover working replicas of the site.

• A user can benchmark his/her long-term network performance against that of other users in the same city. This information can be used to drive decisions such as upgrading to a higher level of service (e.g., to 768 Kbps DSL from 128 Kbps service) or switching ISPs.

• A consumer ISP such as MSN can monitor the performance seen by its customers in various locations and identify, for instance, that the customers in a certain city are consistently underperforming those elsewhere. This can call for upgrading the service or switching to a different provider of modem banks, backhaul bandwidth, etc. in that city.

We view NetProfiler as an interesting and novel P2P application that leverages peers for network monitoring and diagnosis. Peer participation is critical in NetProfiler, since in the absence of such participation, it would be difficult to learn the end-host perspective from multiple vantage points. This is in contrast to traditional P2P applications such as content distribution, where it is possible to reduce or eliminate dependence on peers by employing a centralized infrastructure. Each end host is valuable in NetProfiler because of the perspective it provides on the health of the network, and not because of the (minimal) resources such as bandwidth and CPU that it contributes. Clearly, the usefulness and effectiveness of NetProfiler grow with the size of the deployment. In practice, NetProfiler can either be deployed in a coordinated manner by a network operator such as a consumer ISP or the IT department of an enterprise, or can grow organically as an increasing number of users install this new P2P "application".

To put NetProfiler in perspective, the state of the art in end-host-based network diagnosis is an individual user running tools such as ping and traceroute to investigate problems. However, this approach suffers from several drawbacks.

A key limitation of these tools is that they only capture information from the viewpoint of a single end host or network entity. Also, these tools focus only on entities such as routers and links that are on the IP-level path, whereas the actual cause of a problem might be higher-level entities such as proxies and servers. In contrast, NetProfiler considers the entire end-to-end transaction, and combines information from multiple vantage points, which enables better fault diagnosis.

Many of the existing tools also operate on a short time scale, usually on an as-needed basis. NetProfiler monitors, aggregates, and summarizes network performance data on a continuous basis. This allows NetProfiler to detect anomalies in performance based on historical comparisons.

Another important issue is that many of the tools rely on active probing. In contrast, NetProfiler relies on passive observation of existing traffic. Reliance on active probing is problematic for several reasons. First, the overhead of active probing can be high, especially if hundreds of millions of Internet hosts start using active probing on a routine basis.
Second, active probing cannot always disambiguate the cause of failure. For example, an incomplete traceroute could be due to a router or server failure, or simply because of the suppression of ICMP messages by a router or a firewall. Third, the detailed information obtained by client-based active probing (e.g., traceroute) may not pertain to the dominant direction of data transfer (typically server → client).

Thus we believe that it is important and interesting to consider strategies for monitoring and diagnosing network performance that do not rely on active probing, and that take a broad view of the network by considering the entire end-to-end path rather than just the IP-level path and by combining the view from multiple vantage points.

In the remainder of the paper, we discuss the architecture of NetProfiler, some details of its constituent components, open issues and challenges, and related work.

II. NETPROFILER ARCHITECTURE AND ALGORITHMS

We now discuss the architecture of NetProfiler and the algorithms used for the acquisition, aggregation, and analysis of network performance data.

A. Data Acquisition

Data acquisition is performed by sensors, which are software modules residing on end hosts such as users' desktop machines. Although these sensors could perform active measurements, our focus here is primarily on passive observation of existing traffic. The end host would typically have multiple sensors, say one for each protocol or application. Sensors could be defined for the common Internet protocols such as TCP, HTTP, DNS, and RTP/RTCP, as well as for protocols that are likely to be of interest in specific settings such as enterprise networks (e.g., the RPC protocol used by Microsoft Exchange servers and clients). The goal of the sensors is both to characterize the end-to-end communication in terms of success/failure and performance, and also to infer the conditions on the network path.

We have implemented two simple sensors — TcpScope and WebScope — to analyze TCP and HTTP, respectively. The widespread use of these protocols makes these sensors very useful. We now describe them briefly.

TcpScope: TcpScope is a passive sensor that listens in on TCP transfers to and from the end host, and attempts to determine the cause of any performance problems. Our current implementation operates at user level in conjunction with the NetMon or WinDump filter driver on Windows XP. Since the user's machine is typically at the receiving end of TCP connections, it is challenging to estimate metrics such as the connection's RTT, congestion window size, etc. We outline a set of heuristics that are inspired by T-RAT [21] but are simpler since we have access to the client host.

An initial RTT sample is obtained from the SYN-SYNACK exchange. Further RTT samples are obtained by identifying flights of data separated by idle periods during the slow-start phase. The RTT estimate can be used to obtain an estimate of the sender's congestion window (cwnd). A rough estimate of the bottleneck bandwidth is obtained by observing the spacing between pairs of back-to-back packets emitted during slow start. (We can determine whether two packets were likely sent back-to-back by the sender by examining their IP IDs.) Using estimates of the RTT, cwnd, and bottleneck bandwidth, we can determine the likely cause of rate limitation: whether the application itself is not producing enough data or whether an external factor such as a bandwidth bottleneck or packet loss is responsible.

Our initial experiments indicate that the TcpScope heuristics perform well. In ongoing work, we are conducting more extensive experiments in wide-area settings.
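To make these heuristics concrete, the following sketch shows the kind of logic a TcpScope-like sensor could apply to a received packet trace. It is a minimal illustration under our own simplifications (the Packet record and a pre-parsed trace are assumed, and retransmissions, reordering, and delayed ACKs are ignored), not the actual TcpScope implementation.

```python
# Sketch of TcpScope-style passive estimation (illustrative only).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Packet:
    ts: float     # arrival time in seconds
    size: int     # TCP payload bytes
    ip_id: int    # sender's IP ID field

def rtt_from_handshake(syn_ts: float, synack_ts: float) -> float:
    # Initial RTT sample: gap between our SYN and the server's SYN-ACK.
    return synack_ts - syn_ts

def split_into_flights(pkts: List[Packet], rtt: float) -> List[List[Packet]]:
    # During slow start, flights of data arrive separated by idle periods
    # of roughly one RTT; treat a gap above half an RTT as a flight boundary.
    flights, current = [], [pkts[0]]
    for prev, cur in zip(pkts, pkts[1:]):
        if cur.ts - prev.ts > 0.5 * rtt:
            flights.append(current)
            current = []
        current.append(cur)
    flights.append(current)
    return flights

def cwnd_estimate(flight: List[Packet]) -> int:
    # The sender's congestion window is roughly the data in one flight.
    return sum(p.size for p in flight)

def bottleneck_bw(pkts: List[Packet]) -> Optional[float]:
    # Packet-pair estimate: consecutive IP IDs suggest the sender emitted
    # the two packets back-to-back, so their arrival spacing reflects the
    # bottleneck link. Returns bytes/second (median of the samples).
    samples = sorted(p2.size / (p2.ts - p1.ts)
                     for p1, p2 in zip(pkts, pkts[1:])
                     if p2.ip_id == p1.ip_id + 1 and p2.ts > p1.ts)
    return samples[len(samples) // 2] if samples else None
```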
WebScope: In certain settings such as enterprise networks, the clients' web connections might traverse a caching proxy, so TcpScope would only be able to observe the dynamics of the network path between the proxy and the client. To provide some visibility into the conditions on the network path beyond the proxy, we have implemented the WebScope sensor. For an end-to-end web transaction, WebScope is able to estimate the contributions of the proxy, the server, and the server–proxy and proxy–client network paths to the overall latency. The main idea is to use a combination of cache-busting and byte-range HTTP requests to decompose the end-to-end latency.

WebScope produces less detailed information than TcpScope but still offers a rough indication of the performance of the individual components on the client-proxy-server path. WebScope focuses on the first-level proxy between the client and the origin server. It ignores additional intermediate proxies, if any. This is just as well, since such proxies are typically not visible to the client, and so the client does not have the option of picking between multiple alternatives. Finally, we note that WebScope can operate in a "pseudo-passive" mode by manipulating the cache-control and byte-range headers on existing HTTP requests.
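The text does not spell out WebScope's exact request sequence, so the following is just one plausible sketch of the cache-busting/byte-range idea: a tiny byte-range request that the proxy can answer from its cache approximately isolates the proxy and the proxy–client path, while a cache-busting request forces the full trip through the origin server, so the difference approximates the server plus server–proxy contribution. The helper names and the use of urllib are our own.

```python
# Sketch of WebScope-style latency decomposition (one plausible reading
# of the cache-busting + byte-range idea; not the actual implementation).
import time
import urllib.request

def timed_get(url: str, headers: dict) -> float:
    # Issue a single HTTP GET with the given headers; return its latency.
    req = urllib.request.Request(url, headers=headers)
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return time.monotonic() - start

def decompose_latency(url: str) -> dict:
    # A one-byte range request that the proxy can serve from its cache
    # mostly measures the proxy itself plus the proxy-client path...
    near = timed_get(url, {"Range": "bytes=0-0"})
    # ...while a cache-busting request forces the proxy to contact the
    # origin server, adding the server and server-proxy components.
    full = timed_get(url, {"Cache-Control": "no-cache", "Pragma": "no-cache"})
    return {"proxy_and_client_path": near,
            "server_and_server_proxy_path": max(full - near, 0.0)}
```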
B. Normalization

The data produced by the sensors at each node needs to be "normalized" before it can be meaningfully shared with other nodes. For instance, the throughput observed by a dialup client might be consistently lower than that observed by a LAN client at the same location, and yet this does not represent an anomaly. On the other hand, the failure to download a page is information that can be shared regardless of the client's access link speed.

We propose dividing clients into a few different bandwidth classes based on their access link (downlink) speed — dialup, low-end broadband (say under 250 Kbps), high-end broadband (say under 1.5 Mbps), and LAN (10 Mbps and above). Clients could determine their bandwidth class either based on the estimates provided by TcpScope or based on out-of-band information (e.g., user knowledge).
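A trivial classifier along these lines is sketched below; the 250 Kbps and 1.5 Mbps boundaries come from the text, while the dialup cutoff and the handling of the gap between 1.5 and 10 Mbps are our own assumptions.

```python
def bandwidth_class(downlink_kbps: float) -> str:
    # Map an access-link (downlink) speed estimate to a bandwidth class.
    if downlink_kbps < 64:        # assumed cutoff, just above V.90 dialup
        return "dialup"
    if downlink_kbps < 250:       # "say under 250 Kbps"
        return "low-end broadband"
    if downlink_kbps < 1500:      # "say under 1.5 Mbps"
        return "high-end broadband"
    return "LAN"                  # nominally 10 Mbps and above
```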
The bandwidth class of a node is included in its set of attributes for the purposes of aggregating certain kinds of information using the procedure discussed in Section II-C. Information of this kind includes the TCP throughput and possibly also the RTT and the packet loss rate. For TCP throughput, we use the information inferred by TcpScope to filter out measurements that were limited by factors such as the receiver-advertised window or the connection length.
Regarding the latter, the throughput corresponding to the largest window (i.e., flight) that experienced no loss is likely to be more meaningful than the throughput of the entire connection.

Certain information, such as the RTT, is strongly influenced by a client's location. So it is meaningful to share this information only with clients at the same location (e.g., the same city).

Certain other information can be aggregated across all clients regardless of their location or access link speed. Examples include the success or failure of a page download and an indication of server or proxy load obtained from TcpScope or WebScope.

Finally, certain sites may have multiple replicas, with clients in different parts of the network communicating with different replicas. As such, it makes sense to report detailed performance information on a per-replica basis and also to report less detailed information (e.g., just an indication of download success or failure) on a per-site basis. The latter information would enable clients connected to a poorly performing replica to discover that the site is accessible via other replicas.

C. Data Aggregation

We now discuss how the performance information gathered at the individual end hosts is shared and aggregated across nodes. Our approach is based on a decentralized peer-to-peer architecture, which spreads the burden of aggregating information across all nodes.

The process of data aggregation and analysis is performed based on a set of client attributes. For both fault isolation and comparative analysis, it is desirable to compare the performance of clients that share certain attributes, as well as those that differ in certain attributes. Attributes may be hierarchical, in which case they define a logical hierarchy along which performance data can be aggregated. Examples of hierarchical attributes are:

• Geographical location: Aggregation based on location is useful for users and network operators to detect performance trends specific to a particular location (e.g., "How are users in the Seattle area performing?"). Location yields a natural aggregation hierarchy, e.g., neighborhood → city → region → country.

• Topological location: Aggregation based on topological location is useful for users to make informed choices regarding their service provider (e.g., "Is my local ISP the reason for the poor performance I am seeing?"). It is also useful for network providers to identify performance bottlenecks in their networks. Topological location can also be aggregated along a hierarchy, e.g., subnet → PoP → ISP.

Alternatively, attributes can be non-hierarchical, in which case they are used to filter performance data to better analyze trends specific to that particular attribute. Examples of non-hierarchical attributes include:

• Destination site: Filtering based on destination site is useful to provide information on whether other users are able to access a particular website, and if so, what performance they are seeing (e.g., "Are other users also having problems accessing [Link]?"). Although not hierarchical, in the case of replicated sites the destination site can be further refined based on the actual replica being accessed.

• Bandwidth class: Filtering based on bandwidth class is useful for users to compare their performance with other users within the same class (e.g., "How are all dialup users faring?"), as well as in other classes ("What performance can I expect if I switch to DSL?").

Aggregation based on attributes such as location is done in a hierarchical manner, with the aggregation tree mirroring the logical hierarchy defined by the attribute space. This is based on the observation that nodes are typically interested in detailed information only from "nearby" peers; they are satisfied with more aggregated information about distant peers. For instance, while a node might be interested in specific information, such as the download performance from a popular web site, pertaining to peers in its neighborhood, it has little use for such detailed information from nodes across the country. For the latter, it is likely to be interested only in an aggregated view of the performance experienced by clients in the remote city or region.

Non-hierarchical attributes such as bandwidth class and destination site are used as filters that qualify performance data as it is aggregated up the logical hierarchy described above. For example, each node in the hierarchy may organize the performance reports it receives based on bandwidth class, destination site, and perhaps their cross-product. This enables the system to provide more fine-grained performance trends (e.g., "What is the performance seen by dialup clients in Seattle when accessing [Link]?"). Conceptually, this is similar to maintaining a different aggregation tree for each combination of attributes; in practice, it is desirable to realize this in a single hierarchy, as doing so limits the number of times an end host has to report the same performance record. Since the number of bandwidth classes is small, it is feasible to maintain separate hierarchies for each class. With destination sites, however, this is done only for a manageable number of popular sites. For less popular sites, it may be infeasible to maintain per-site trees, so only a single aggregated view of the site is maintained, at the cost of losing the ability to further refine based on other attributes.

Finally, mechanisms are required to map the above logical aggregation hierarchies to a physical hierarchy of nodes. To this end, we leverage DHT-based aggregation techniques such as SDIMS [19], which exploit the natural hierarchy yielded by the connectivity structure of the DHT nodes. Aggregation happens in a straightforward manner: nodes maintain information on the performance experienced by clients in their subtree. Periodically, they report aggregated views of this information to their parent. Such a design results in good locality properties, ensures efficiency of the aggregation hierarchy, and minimizes extraneous dependencies (e.g., the aggregator node for a client site lies within the same site).
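As a concrete illustration of this scheme, the sketch below keeps failure statistics keyed by every prefix of the location hierarchy crossed with the two filter attributes. It is a centralized stand-in for the SDIMS-based distributed realization described above; the record layout, attribute names, and wildcard convention are ours, and the example site is hypothetical.

```python
# Sketch of attribute-based aggregation (centralized for clarity;
# NetProfiler would distribute this over a DHT hierarchy).
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Report:
    location: Tuple[str, ...]  # e.g. ("US", "WA", "Seattle")
    bw_class: str              # non-hierarchical filter attribute
    site: str                  # destination site, e.g. "example.com"
    success: bool

class Aggregator:
    def __init__(self):
        # (location prefix, bw_class, site) -> [failures, total]
        self.stats = defaultdict(lambda: [0, 0])

    def add(self, r: Report) -> None:
        # Update every level of the location hierarchy, keyed by the
        # cross-product of the filter attributes ("*" means any value).
        for depth in range(len(r.location) + 1):
            prefix = r.location[:depth]
            for bw in ("*", r.bw_class):
                for site in ("*", r.site):
                    cell = self.stats[(prefix, bw, site)]
                    cell[0] += 0 if r.success else 1
                    cell[1] += 1

    def failure_rate(self, prefix=(), bw="*", site="*") -> float:
        failures, total = self.stats[(prefix, bw, site)]
        return failures / total if total else 0.0

# e.g. "What is the failure rate seen by dialup clients in Seattle when
# accessing example.com?":
#   agg.failure_rate(("US", "WA", "Seattle"), "dialup", "example.com")
```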
D. Analysis and Diagnosis

We now discuss the kinds of analyses and diagnoses that NetProfiler enables.
1) Distributed Blame Allocation: Clients that are experiencing poor performance can diagnose the problem using a procedure that we term distributed blame allocation. Conceptually, the idea is for a client to ascribe the poor performance that it is experiencing to the entities involved in the end-to-end transaction. The entities could include the server, the proxy, DNS (the DNS latency may not be directly visible to a client if the request is made via a proxy), and the network path, where the resolution of the path would depend on the information available (e.g., the full AS-level path or simply the ISP/PoP that the client connects to). The simplest policy is for a client to ascribe the blame equally to all of the entities. But a client could assign blame unequally if it suspects certain entities more, say based on information gleaned from local sensors such as TcpScope and WebScope.

Such blame information is then aggregated across clients. The aggregate blame assigned to an entity is normalized to reflect the fraction of transactions involving the entity that encountered a problem. The entities with the largest blame scores are inferred to be the likely trouble spots.

The hierarchical aggregation scheme discussed in Section II-C naturally supports this distributed blame allocation scheme. Clients use the performance they experienced to update the performance records of entities at each level of the hierarchy. Finding the suspect entity is then a question of walking up the attribute hierarchy to identify the highest-level entity whose aggregated performance information indicates a problem (based on suitably picked thresholds). The preference for picking an entity at a higher level reflects the assumption that a single shared cause for the observed performance problems is more likely than multiple separate causes. For instance, if clients connected to most of the PoPs of Verizon are experiencing problems, then the chances are that there is a general problem with Verizon's network rather than a separate problem at each individual PoP.
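A minimal sketch of this bookkeeping follows, assuming the equal-split default; the entity naming scheme and the example threshold are illustrative choices of ours, not values from the paper.

```python
# Sketch of distributed blame allocation (illustrative). Each failed
# transaction spreads one unit of blame over the entities involved;
# aggregate blame is then normalized by how often each entity appeared.
from collections import defaultdict
from typing import List, Optional, Sequence, Tuple

blame = defaultdict(float)    # entity -> accumulated blame
involved = defaultdict(int)   # entity -> transactions it took part in

def record_transaction(entities: Sequence[str], failed: bool,
                       weights: Optional[Sequence[float]] = None) -> None:
    # entities: e.g. ["server:example.com", "proxy:p1", "dns", "isp-pop:SEA"]
    # weights: optional per-entity suspicion; defaults to an equal split.
    if weights is None:
        weights = [1.0 / len(entities)] * len(entities)
    for entity, w in zip(entities, weights):
        involved[entity] += 1
        if failed:
            blame[entity] += w

def suspects(threshold: float = 0.3) -> List[Tuple[float, str]]:
    # Entities whose normalized blame score exceeds the threshold are
    # reported as the likely trouble spots, worst first.
    scores = {e: blame[e] / involved[e] for e in involved}
    return sorted(((s, e) for e, s in scores.items() if s >= threshold),
                  reverse=True)
```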
2) Comparative Analysis: A client might benefit from knowledge of its network performance relative to that of other clients, especially those in the same vicinity (e.g., the same city). Such knowledge can drive decisions such as whether to upgrade to a higher level of service or switch ISPs. For instance, a user who consistently sees worse performance than others on the same ISP network and in the same neighborhood can demand an investigation by the ISP; in the absence of comparative information, the user wouldn't even know to complain. A user who is considering upgrading from low-end to high-end DSL service could compare notes with existing high-end DSL users in the same locale to see how much improvement an upgrade would actually yield, rather than simply going by the speed advertised by the ISP.

Likewise, a consumer ISP that buys infrastructural services such as modem banks and backhaul bandwidth from third-party providers can monitor the performance experienced by its customers in different locations. If it finds, for instance, that its customers in Seattle are consistently underperforming customers elsewhere, it would have reason to suspect the local infrastructure provider(s) in Seattle.

3) Network Engineering Analysis: A network operator could use detailed information gleaned from clients to make an informed decision on how to re-engineer or upgrade the network. For instance, consider the IT department of a large global enterprise that is tasked with provisioning network connectivity for dozens of corporate sites spread across the globe. There is a plethora of choices in terms of connectivity options (ranging from expensive leased lines to the cheaper alternative of a VPN over the public Internet), service providers, bandwidth, etc. The task is typically to balance the twin goals of low cost and good performance. While existing tools and methodologies (based, say, on monitoring link utilization) are useful, the ultimate test is how well the network serves end users in their day-to-day activities. NetProfiler provides an end-user perspective on network performance, thereby complementing existing monitoring tools and enabling more informed network engineering decisions. For instance, a significant packet loss rate coupled with the knowledge that the egress link utilization is low might point to a problem with the chosen service provider and might suggest switching to a leased-line alternative. Poor end-to-end performance despite a low packet loss rate could be due to a large RTT, which could again be determined from NetProfiler observations. Remedial measures might include setting up a local proxy cache or server replica.

4) Network Health Reporting: The information gathered by NetProfiler can be used to generate reports on the health of wide-area networks such as the Internet or large enterprise networks. While such reports are available today from organizations such as Keynote [4], the advantage of the NetProfiler approach is lower cost, greater coverage, and the ability to operate virtually unchanged in restricted environments such as corporate networks as well as the public Internet.

III. EXPERIMENTAL RESULTS

We present some preliminary experimental observations to provide a flavor of the kinds of problems that the NetProfiler system could address. Our experimental setup consists of a heterogeneous set of clients that repeatedly download content from a diverse set of 70 web sites during a 4-week period (Oct 1-29, 2004). The client set includes 147 PlanetLab nodes, dialup hosts connected to 26 PoPs on the MSN network, and 5 hosts on Microsoft's worldwide corporate network. Our goal was to emulate, within the constraints of the resources at our disposal, a set of clients running NetProfiler and sharing information to diagnose problems. Here are a few interesting observations:

• We observed several failure episodes during which accesses to a web site failed at most or all of the clients. Examples include failure episodes involving [Link] and [Link]. The widespread impact across clients in diverse locations suggests a server-side cause for these problems. It would be hard to make such a determination based just on the view from a single client.

• There are significant differences in the failure rate observed by clients that are seemingly "equivalent". Among the MSN dialup nodes, those connected to PoPs with
ICG as the upstream provider experienced a much lower failure rate (0.2-0.3%) than those connected to PoPs with other upstream providers such as Qwest and UUNET (1.6-1.9%). This information can help MSN identify underperforming providers and take the necessary action to rectify the problem. Similarly, clients at CMU have a much higher failure rate (1.65%) than those at Berkeley (0.19%). This information can enable users at CMU to pursue the matter with their local network administrators.

• Sometimes a group of clients shares a certain network problem that is not affecting other clients. The attribute(s) shared by the group might suggest the cause of the problem. For example, all 5 hosts on the Microsoft corporate network experience a high failure rate (8%) in accessing [Link], whereas the failure rate for other clients is negligible. Since the Microsoft clients are located in different countries and connect via different web proxies with distinct WAN connectivity, the problem is likely due to a common proxy configuration across the sites.

• In other instances, the problem is unique to a specific client-server pair. For example, the Microsoft corporate network node in China is never able to access [Link], whereas other nodes, including the ones at the other Microsoft sites, do not experience a problem. This suggests that the problem is specific to the path between the China node and [Link] (e.g., site blocking by the local provider). If we had access to information from multiple clients in China, we might be in a position to further disambiguate the possible causes.

IV. DISCUSSION

A. Deployment Models

We envision two deployment models for NetProfiler: coordinated and organic. In the coordinated model, NetProfiler is deployed by an organization such as the IT department of a large enterprise, to complement existing tools for network monitoring and diagnosis. The fact that all client hosts are in a single administrative domain simplifies the issues of deployment and security. In the organic model, on the other hand, NetProfiler is installed by end users themselves (e.g., on their home machines) in much the same way as they install other peer-to-peer applications. They might do so to obtain greater visibility into the causes of the network connectivity and performance problems that they encounter. This is a more challenging deployment model, since issues of privacy and security as well as bootstrapping the system become more significant. We discuss these challenges next.

B. Bootstrapping

To be effective, NetProfiler requires the participation of a sufficient number of clients that overlap and differ in attributes, so that meaningful comparisons can be made and conclusions drawn. The coordinated model makes this bootstrapping easy, since the IT department can very quickly deploy NetProfiler on a large number of clients in various locations throughout the enterprise, essentially by fiat.

Bootstrapping is much more challenging in the organic deployment model, where users install NetProfiler by choice. There is a chicken-and-egg problem between having a sufficient number of users to make the system useful and making the system useful enough to attract more users. To help bootstrap the system, we propose relaxing the insistence on passive monitoring by allowing a limited amount of active probing (e.g., web downloads that the client would not have performed in the normal course of events). Clients could perform active downloads either autonomously (e.g., like Keynote clients) or in response to requests from peers. Of course, the latter option should be used with caution to avoid becoming a vehicle for attacks or offending users, say by downloading from "undesirable" sites. In any case, once the deployment has reached a certain size, active probing could be turned off.

C. Security

The issues of privacy and data integrity pose significant challenges to the deployment and functioning of NetProfiler. These issues are arguably of less concern in a controlled environment such as an enterprise.

Users may not want to divulge their identity, or even their IP address, when reporting performance. To help protect their privacy, we could give clients the option of identifying themselves at a coarse granularity that they are comfortable with (e.g., at the ISP level), but that still enables interesting analyses. Furthermore, anonymous communication techniques (e.g., [13]), which hide whether the sending node actually originated a message or is merely forwarding it, could be used to prevent exposure through direct communication. However, if performance reports were stripped of all client-identifying information, we would only be able to perform very limited analyses and inference (e.g., we might only be able to infer website-wide problems that affect most or all clients).

There is also the related issue of data integrity — an attacker could spoof performance reports and/or corrupt the aggregation procedure. In general, guaranteeing data integrity would require sacrificing privacy (e.g., [12]). However, in view of the likely usage of NetProfiler as an advisory tool, we believe that it would probably be acceptable to have a reasonable assurance of data integrity, even if not iron-clad guarantees. For instance, the problem of spoofing can be alleviated by insisting on a two-way handshake before accepting a performance report. The threat of data corruption can be mitigated by aggregating performance reports along multiple hierarchies and employing some form of majority voting when there is disagreement.
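The voting step might look like the sketch below, where the same entity's failure rate is computed independently along several hierarchies and near-equal estimates are bucketed so that they vote together; the bucketing tolerance is our own device, not part of the paper's design.

```python
# Sketch of majority voting across redundant aggregation hierarchies
# (illustrative). A corrupted aggregator shows up as an outlier view
# and is outvoted by the consistent majority.
from collections import Counter
from typing import Optional, Sequence

def majority_view(views: Sequence[float],
                  tolerance: float = 0.05) -> Optional[float]:
    # views: one failure-rate estimate per hierarchy for a given entity.
    buckets = Counter(round(v / tolerance) for v in views)
    bucket, votes = buckets.most_common(1)[0]
    if votes <= len(views) // 2:
        return None    # no majority: treat the aggregate as suspect
    return bucket * tolerance

# e.g. majority_view([0.21, 0.20, 0.92]) == 0.2: the outlier view,
# perhaps from a corrupted aggregator, is outvoted.
```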
V. RELATED WORK

In this section, we briefly survey existing tools and techniques for network monitoring and diagnosis, and contrast them with NetProfiler.

Several tools have been developed for performing connectivity diagnosis from an end host (e.g., ping, traceroute, pathchar [7], tulip [11]). While these tools are clearly useful, they have some limitations, including dependence on active probing of routers (which may be expensive and also infeasible in many cases), and a focus on just the IP-level path and the
view from a single host. In contrast, NetProfiler depends on passive observation of existing end-to-end transactions, and correlates information gathered from multiple vantage points to diagnose problems.

Network tomography techniques [5] leverage information from multiple IP-level paths to infer network health. However, tomography techniques are based on the analysis of fine-grained packet-level correlations, and therefore have typically involved active probing. Also, the focus is on a server-based, "tree" view of the network, whereas NetProfiler focuses on a client-based "mesh" view.

PlanetSeer [20] is a system to locate Internet faults by selectively invoking traceroutes from multiple vantage points. It is a server-based system (unlike NetProfiler), so the direction of traceroutes matches the dominant direction of data flow. PlanetSeer differs from NetProfiler in terms of its dependence on active probing and its focus on just the IP-level path.

Tools such as NetFlow [8] and Route Explorer [1] enable network administrators to passively monitor network elements such as routers. However, these tools do not directly provide information on the end-to-end health of the network.

SPAND [14] is a tool for sharing performance information among end hosts belonging to a single subnet or site. The performance reports are stored in a central database and are used by end hosts for performance prediction and mirror selection. NetProfiler differs from SPAND in several ways, including its focus on fault diagnosis rather than performance prediction and its use of a P2P approach that encompasses nodes beyond the local subnet or site.

Several systems have been developed for distributed monitoring, aggregation, and querying on the Internet. Examples include Ganglia [2], Slicestat [3], IrisNet [9], PIER [10], Sophia [18], SDIMS [19], and Astrolabe [15]. NetProfiler could in principle leverage these systems for data aggregation, albeit with relaxed consistency and timeliness requirements. The primary focus of our work is on leveraging end-host observations to diagnose network problems rather than on developing a new data aggregation system.

The Knowledge Plane proposal [6] shares NetProfiler's goal of enabling users to diagnose network problems. But it is more ambitious in that the knowledge plane is envisaged as encompassing not only the end users' network experience but also network configuration and policy information. In contrast, NetProfiler is designed to be deployable on today's Internet with only the cooperation of (a subset of) end hosts.

Finally, like NetProfiler, STRIDER [17] and PeerPressure [16] also leverage information from peers to do cross-machine troubleshooting of configuration problems, by comparing the configuration settings of a sick machine with those of a healthy machine. NetProfiler is different in that it explicitly deals with information on specific problems (e.g., DNS lookup failures for a particular server) rather than "black-box" configuration information. Also, given its focus on wide-area network troubleshooting, NetProfiler requires the participation of a larger number of peers in a diverse set of network locations.

VI. CONCLUSION

We have presented NetProfiler, a P2P system to enable monitoring and diagnosis of network problems. Unlike in many previous P2P applications, the participation of peers is fundamental to the operation of NetProfiler. The results from an initial 4-week experiment indicate the promise of the proposed approach. We believe that the capabilities provided by NetProfiler can benefit both end users and network operators, such as consumer ISPs and enterprise IT departments. In ongoing work, we are also exploring the use of end-host observations to detect large-scale surreptitious communication such as might precede a DDoS attack.

ACKNOWLEDGEMENTS

We thank our colleagues at the Microsoft Research locations worldwide, MSN, and PlanetLab for giving us access to a distributed collection of client hosts for our experiments. We also thank Sharad Agarwal for his comments on an earlier draft.

REFERENCES

[1] [Link]
[2] [Link]
[3] [Link]
[4] Keynote Internet Health Report. [Link]
[5] R. Caceres, N. Duffield, J. Horowitz, and D. Towsley. Multicast-based Inference of Network-internal Loss Characteristics. IEEE Transactions on Information Theory, November 1999.
[6] D. Clark, C. Partridge, J. Ramming, and J. Wroclawski. A Knowledge Plane for the Internet. In SIGCOMM, August 2003.
[7] A. B. Downey. Using pathchar to Estimate Internet Link Characteristics. In SIGCOMM, 1999.
[8] A. Feldmann, A. Greenberg, C. Lund, N. Reingold, J. Rexford, and F. True. Deriving Traffic Demands for Operational IP Networks: Methodology and Experience. In SIGCOMM, 2001.
[9] P. B. Gibbons, B. Karp, Y. Ke, S. Nath, and S. Seshan. IrisNet: An Architecture for a Worldwide Sensor Web. IEEE Pervasive Computing, October 2003.
[10] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In VLDB, 2003.
[11] R. Mahajan, N. Spring, D. Wetherall, and T. Anderson. User-level Internet Path Diagnosis. In SOSP, 2003.
[12] B. Przydatek, D. Song, and A. Perrig. SIA: Secure Information Aggregation in Sensor Networks. 2003.
[13] M. K. Reiter and A. D. Rubin. Crowds: Anonymity for Web Transactions. ACM Transactions on Information and System Security, 1(1):66–92, 1998.
[14] S. Seshan, M. Stemm, and R. H. Katz. SPAND: Shared Passive Network Performance Discovery. In USITS, 1997.
[15] R. van Renesse, K. Birman, and W. Vogels. Astrolabe: A Robust and Scalable Technology for Distributed System Monitoring, Management, and Data Mining. ACM Transactions on Computer Systems, May 2003.
[16] H. Wang, J. Platt, Y. Chen, R. Zhang, and Y. Wang. Automatic Misconfiguration Troubleshooting with PeerPressure. In OSDI, December 2004.
[17] Y. Wang, C. Verbowski, J. Dunagan, Y. Chen, Y. Chun, H. Wang, and Z. Zhang. STRIDER: A Black-box, State-based Approach to Change and Configuration Management and Support. In USENIX LISA, October 2003.
[18] M. Wawrzoniak, L. Peterson, and T. Roscoe. Sophia: An Information Plane for Networked Systems. In HotNets, November 2003.
[19] P. Yalagandula and M. Dahlin. A Scalable Distributed Information Management System. In SIGCOMM, 2004.
[20] M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang. PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services. In OSDI, 2004.
[21] Y. Zhang, L. Breslau, V. Paxson, and S. Shenker. On the Characteristics and Origins of Internet Flow Rates. In SIGCOMM, 2002.