Real-Time Network Monitoring Data Processing
Abstract—The proper operation and maintenance of a network requires a reliable and efficient monitoring mechanism. The mechanism should handle the large amounts of monitoring data generated by different protocols. In addition, the requirements (e.g. response time, accuracy) imposed by long-term planned queries and short-term ad-hoc queries should be satisfied for multi-tenant computing models.
This paper proposes a novel mechanism for scalable storage and real-time processing of monitoring data. This mechanism takes advantage of a data-intensive framework for collecting network flow information records, as well as data point indexes. The design is not limited to a particular monitoring protocol, since it employs a generic structure for data handling. Thus, it is applicable to a wide variety of monitoring solutions.

I. INTRODUCTION

Monitoring and measurement of the network is a crucial part of infrastructure operation and maintenance. A good understanding of the traffic passing through the network is required for both planned and ad-hoc tasks. Capacity planning and traffic matrix processing are planned tasks, whereas traffic engineering, load-balancing, and intrusion detection are ad-hoc tasks which often require real-time behaviour.
1) Storage Requirements: Quite often, ad-hoc tools are used for analysing network properties [1]. Traffic dumps and flow information are common data types for an ad-hoc analysis. The data volume for these types can be extremely large.
2) Analytic Requirements: The storage should be distributed, reliable, and efficient, to handle a high data input rate and volume. Processing this large data set for an ad-hoc query should be near real-time. It should be possible to divide and distribute the query over the cluster storing the data.
3) Privacy Policies: Storing the packet payload, which corresponds to the user data, is restricted according to European data laws and regulations [2]. The same policy applies to the flow information as well.

A. Related Work

Li et al. [3] surveyed the state of the art in flow information applications. They identified several challenges in fields such as: machine learning's feature selection for an effective analysis, real-time processing, and efficient storage of data sets. Lee et al. [4] proposed a mechanism for importing network dumps (i.e. libpcap files) and flow information to HDFS. They have implemented a set of statistical tools in MapReduce for processing libpcap files in HDFS. The tool set calculates statistical properties of the IP, TCP, and HTTP protocols. Their solution copies recently collected NetFlow data to Hive tables in fixed intervals, which doubles the storage capacity requirement. Andersen et al. [5] described the management of network monitoring datasets as a challenging task. They emphasized the demand for a data management framework with the eventual consistency property and the real-time processing capability. The framework should facilitate search and discovery by means of an effective query definition and execution process. Balakrishnan et al. [6] and Cranor et al. [1] proposed solutions for the real-time analysis of network data streams. However, they may not be efficient for the analysis of high-speed streams over a long period [5].

B. Contributions

A flexible and efficient mechanism is designed and implemented for real-time storage and analysis of network flow information. In contrast to other solutions, which have analysed binary files on distributed storage systems, a NoSQL type of data store provides real-time access to a flexible data model. The data model flexibility makes it compatible with different monitoring protocols. Moreover, the structure leads to fast scanning of a small part of a large dataset. This property provides low latency responses which facilitate exploratory and ad-hoc queries for researchers and administrators. The solution provides a processing mechanism which is about 4000 times faster than the traditional one.
The study concentrates on flow information records, due to regulatory and practical limitations such as privacy directives and payload encryption. However, one can leverage the same solution for handling wider and richer datasets which contain application layer fields. This study is a part of our tenant-aware network monitoring solution for the cloud model.
The rest of the paper is organized as follows: Section II explains the background information about data-intensive processing frameworks and network monitoring approaches. Section III describes the Norwegian NREN backbone network as a case study. Dataset characteristics and monitoring requirements of a production network are explained in this section. Section IV introduces our approach toward solving data processing challenges for network monitoring. Section V discusses technical details of the implementation as well as performance tunings for improving the efficiency. Section VI evaluates the solution by performing common queries, and Section VII concludes the paper and introduces future work.
II. BACKGROUND

A. Framework for Data-Intensive Distributed Applications

Using commodity hardware for storing and processing large sets of data is becoming very common [7]. There are multiple proprietary and open-source frameworks and commercial services providing similar functionality, such as: Apache's Hadoop1 [8] and related projects, Google's File System (GFS) [9], BigTable [10], Microsoft's Scope [11], and Dryad [12]. In the following, the required components for the analysis and storage of our dataset are explained.
1) File System (Hadoop Distributed FS): The first building block of our solution, for handling network monitoring data, is a proper file system. The chosen file system must be reliable, distributed, and efficient for large data sets. Several file systems can fulfil these requirements, such as the Hadoop Distributed File System (HDFS) [8], MooseFS2, GlusterFS3, Lustre [13], and the Parallel Virtual File System (PVFS) [14]. Despite the variety, most of these file systems are missing an integrated processing framework; HDFS is the exception. This capability makes HDFS a good choice as the underlying storage solution.
2) Data Store (HBase): Network monitoring data and packet header information are semi-structured data. In a short period after their generation, they are accessed frequently, and a variety of information may be extracted from them. Apache HBase4 [15] is the most suitable non-relational data store for this specific use-case. HBase is an open-source implementation of a column-oriented distributed data store inspired by Google's BigTable [10], which can leverage Apache's MapReduce processing framework. Data access in HBase is key-based. It means a specific key, or a part of it, can be used to retrieve a cell (i.e. a record) or a range of cells [15]. As a database system, HBase guarantees consistency and partition tolerance from the CAP theorem [16] (aka. Brewer's theorem).
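To make the key-based access model concrete, the following is a minimal sketch using the classic HBase Java client API (HTable, Get, and Scan); the table and key values are illustrative assumptions, not taken from our implementation.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class KeyBasedAccess {
      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          HTable table = new HTable(conf, "flows");        // hypothetical table name

          // Point lookup: retrieve the cells stored under one exact rowkey.
          Get get = new Get(Bytes.toBytes("example-rowkey"));
          Result row = table.get(get);

          // Range scan: retrieve all rows whose keys fall lexicographically
          // between a start key (inclusive) and a stop key (exclusive).
          Scan scan = new Scan(Bytes.toBytes("key-a"), Bytes.toBytes("key-b"));
          ResultScanner scanner = table.getScanner(scan);
          for (Result r : scanner) {
              // process r ...
          }
          scanner.close();
          table.close();
      }
  }

The point lookup and the range scan are the two access patterns the schema design in Section IV is built around.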
3) Processing Framework (Hadoop MapReduce): Processing large data sets has demanding requirements. The processing framework should be able to partition the data across a large number of machines, and expose computational facilities for these partitions. The framework should provide the abstraction for parallel processing of data partitions and tolerate machine failures. MapReduce [17] is a programming model with these specifications. Hadoop is an open source implementation by the Apache Software Foundation, which will be used in our study.
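As an illustration of the programming model, the sketch below counts flow records per source IP address with one map and one reduce phase. It is a self-contained toy example over plain text input (one record per line, source IP in the first whitespace-separated field), written against the Hadoop 2.x MapReduce API; it is not the collection pipeline described later in Section V.

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class FlowsPerSourceIP {

      // Emits (sourceIP, 1) for every input line.
      public static class FlowMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
          private static final LongWritable ONE = new LongWritable(1);
          @Override
          protected void map(LongWritable key, Text line, Context ctx)
                  throws IOException, InterruptedException {
              String[] fields = line.toString().split("\\s+");
              if (fields.length > 0 && !fields[0].isEmpty()) {
                  ctx.write(new Text(fields[0]), ONE);
              }
          }
      }

      // Sums the counts for each source IP.
      public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
          @Override
          protected void reduce(Text ip, Iterable<LongWritable> counts, Context ctx)
                  throws IOException, InterruptedException {
              long sum = 0;
              for (LongWritable c : counts) sum += c.get();
              ctx.write(ip, new LongWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "flows-per-source-ip");
          job.setJarByClass(FlowsPerSourceIP.class);
          job.setMapperClass(FlowMapper.class);
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(LongWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }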
B. Network Monitoring

This study focuses on the monitoring of backbone networks. The observation can be instrumented using Simple Network Management Protocol (SNMP) metrics, flow information (i.e. packet headers), and packet payloads. SNMP does not deliver the granularity demanded by our use-case; also, storing packet payloads from a high capacity network is not feasible, because of both scalability issues [1] and privacy policies [2]. Thus, we are more interested in the packet header and IP flow information. An IP flow is a set of packets passing through a network between two endpoints, and matching a certain set of criteria, such as one or more identical header fields [18]. In our study, a flow is a canonical five-tuple: source IP, source port, destination IP, destination port, and protocol. Flow information is flushed out of the network device after 15 seconds of inactivity, 30 minutes of persistent activity, TCP session termination, or when the flow buffer in the device is full. This makes the start and end time of a flow imprecise [19]. IP flow information is an efficient data source for the real-time analysis of network traffic.
IP flow information can be exported using different protocols, in different formats. NetFlow [20], sFlow [21], and IP Flow Information Export (IPFIX) [18] are designed to handle network monitoring data. Collected data have a variety of use-cases. They can be used for security purposes, audit, accountability, billing, traffic engineering, capacity planning, etc.

C. Testing Environment

We have implemented, optimized, and tested our suggested solution. The testing environment consists of 19 nodes, which deliver the Hadoop, HDFS, HBase, ZooKeeper, and Hive services. The configuration for these nodes is as follows: 6-core AMD Opteron(tm) Processor 4180, 4x 8 GB DDR3 RAM, 2x 3 TB disks, 2x Gigabit NIC.

III. CASE STUDY: NORWEGIAN NATIONAL RESEARCH AND EDUCATION NETWORK (NREN)

This study focuses on the storage and processing of IP flow information data for the Norwegian NREN backbone network. Two core routers, TRD GW 1 (in Trondheim) and OSLO GW (in Oslo), are configured to export flow information. Flow information is collected using NetFlow [20] and sFlow [21].

A. Data Volume

Flow information is exported from networking devices at different intervals or events (e.g. 15 seconds of inactivity, 30 minutes of activity, TCP termination flag, cache exhaustion). The data are collected at observation points, and then the anonymized data are stored for experiments. Crypto-PAn [22] is used for the data anonymization. The mapping between the original and anonymized IP addresses is "one-to-one", "consistent across traces", and "preserves prefix".
Flow information is generated by processing a sampled set of packets. Although sampled data is not as accurate as non-sampled data, studies showed that it can be used efficiently for network operation and anomaly detection, by means of the right methods [23], [24].

1 http://hadoop.apache.org/
2 http://www.moosefs.org/
3 http://www.gluster.org/
4 http://hbase.apache.org/
There needs to be a basic understanding of the dataset for designing the proper data store. Data characteristics, common queries, and their acceptable response times are influential factors in the schema design. The identifier for accessing the data can be based on one or more fields from the flow information record (e.g. source or destination IP addresses, ports, Autonomous Systems (AS), MACs, VLANs, interfaces, etc.). Figure 1 depicts the number of unique source and destination IP addresses, unique source IP:source port and destination IP:destination port tuples, unique bidirectional flows (biflows), and flow information records per day for TRD GW 1 in a 5 month period. The summary of numeric values for TRD GW 1 and OSLO GW is presented in Table I.

TABLE I: Traffic Characteristics (statistics per day)

  Traffic Type                            Avg        Max         Min
  Distinct Source IPs                     987104     4740760     122266
  Distinct Source IPs and Source Ports    6083640    13188647    844898

Fig. 1: Number of distinct source IPs, source IP:source port pairs, destination IPs, destination IP:destination port pairs, bidirectional flows, and raw NetFlow records collected from Trondheim gateway 1 (November 2012 to April 2013).

The average number of flow information records for both routers is 22 million per day, which corresponds to 60 GB of data in binary form. However, this number can become much bigger if flow information is collected from more sources and the sampling rate is increased.

B. Data Access Methods

Monitoring data can be accessed for different purposes such as: billing information, traffic engineering, security monitoring, and forensics. These purposes correspond to a big set of possible queries. The schema can be designed such that it performs very well for one group of queries. That may lead to a longer execution time for the other query groups. Our main goal is reaching the shortest execution time for security monitoring and forensics queries. Three types of queries are studied: IP based, which requires fast IP address lookups (e.g. specific IPs or subnets); Port based, which requires fast port lookups (e.g. specific services); and Time based, which requires fast lookups on a time period.
Network monitoring data and packet header information are semi-structured data. They have arbitrary lengths and a various number of fields. Storing this type of data as binary files in a distributed file system is challenging. The next section discusses several storage schemas and their applicability to the desired access methods.

IV. SOLUTION

Two major stages in the life cycle of the monitoring data can be considered: short-term processing, and long-term archiving.
• Short-term processing: when collected monitoring data are imported into the data store, several jobs should be executed in real-time. These jobs generate real-time network statistics, check for anomaly patterns and routing issues, aggregate data based on desired criteria, etc.
• Long-term archiving: archived data can be accessed for security forensics, or on-demand statistical analysis.

A. Choice of Technologies

Apache HBase satisfies our requirements (Section I) such as consistency and partition tolerance. Moreover, data staging is affordable through proper configuration of the cache feature, in-memory storage size, in-filesystem storage size, region configuration and pre-splitting for each stage, etc. For instance, short-term data can be stored in regions with large memory storage and an enabled block cache. The block cache should be configured such that the Working Set Size (WSS) fits in memory [25], while long-term archives are more suitable for storage in the filesystem.
Hive5 is an alternative to HBase, but it is not suitable for our application. It does not support binary key-values, and all parameters are stored as strings. This approach demands more storage, and makes the implementation inefficient. While a composite key structure is an important factor for fast data access in the design, it is not supported by Hive. Although Hive provides enhanced query mechanisms for retrieving data, the aforementioned issues make it inapplicable to our purpose.

5 http://hive.apache.org/

B. Design Criteria

A table schema in HBase has three major components: the rowkey, column families, and column structures.
1) Row Key: A rowkey is used for accessing a specific part of data or a sequence of them. It is a byte array which can have a complex structure, such as a combination of several objects. The rowkey structure is one of the most important parts of our study because it has a great impact on the data access time and the storage volume demand. The following are our criteria for designing the rowkey:
• Rowkey Size: The rowkey is one of the fields stored in each cell, and is a part of the cell's coordinates. Thus, it should
be as small as possible, while still efficient for data access.
• Rowkey Length (Variable versus Fixed): Fixed length rowkeys and fields help us to leverage the lexicographically sorted rows in a deterministic way.
• Rowkey Fields' Order (with respect to region load): Records are distributed over regions based on the regions' key boundaries. Regions with high loads can be avoided by a uniform distribution of rowkeys. Thus, the position of each field in the rowkey structure is important. Statistical properties of a field's value domain are determining factors for the field position.
• Rowkey Fields' Order (with respect to query time): The lexicographic order of rowkeys makes queries on the leading field of a rowkey much faster than on the rest. This is the motivation for designing multiple tables with different field orders. Therefore, each table provides fast scanning functionality for a specific parameter.
• Rowkey Fields' Type: Fields of a rowkey are converted to byte arrays and then concatenated to create the rowkey. The fields' types have a significant effect on the byte array size. As an example, the number 32000 can be represented as a short data type or as a string. However, the string representation requires more than twice as many bytes.
• Rowkey Timestamps vs. Cell Versions: It is not recommended to set the maximum number of permitted versions too high [25]. Thus, there should be a timestamp for the monitoring record as a part of the rowkey.
• Timestamps vs. Reverse Timestamps: In the first stage of the data life cycle, recent records are frequently accessed. Therefore, reverse timestamps are used in the rowkey. A sketch of such a composite rowkey is shown below.
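The criteria above can be made concrete with a small sketch that builds a T1-style composite rowkey (source IP, source port, destination IP, destination port, reverse timestamp) and scans on its leading field. This is an illustrative reconstruction, not the code of our implementation: the field widths, table handling, and helper names are assumptions, and the widths shown do not necessarily reproduce the exact 23-byte layout used in the paper.

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class FlowRowKeys {

      /** Builds a fixed-length [sa][sp][da][dp][reverse-ts] key from a five-tuple. */
      public static byte[] buildT1Key(int srcIp, short srcPort,
                                      int dstIp, short dstPort, long millis) {
          long reverseTs = Long.MAX_VALUE - millis;        // newest records sort first
          return Bytes.add(
                  Bytes.add(Bytes.toBytes(srcIp), Bytes.toBytes(srcPort)),
                  Bytes.add(Bytes.toBytes(dstIp), Bytes.toBytes(dstPort)),
                  Bytes.toBytes(reverseTs));
      }

      /** Scans every record whose leading field (source IP) equals the given address. */
      public static void scanBySourceIp(HTable table, int srcIp) throws IOException {
          byte[] start = Bytes.toBytes(srcIp);
          byte[] stop  = Bytes.toBytes(srcIp + 1);   // exclusive bound on the 4-byte prefix
                                                     // (wrap-around at 255.255.255.255 ignored)
          ResultScanner scanner = table.getScanner(new Scan(start, stop));
          for (Result r : scanner) {
              // rows arrive sorted by [sp][da][dp][reverse-ts] within this source IP
          }
          scanner.close();
      }
  }

Because the key is a plain byte concatenation, the same helper can populate the other tables simply by permuting the field order before concatenation.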
2) Column Families: Column families are the fixed part of a table which must be defined while creating the schema. It is recommended to keep the number of families below three, and those in the same table should have similar access patterns and size characteristics (e.g. number of rows) [15]. A column family's name must be of string type, with a short length. The family's name is also stored in each cell, as a part of the cell's coordinates. A table must have at least one column family, but it can have a dummy column with an empty byte array. We have used the constant value D for our single column family across all tables.
3) Columns: Columns are the dynamic part of a table structure. Each row can have its own set of columns, which may not be identical to other rows' columns. Monitoring data can be generated by different protocols, and they may not have similar formats/fields. Columns make the solution flexible and applicable to a variety of monitoring protocols.
There are several tables with different field orders in their rowkeys, but not all of them have columns. The complete monitoring record is inserted only into the reference table, and the others are used for fast queries on different rowkey fields. A minimal example of this single-family, dynamic-column layout is sketched below.
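The sketch below, written against the classic HBase admin and client APIs, creates the reference table with the single column family D and inserts one record whose column qualifiers are derived from whatever fields the exporting protocol happens to provide. The table name, qualifier names, and values are illustrative assumptions.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ReferenceTableExample {
      private static final byte[] FAMILY = Bytes.toBytes("D");   // single constant column family

      public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();

          // Create the reference table (T1) with its one column family.
          HBaseAdmin admin = new HBaseAdmin(conf);
          HTableDescriptor desc = new HTableDescriptor("t1");    // hypothetical table name
          desc.addFamily(new HColumnDescriptor(FAMILY));
          admin.createTable(desc);
          admin.close();

          // Insert one flow record: qualifiers mirror the protocol's field names,
          // so records from different protocols can carry different column sets.
          HTable t1 = new HTable(conf, "t1");
          Put put = new Put(rowkey());                            // composite rowkey, e.g. from buildT1Key()
          put.add(FAMILY, Bytes.toBytes("in_bytes"), Bytes.toBytes(123456L));
          put.add(FAMILY, Bytes.toBytes("in_pkts"), Bytes.toBytes(120L));
          put.add(FAMILY, Bytes.toBytes("tcp_flags"), new byte[] { 0x1b });
          t1.put(put);
          t1.close();
      }

      private static byte[] rowkey() {
          return Bytes.toBytes("placeholder-key");                // stands in for the real composite key
      }
  }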
C. Schemas

Section III-B explained the desired query types, and Section IV-B1 described the required properties of a rowkey for fast scanning. Here, three table types are introduced, each addressing one query category: IP-based, Port-based, and Time-based tables.
1) IP Based Tables:
a) T1 (reference table), T2: The rowkey of this table consists of: source IP address, source port, destination IP address, destination port, and reverse timestamp (Table II). Columns in this family are flexible, and any given set can be stored there. Column qualifier identifiers are derived from field names, and their values are the corresponding values from the flow information records. The other tables are designed as secondary indexes. They improve access time considerably for the corresponding query group. Table T1 is used for retrieving flow information parameters that are not in the rowkey (e.g. number of sent or received packets, bytes, flows).
Table T2 has the destination address and port in the lead position. It is used in combination with T1 for the analysis of bidirectional flows.
b) T3, T4: These tables are suitable when source and destination addresses are provided by the query (Table II), for instance when the two ends of a communication are known, and we want to analyse other parameters such as: communication ports, traffic volume, duration, etc.
2) Port Based Tables:
a) T5, T6: These are appropriate tables for service discovery (Table II). As an example, when we want to discover all nodes delivering the SSH service (on default port 22), we can specify the lead fields on T5 and T6 (source and destination ports), and let the data store return all service providers and their clients. If the client c1 is communicating on the port p1 with the server s1 on the port 22 at time ts, then there is a record with the rowkey [22][s1][c1][p1][1-ts] in the data store.
b) T7, T8: These tables fulfil the requirement for identifying clients who use a particular service (Table II). The same record from T5, T6 will have the rowkey [22][c1][s1][p1][1-ts].
3) Time Based Tables: OpenTSDB6 is used for storing time series data. This can be an efficient approach for accessing and processing flows of a specific time period. A rowkey in OpenTSDB consists of: a metric, a base timestamp, and a limited number of tags in the key-value format. Source and destination IP addresses and ports are represented as tags, and a set of metrics is defined. Five fields from the flow information record are chosen as metrics: number of input and output bytes, input and output packets, and flows.

D. Storage Requirement

The storage volume required for storing a single replication of a non-compressed record can be estimated using Equation (1), as depicted in Table III. However, this estimation may vary considerably if protocols other than NetFlow v5 and sFlow are used for collecting monitoring data (e.g. an IPFIX raw record can be 250 bytes, containing 127-300 fields).
Equation (2) is used for calculating the required capacity for tables T2-T8 (see Table III). These tables do not have columns and values, which makes them much smaller than table T1.

6 www.opentsdb.net
TABLE II: IP Based and Port Based Tables

  Table   Row Key                        Query Type
  T1      [sa] [sp] [da] [dp] [1 - ts]   Extended queries
  T2      [da] [dp] [sa] [sp] [1 - ts]
  T3      [sa] [da] [sp] [dp] [1 - ts]   Source-Destination address queries
  T4      [da] [sa] [dp] [sp] [1 - ts]   Source-Destination address queries
  T5      [sp] [sa] [da] [dp] [1 - ts]   Service server discovery queries
  T6      [dp] [da] [sa] [sp] [1 - ts]   Service server discovery queries
  T7      [sp] [da] [sa] [dp] [1 - ts]   Service client discovery queries
  T8      [dp] [sa] [da] [sp] [1 - ts]   Service client discovery queries

  |record_{T1}| = |cq| \cdot (|rk| + |cfn| + |cn|) + \sum_{i \in cq} |cv_i|    (1)

  |record_{T2-T8}| = |rk| + |cfn|    (2)

where:
  |x| = x's size in byte(s)
  rk  = row key (size = 23 B)
  cfn = column family name
  cq  = set of column qualifiers
  cn  = column qualifier name
  cv  = column value

TABLE III: Storage requirements (IPv4)

                   Est. # records                Storage for T1                 Storage for T2-T8           Storage for OpenTSDB        Total
  Single Record    1                             (37 * 23 B) + 133 B ~ 1 KB     7 tables * 23 B = 161 B     5 metrics * 2 B = 10 B      ~ 1 KB
  Daily Import     ~ 20 million                  1 KB * 20*10^6 = 20 GB         161 B * 20*10^6 ~ 3 GB      10 B * 20*10^6 = 200 MB     ~ 23 GB
  Initial Import   20 M * 150 days ~ 3*10^9      1 KB * 3*10^9 = 3 TB           161 B * 3*10^9 ~ 500 GB     10 B * 3*10^9 = 30 GB       ~ 3.5 TB
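Reading the single-record row of Table III back through Equations (1) and (2) makes the estimate explicit. Under the table's approximation, T1 stores 37 column qualifiers per record, roughly 23 B of key material per cell, and 133 B of column values in total, while each of the seven index tables stores only the key material; this interpretation of the 37 and 133 B figures is inferred from the table entry.

  % Single-record storage, following Table III:
  \[
    |record_{T1}| \approx 37 \times 23\,\mathrm{B} + 133\,\mathrm{B} = 984\,\mathrm{B} \approx 1\,\mathrm{KB}
  \]
  \[
    |record_{T2\text{-}T8}| \approx 7 \times 23\,\mathrm{B} = 161\,\mathrm{B}
  \]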
V. IMPLEMENTATION

A. Data Collection

A set of MapReduce jobs and scripts are developed for collecting, storing, and processing data in HBase and OpenTSDB7. In the MapReduce job, the map tasks read flow information files and prepare the rowkeys as well as the columns for all tables. In the next step they are written into the corresponding tables. After that, another task checks data integrity with a simple row counting job. This verification is not fully reliable, but it is a basic step for the integrity check without sacrificing performance.
Performance evaluation was performed by processing the records of a single day. The day is chosen randomly from the working days of 2013. The statistical characteristics of the chosen day represent the properties of any other working day. The performance of the implementation is not satisfactory at this stage. For HBase, the maximum number of operations per second is 50, with a maximum operation latency of 2.3 seconds. HDFS shows the same performance issue: the maximum number of written bytes per second is 81 MB/s. The task is finished after 45.46 minutes. Therefore, performance tuning is required.

7 Available at: https://github.com/aryantaheri/netflow-hbase
B. Performance Tuning

The initial implementation of the collection module was not optimized for storing large datasets. By investigating performance issues, seven steps are recognized as remedies [26], [25]. These improvements also enhance the query execution process, and are applied there as well.
a) Using LZO compression: Although compression demands more CPU time, the HDFS I/O and network utilization are reduced considerably. Compression is applied to store files (HFiles), and the algorithm must be specified in the table schema for each column family. The compression ratio is dependent on the algorithm and the data type; for our dataset with the LZO algorithm the ratio is about 4.
b) Disabling Swap: Swappiness is set to zero on data nodes, since there is enough free memory for the job to complete without moving memory pages to the swap [26].
c) Disabling the Write Ahead Log (WAL): All updates in a region server are logged in the WAL, to guarantee durable writes. However, the write operation performance is improved significantly by disabling it. This has the risk of data loss in case of a region server failure [25].
d) Enabling Deferred Log Flush (DLF): DLF is a table property for deferring WAL flushes. If the WAL is not disabled (due to the data loss risk), this property can specify the flushing interval to moderate the WAL's overhead [25].
e) Increasing the heap size: 20 TB of the disk storage is planned to be used for storing monitoring data. The formula for calculating the estimated ratio of disk space to heap size is: RegionSize / MemstoreSize * ReplicationFactor * HeapFractionForMemstores [27]. This leads to a heap size of 10 GB per region server.
f) Specifying Concurrent-Mark-Sweep Garbage Collection (CMS-GC): Full garbage collection has a tremendous overhead, and it can be avoided by starting the CMS process earlier. The initial occupancy fraction is explicitly specified to be 70 percent. Thus, CMS starts when the old generation allocates more than 70 percent of the heap size [26].
g) Enabling MemStore-Local Allocation Buffers (MSLAB): MSLAB relaxes the issue with old generation heap fragmentation for HBase, and makes garbage collection pauses shorter. Furthermore, it can improve cache locality by allocating memory for a region from a dedicated memory area [28].
h) Pre-Splitting Regions: The pre-splitting of regions has a major impact on the performance of bulk load operations. It can rectify the hotspot region issue and distribute the work load among all region servers. Each region has a start and an end rowkey, and only serves a consecutive subset of the dataset. The start and end rowkeys should be defined such that all regions will have a uniform load. Pre-splitting requires a good knowledge of the rowkey structure and its value domain.
Tables T1-T4 start with an IP address, and T5-T8 have a port number in the lead position. Thus, they demand different splitting criteria. The initial splitting uses a uniform distribution function, and it is later improved by an empirical study. The IPv4 space has 2^32 addresses, and the address space is split uniformly over 15 regions, as shown in Table IV. Furthermore, the port number is a 16 bit field with 65536 possible values, and the same splitting strategy is applied to it (Table V).

TABLE IV: Initial region splits for tables T1-T4 (store file size in MBytes - number of store files)

  Region   Starting IP address   T1      T2      T3      T4
  1                              30-1    0-0     0-0     5-1
  2        17.17.17.17           23-1    0-0     0-0     0-0
  3        34.34.34.34           32-1    6-1     5-1     0-0
  4        51.51.51.51           172-1   22-1    21-1    22-1
  5        68.68.68.68           325-1   57-1    57-1    57-1
  6        85.85.85.85           77-1    11-1    10-1    11-1
  7        102.102.102.102       85-1    9-1     13-1    0-0
  8        119.119.119.119       57-1    11-1    0-0     11-1
  9        136.136.136.136       102-1   11-1    10-1    11-1
  10       153.153.153.153       543-1   92-1    82-1    97-1
  11       170.170.170.170       21-1    0-0     0-0     0-0
  12       187.187.187.187       887-1   138-1   141-1   139-1
  13       204.204.204.204       73-1    11-1    10-1    11-1
  14       221.221.221.221       5-1     0-0     0-0     1-1
  15       238.238.238.238       0-1     0-0     0-0     0-0

TABLE V: Initial region splits for tables T5-T8 (store file size in MBytes - number of store files)

  Region   Starting port number   T5      T6      T7      T8
  1                               197-1   137-1   198-1   137-1
  2        4369                   7-1     0-0     0-0     0-0
  3        8738                   0-0     0-0     0-0     0-0
  4        13107                  0-0     0-0     0-0     0-0
  5        17476                  0-0     0-0     0-0     0-0
  6        21845                  0-0     9-1     8-1     0-0
  7        26214                  0-0     0-0     0-0     0-0
  8        30583                  0-0     0-0     0-0     10-1
  9        34952                  0-0     12-1    0-0     12-1
  10       39321                  0-0     13-1    10-1    12-1
  11       43690                  9-1     12-1    0-0     12-1
  12       48059                  37-1    49-1    38-1    60-1
  13       52428                  25-1    49-1    26-1    50-1
  14       56797                  25-1    37-1    25-1    38-1
  15       61166                  26-1    24-1    25-1    25-1

The performance gain for storing a single day of flow information is considerable. On average, 754 HBase operations are performed per second (30x more operations/s), the average operation latency is decreased to 27 ms (14x faster), and the job finishes in 15 minutes (3x sooner). Despite the high efficiency improvement, there are some hotspot regions which should be investigated further.
Tables IV and V show the regions' start keys, the number of store files, and their sizes. It can be observed that this uniform splitting did not lead to a uniform load across regions.
In tables T1-T4, regions R4, R5, R10, and R12 have big store files compared to the rest of the regions. The highly loaded regions serve entries within the following IP address spaces (anonymized): R4 → [51.51.51.51, 68.68.68.68), R5 → [68.68.68.68, 85.85.85.85), R10 → [153.153.153.153, 170.170.170.170), R12 → [187.187.187.187, 204.204.204.204). By investigating these IP address blocks, we identified that some of them contain Norwegian address blocks8 and some others are popular service providers. In addition, the empty regions contain special ranges such as private networks and link-local addresses.
In tables T5-T8, regions R1, R12, R13, R14, and R15 have high loads, and they serve the following port numbers: R1 → [0, 4369), R12 → [48059, 52428), R13 → [52428, 56797), R14 → [56797, 61166), R15 → [61166, 65536). For tables T5-T8, R1 covers well known ports (both system ports and user ports) suggested by the Internet Assigned Numbers Authority (IANA)9, and R12-R15 contain short-lived ephemeral ports (i.e. dynamic/private ports). In the empirical splitting, the difference between system ports, user ports, and private/dynamic (ephemeral) ports is taken into account.
A large fraction of records have port numbers of popular services (e.g. HTTP(S), SSH) or IP addresses of popular sources/destinations (e.g. Norwegian blocks, popular services). Therefore, regions should not be split using a uniform distribution over the port number range or the IP address space. The splitting is improved by taking these constraints into consideration, and the result is significant. The average number of operations per second is 1600 (64x more), the latency is 5 ms (80x less), and the job duration is reduced to 6.57 minutes (7.5x faster). The results are depicted in Figure 2.

8 http://drift.uninett.no/nett/ip-nett/ipv4-nett.html

VI. EVALUATION

This section analyses several query types and their response times.

A. Top-N Host Pairs

Finding the Top-N elements is a common query type for many datasets. In our dataset, elements can be IP addresses, host pairs, port numbers, etc. In the first evaluation, a query for finding the Top-N host pairs is studied over a 150 day period. These pairs are the hosts which have exchanged the most traffic on the network. The query requires processing all records in table T1, and aggregating the input and output bytes for all available host pairs, independent of the connection initiator and port numbers. Table T1 has 5 billion records.
Traditional tools (e.g. NFdump) are not capable of answering this query, because the long period corresponds to an extremely large dataset. For this purpose, two chained MapReduce jobs are written for the HBase tables. The first one identifies host pairs and aggregates their exchanged traffic. The second one sorts the pairs based on the exchanged traffic. A sketch of the first, aggregation stage is shown below.
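The following is a minimal sketch of what the aggregation stage could look like, written with HBase's TableMapper/TableMapReduceUtil MapReduce integration. It is an illustration under stated assumptions, not the published implementation: the table name, the rowkey offsets of the two addresses, and the column qualifiers holding the byte counters are all assumptions.

  import java.io.IOException;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.hbase.util.Bytes;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Reducer;

  public class TopNHostPairsStage1 {

      static final byte[] FAMILY = Bytes.toBytes("D");            // single column family
      static final byte[] IN_BYTES = Bytes.toBytes("in_bytes");   // assumed qualifier names
      static final byte[] OUT_BYTES = Bytes.toBytes("out_bytes");

      /** Emits (unordered host pair, bytes) for every T1 row the scan delivers. */
      public static class PairMapper extends TableMapper<Text, LongWritable> {
          @Override
          protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                  throws IOException, InterruptedException {
              byte[] rk = key.get();
              // Assumed T1 layout: [srcIP:4][srcPort:2][dstIP:4][dstPort:2][revTs:8]
              int src = Bytes.toInt(rk, key.getOffset());
              int dst = Bytes.toInt(rk, key.getOffset() + 6);
              long bytes = cell(row, IN_BYTES) + cell(row, OUT_BYTES);
              // Normalize the pair so both directions aggregate under the same key.
              String pair = (src <= dst) ? src + ":" + dst : dst + ":" + src;
              ctx.write(new Text(pair), new LongWritable(bytes));
          }
          private static long cell(Result row, byte[] qualifier) {
              byte[] v = row.getValue(FAMILY, qualifier);
              return (v == null) ? 0L : Bytes.toLong(v);
          }
      }

      /** Sums the exchanged traffic for each host pair; the second job sorts these totals. */
      public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
          @Override
          protected void reduce(Text pair, Iterable<LongWritable> vals, Context ctx)
                  throws IOException, InterruptedException {
              long total = 0;
              for (LongWritable v : vals) total += v.get();
              ctx.write(pair, new LongWritable(total));
          }
      }

      public static void configure(Job job) throws IOException {
          Scan scan = new Scan();
          scan.setCaching(1000);        // larger scanner caching for a full-table scan
          scan.setCacheBlocks(false);   // do not pollute the block cache with a one-off scan
          TableMapReduceUtil.initTableMapperJob("t1", scan, PairMapper.class,
                  Text.class, LongWritable.class, job);
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(LongWritable.class);
      }
  }

The second, sorting stage can then read the (pair, total) output, swap key and value, and rely on the shuffle sort to order the pairs by exchanged traffic, keeping only the top N.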
Fig. 2: Storage performance under different implementations (SNS: single day processing without pre-splitting, SS: single day processing with a uniform splitting function, SSE: single day processing with an empirical pre-splitting function). Panels (c) and (d) show the number of operations per second and the operation latency in HBase, respectively.