A New Method For Flow-Based Network Intrusion Detection Using The Inverse Potts Model
Abstract—Network Intrusion Detection Systems (NIDS) play an important role as tools for identifying potential network threats. In the context of ever-increasing traffic volume on computer networks, flow-based NIDS arise as good solutions [...]

arXiv:1910.07266v5 [cs.NI] 11 Mar 2021

[...] classification in real-time, a massive volume of data must be analyzed, making deep packet inspection too costly to be applied regarding processing and energy consumption. Since flow-based approaches can classify the whole traffic [...]

• [...] based on the inverse Potts model to be employed in NIDSs;
• A performance comparison of the proposed classifier with classical ML-based classifiers using three different datasets;
• An analysis of how different classifiers perform when trained within one domain and tested in another related domain.

The rest of this paper is structured as follows. In Section II, we briefly present the state of the art in flow-based NIDSs. In Section III, we describe the structure of network flows, with a preliminary analysis of the datasets considered here. In Section IV, we introduce the proposed statistical model and the classifier implementation. In Section V, we present the results obtained regarding the analysis of the statistical model and the classification experiments performed. Finally, in Section VI, we present our conclusions and future work.

II. RELATED WORK

In this section, we first briefly review the state of the art in flow-based network intrusion detection systems. Next, previous work on the CIDDS-001, CICIDS17, and CICDDoS19 datasets is reviewed to highlight their relevance as up-to-date datasets for our experiments. Finally, some challenges of ML-based traffic classification are discussed, such as the difficulty of obtaining sufficient labeled data, the non-interpretability of models, and the difficulty of adapting to different domains (data distributions).

Several ML-based classifiers have been explored in recent years for network intrusion detection. Vinayakumar et al. (2017) [16], Mahfouz et al. (2020) [17] and Khan et al. (2020) [18] independently evaluated the performance of different ML-based classifiers on internet traffic datasets. In [16], the KDDCup'99 and NSL-KDD datasets are used to evaluate the performance of both shallow and deep learning-based classifiers, and deep learning-based approaches are shown to perform better at differentiating malicious attacks from benign traffic. Meanwhile, the authors of [17] considered the NSL-KDD dataset to compare the performance of different shallow learning-based classifiers. The classifier with the best performance without feature selection was the Decision Tree (DT); with feature selection, K-Nearest Neighbors (KNN) performed better at classifying malicious traffic. Finally, the authors of [18] compared the performance of a few different classifiers on the UNSW-NB15 dataset and observed that Random Forest (RF) outperformed all other classifiers. In fact, RF has been used in several recent NIDSs [19], [20], [21]. All of the aforementioned works use the F1-score as a metric to assess the performance of the different classifiers. In the present work, we consider both deep and shallow learning-based classifiers as baselines to assess EFC's performance on three different datasets, adopting the F1-score as one of the evaluation metrics.

To assess EFC's performance, one of the datasets we use is CIDDS-001. This dataset was used by Verma and Ranga (2018) [22] to assess the performance of KNN and k-means clustering algorithms; both achieved over 99% accuracy. Also, Ring et al. [23] explored slow port scan detection using CIDDS-001; their approach accurately recognizes the attacks with a low false alarm rate. Finally, Abdulhammed et al. [24] also performed flow-based classification on CIDDS-001 and proposed an approach that is robust to imbalanced network traffic. In summary, CIDDS-001 is an updated and relevant dataset for network flow classification solutions, and it is one of our dataset choices for assessing the performance of EFC.

The other two datasets considered here are CICIDS17 and CICDDoS19, from the Canadian Institute for Cybersecurity. Recently, Yulianto, Sukarno, and Suwastika [25] used CICIDS17 to assess the performance of an Adaboost-based classifier, and Aksu et al. [26] did the same in 2018 with different ML classifiers. CICIDS17 contains benign traffic as well as the most up-to-date common attacks, resembling true real-world data, which makes it a relevant dataset for flow-based traffic classification. Meanwhile, CICDDoS19 is a very recent dataset focused on DDoS attacks. The work in [27] proposes a real-time entropy-based NIDS for the detection of volumetric DDoS in the Internet of Things (IoT) and performs tests on the CICDDoS19 dataset, among others. Another recent work [28] obtained over 99% accuracy on CICDDoS19 using a Convolutional Neural Network (CNN). Finally, Novaes et al. [29] proposed an intrusion detection system based on fuzzy logic, whose performance was assessed on CICDDoS19. The rising popularity of this dataset serves as evidence of its relevance for assessing the performance of different NIDSs. Hence, we use the CICIDS17 and CICDDoS19 datasets to test our classifier and compare it to the performance of classical ML classifiers.

Umer et al. (2017) [6] performed a comprehensive literature survey on flow-based network intrusion detection. Their work mentions some disadvantages of using ML-based classifiers for traffic classification, among them the high computational cost of training the classifiers, the difficulty of obtaining representative datasets, and the high false positive rates observed. The present work addresses some of these issues, since the classifier proposed here has a low computational cost and learns exclusively from benign samples. In the following, these and some other issues are discussed in further detail.

One of the most commonly discussed issues in the field of ML is the tight dependency most algorithms have on the amount of labeled samples available for training [7], which might be difficult to obtain in some contexts. For instance, it is difficult to obtain and label malicious traffic samples in the real world, which is why most network intrusion detection datasets contain simulated attacks. This issue makes it difficult to train intrusion detection algorithms in such a way that they are able to detect zero-day threats [7]. The only way of possibly detecting a zero-day attack is to rely on an anomaly-based classifier [30], such as the one we propose in this work. EFC has a great advantage over other ML-based algorithms: the capability to infer a model based solely on benign traffic samples, i.e., half of the information. This capability can be used to circumvent the problem of obtaining large amounts of data and labeling malicious samples.
Another common problem in ML is that inferred models lose their predictive performance when tested in different domains (data distributions) [10]. In the field of network security, this adaptability is especially important given the existence of zero-day threats and the artificiality of most datasets used for research. In [10], there is an interesting discussion about the differences between the datasets used by academics to test NIDSs and the network traffic observed in the real world. Additionally, the works of Bartos et al. [8] and Li et al. [9] also address this issue. They propose similar approaches, applying transformations to the data to reduce differences between data distributions in different domains. In our work, we propose a classifier that is intrinsically adaptable to different domains, since the model inference is based solely on benign samples. Therefore, there is no need to transform the data or perform adjustments to adapt the model to a different domain, making our approach simpler and more straightforward.

Finally, another big issue in ML is the non-interpretability of some models [11], [12]. Artificial Neural Networks (ANNs), in particular, have become more and more opaque with time, despite outperforming other approaches in many tasks. The authors of [12] highlight that the best ML algorithms are not interpretable, hence the decisions taken by them cannot be explained. However, different contexts require transparent decision making, which is why the development of explainable models is so important. The authors of [11] call attention to the fact that trying to explain black-box models might not be the best approach to solve the issue of non-interpretability; instead, they suggest designing new models that are inherently interpretable. In line with what has been suggested by these recent studies, EFC generates a white-box model and therefore satisfies the requirement of providing explainable results, allowing classification results to be analysed in retrospect if needed. Thus, next, we introduce the main concepts and intuitions that serve as basis for EFC.

[...] is the most commonly used format; its main features are listed below:
• Source/Destination IP (flow keys) - determine the origin and destination of a given flow in the network;
• Source/Destination port (flow keys) - characterize different kinds of network services, e.g., the ssh service uses port 22;
• Protocol (flow key) - characterizes flows regarding the transport protocol used, e.g., TCP, UDP, ICMP;
• Number of packets (feature) - total number of packets captured in a flow;
• Number of bytes (feature) - total number of bytes in a flow;
• Duration (feature) - total duration of a flow in seconds;
• Initial timestamp (feature) - system time when the capture of a flow started.
Other features, such as TCP Flags and Type of Service, might also be exported in some cases. The combination of different flow keys and features characterizes one flow and determines its particular behavior.

Flow-based approaches are seen as suitable alternatives to precede packet inspection in real-time NIDSs. The idea is to deeply inspect only the packets belonging to flows considered suspicious by the flow-based classifier. Such a two-step approach would notably reduce the amount of data analyzed while maintaining a high classification accuracy [4]. In this work, we are only concerned with the first step, the flow classification. We evaluate the performance of our algorithm, EFC, compared to other ML algorithms using three different datasets. We also evaluate the performance of the algorithms by training with data from one part of a dataset and testing with other parts of it. Although both parts of the data come from the same dataset, their distributions are different, characterizing domain adaptation. In the following, we briefly describe the datasets used for testing and characterize what constitutes a domain adaptation in each of them.
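To make the flow structure above concrete, a single record combining the flow keys and features listed earlier could be represented as follows. This is an illustrative sketch only: the field names and the `Flow` type are our own, not a NetFlow standard schema or the authors' data format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    src_ip: str        # flow key: origin of the flow
    dst_ip: str        # flow key: destination of the flow
    src_port: int      # flow key: e.g., an ephemeral client port
    dst_port: int      # flow key: e.g., 22 for the ssh service
    protocol: str      # flow key: "TCP", "UDP", "ICMP", ...
    packets: int       # feature: total packets captured in the flow
    bytes: int         # feature: total bytes in the flow
    duration: float    # feature: total duration in seconds
    first_seen: float  # feature: timestamp when capture of the flow started

# one hypothetical ssh flow
flow = Flow("192.168.0.2", "10.0.0.1", 51432, 22, "TCP",
            12, 1840, 0.8, 1614556800.0)
```

The combination of the five key fields identifies the flow, while the remaining fields describe its behavior.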
[...] labels suspicious and unknown, in turn, are used for real traffic. The external server is open to user access through ports 80 and 443. Hence, flows directed at these ports were labeled as unknown, since they could be either benign or malicious, and all flows directed at other ports were labeled as suspicious. Traffic was sampled in both the simulated and the external server environments for a period of four weeks. Within this dataset, a change from the simulated data distribution to the external server data distribution is a domain change, requiring the classifiers to adapt.

CIDDS-001 dataset flow features are shown in Table II. All features were taken into account for characterization and classification except for Src IP, Dest IP and Date first seen. These exceptions are because the latter is intrinsically not informative for differentiating flows, and the former two are made up in the context of the simulated network and might be confounding.

Table II
FEATURES WITHIN CIDDS-001 DATASET

#    Name    Description

Table III
ATTACKS WITHIN CICIDS17 DATASET

Week day     Attacks
Monday       -
Tuesday      FTP-Patator, SSH-Patator
Wednesday    DoS slowloris, DoS Slowhttptest, DoS Hulk, DoS GoldenEye, Heartbleed Port 444
Thursday     Brute Force, XSS, Sql Injection, Dropbox download, Cool disk
Friday       Botnet ARES, Port Scan, DDoS LOIT

[...] contains different modern reflective DDoS attacks, such as PortMap, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, SYN, NTP, DNS, and SNMP. The traffic was captured in January (first day) and March (second day) of 2019. Attacks were executed during this period (see Table IV).

Table IV
ATTACKS WITHIN CICDDoS19 DATASET
[...] Hamiltonian function H(a_k1 ... a_kN), which depends on all spin states.

Figure 1. A) Interacting spins on a crystalline lattice. B) Network flow mapped into a graph structure.

In this work, we reuse the intuitions from the Potts model to characterize network flows (see Figure 1 B)). An individual flow k is represented by a specific graph configuration G_k(η, ε). Instead of spins, each node represents a selected feature i ∈ η = {SrcPort, ..., Flags}. Within a given flow k, each network flow feature i assumes one value a_ki from the set Ω_i that contains all possible values for this feature. As in the Potts model, each feature i has an associated local field h_i(a_ki). Meanwhile, ε = {(i, j) | i, j ∈ η; i ≠ j} is the set of edges determined by all possible pairs of features, creating a fully meshed graph linking different flow samples through their common features. Each edge has an associated coupling value determined by the function e_ij(a_ki, a_kj).

Since the values of local fields and couplings depend on the values assumed by features within a given flow, each distinct flow will have a different combination of these quantities. As in the Potts model, the Hamiltonian involving local fields and couplings determines the total "energy" H(a_k1 ... a_kN) of each flow. For instance, in Figure 1 B), the total "energy" of the flow is obtained by summing up all values associated with the edges and the nodes, resulting in a total of -3. Note that what we call energy is analogous to the notion of Hamiltonian in Quantum Mechanics. It is important to note that the model described here is discrete; therefore, continuous features must be discretized. The classes for continuous feature discretization are shown in Section V. In the following, we present the framework applied to perform the statistical model inference and the subsequent energy-based flow classification.

A. Model inference

In this section, a statistical model is inferred in terms of couplings and local field values to perform energy-based flow classification. The main idea consists in extracting a statistical model from benign flow samples to infer coupling and local field values that characterize this type of traffic. When calculating the energies of unlabeled flows using the inferred values, it is expected that benign flows will have lower energies than malicious flows.

Let (A_1 ... A_N) be an N-tuple of features, which can be instantiated for flow k as (a_k1 ... a_kN), with a_k1 ∈ Ω_1, ..., a_kN ∈ Ω_N. Each feature value a_ki is encoded by an integer from the set Ω = {1, 2, ..., Q}, i.e., all feature alphabets are the same Ω_i = Ω of size Q. If a given feature can only assume M values and M < Q, it is considered that values M+1, ..., Q are possible but will never be observed empirically. For instance, if the only possible values for the feature Protocol are {'TCP', 'UDP'} and Q = 4, we would have the mapping {'TCP': 1, 'UDP': 2, ' ': 3, ' ': 4}, and feature values 3 and 4 would never occur.

Now, let K be the set of all possible flows, i.e., all possible combinations of feature values (K = Ω^N), and let S ⊂ K be a sample of flows. We can use inverse statistical physics to infer a statistical model associating a probability P(a_k1 ... a_kN) to each flow k ∈ K based on the sample S. The global statistical model P is inferred following the Entropy Maximization Principle [36]:

    max_P  − Σ_{k ∈ K} P(a_k1 ... a_kN) log P(a_k1 ... a_kN)    (1)

s.t.

    Σ_{k ∈ K | a_ki = a_i} P(a_k1 ... a_kN) = f_i(a_i),    ∀i ∈ η; ∀a_i ∈ Ω;    (2)

    Σ_{k ∈ K | a_ki = a_i, a_kj = a_j} P(a_k1 ... a_kN) = f_ij(a_i, a_j),    ∀(i, j) ∈ η² | i ≠ j; ∀(a_i, a_j) ∈ Ω²;    (3)

where f_i(a_i) is the empirical frequency of value a_i on feature i and f_ij(a_i, a_j) is the empirical joint frequency of the pair of values (a_i, a_j) of features i and j. Note that constraints (2) and (3) force the model P to reproduce the single and joint empirical frequency counts as marginals. This way, the model is guaranteed to be coherent with the empirical data.

The single and joint empirical frequencies f_i(a_i) and f_ij(a_i, a_j) are obtained from the set S by counting occurrences of a given feature value a_i or feature value pair (a_i, a_j), respectively, and dividing by the total number of flows in S. Since the set S is finite and much smaller than K, inferences based on S are subject to undersampling effects. Following the theoretical framework proposed in [34], we add pseudocounts to the empirical frequencies to limit undersampling effects by performing the following operations:

    f_i(a_i) ← (1 − α) f_i(a_i) + α/Q    (4)

    f_ij(a_i, a_j) ← (1 − α) f_ij(a_i, a_j) + α/Q²    (5)

where (a_i, a_j) ∈ Ω² and 0 ≤ α ≤ 1 is a parameter defining the weight of the pseudocounts. The introduction of pseudocounts is equivalent to assuming that S is extended with a fraction of flows with uniformly sampled features.

The proposed maximization can be solved using a Lagrangian function, as presented in [36], yielding the following Boltzmann-like distribution:

    P*(a_k1 ... a_kN) = e^{−H(a_k1 ... a_kN)} / Z    (6)

where

    H(a_k1 ... a_kN) = − Σ_{i,j | i<j} e_ij(a_ki, a_kj) − Σ_i h_i(a_ki)    (7)
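As a concrete illustration, the pseudocount regularization of eqs. (4) and (5) simply blends the empirical counts with a uniform distribution. A minimal NumPy sketch follows; the function names echo the SiteFreq and PairFreq routines of Algorithm 1 later in the paper, but the signatures and the 0-based encoding (feature values 0, ..., Q−1 instead of 1, ..., Q) are our own assumptions, not the authors' reference code.

```python
import numpy as np

def site_freq(flows, Q, alpha):
    """Single frequencies f_i(a_i) with pseudocounts, eq. (4).
    `flows` is an (S, N) integer array with values in {0, ..., Q-1}."""
    S, N = flows.shape
    f_i = np.zeros((N, Q))
    for i in range(N):
        counts = np.bincount(flows[:, i], minlength=Q)
        f_i[i] = (1 - alpha) * counts / S + alpha / Q
    return f_i

def pair_freq(flows, Q, alpha):
    """Joint frequencies f_ij(a_i, a_j) with pseudocounts, eq. (5)."""
    S, N = flows.shape
    f_ij = np.zeros((N, N, Q, Q))
    for s in range(S):
        for i in range(N):
            for j in range(N):
                f_ij[i, j, flows[s, i], flows[s, j]] += 1
    return (1 - alpha) * f_ij / S + alpha / Q ** 2

# three flows, two features, alphabet size Q = 3
flows = np.array([[0, 1], [0, 2], [1, 1]])
f_i = site_freq(flows, Q=3, alpha=0.5)
```

With α = 0 these reduce to the raw empirical frequencies and with α = 1 to the uniform distribution; in both cases each f_i(·) still sums to one over the alphabet, since (1 − α) + Q·(α/Q) = 1.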
Eq. (7) gives the Hamiltonian of flow k, and Z (eq. (6)) is the partition function that normalizes the distribution. Since in this work we are not interested in obtaining individual flow probabilities, Z is not required and, as a consequence, its calculation is omitted. Our objective is to calculate individual flow energies, i.e., individual Hamiltonians as determined in eq. (7).

Note that the Hamiltonian, as presented above, is fully determined by the Lagrange multipliers e_ij(·) and h_i(·) associated with constraints (2) and (3), respectively. Within the Potts model framework, the Lagrange multipliers have a special meaning, with the set {e_ij(a_i, a_j) | (a_i, a_j) ∈ Ω²} being the set of all possible coupling values between features i and j, and {h_i(a_i) | a_i ∈ Ω} the set of possible local fields associated with feature i.

Inferring the local fields and pairwise couplings is difficult, since the number of parameters exceeds the number of independent constraints. Due to the physical properties of interacting spins, it is possible to infer pairwise coupling values e_ij(a_i, a_j) using a Gaussian approximation. Assuming that the same properties apply to flow features, we infer coupling values as follows:

    e_ij(a_i, a_j) = −(C⁻¹)_ij(a_i, a_j),    ∀(i, j) ∈ η², ∀(a_i, a_j) ∈ Ω², a_i, a_j ≠ Q    (8)

where

    C_ij(a_i, a_j) = f_ij(a_i, a_j) − f_i(a_i) f_j(a_j)    (9)

is the covariance matrix obtained from the single and joint empirical frequencies. Taking the inverse of the covariance matrix is a well-known procedure in statistics to remove the effect of indirect correlations in data [37]. Now, it is important to clarify that the number of independent constraints in eqs. (2) and (3) is actually (N(N−1)/2)(Q−1)² + N(Q−1), even though the model in eq. (6) has (N(N−1)/2)Q² + NQ parameters. So, without loss of generality, we set:

    e_ij(a_i, Q) = e_ij(Q, a_j) = h_i(Q) = 0    (10)

Thus, in eq. (8) there is no need to calculate e_ij(a_i, a_j) in case a_i or a_j is equal to Q [34]. Afterwards, the local fields h_i(a_i) can be inferred using a mean-field approximation [38]:

    f_i(a_i)/f_i(Q) = exp( h_i(a_i) + Σ_{j, a_j} e_ij(a_i, a_j) f_j(a_j) ),    ∀i ∈ η, a_i ∈ Ω, a_i ≠ Q    (11)

where f_i(Q) is the frequency of the last element a_i = Q for any feature i, used for normalization. It is also worth mentioning that the element Q is arbitrarily selected and could be replaced by any other value in {1, ..., Q}, as long as the selected element is kept the same for the calculation of the local fields of every feature i ∈ η. Note that in eq. (11) the empirical single frequencies f_i(a_i) and the coupling values e_ij(a_i, a_j) are known, yielding:

    h_i(a_i) = ln( f_i(a_i)/f_i(Q) ) − Σ_{j, a_j} e_ij(a_i, a_j) f_j(a_j)    (12)

In the mean-field approximation presented above, the interaction of a feature with its neighbors is replaced by an approximate interaction with an averaged feature, yielding an approximated value for the local field associated with it. For further details about these calculations, please refer to [33]. Now that all model parameters are known, it is possible to calculate a given flow's energy according to eq. (7). In the following, we present the implementation of this theoretical framework to perform two-class (i.e., benign and malicious) flow classification: the Energy-based Flow Classification (EFC).

B. Energy-based flow classification

The energy of a given flow can be calculated according to eq. (7) based on the values of its features and the parameters of the statistical model inferred in Section IV-A. A given flow's energy is the negative sum of the couplings and local fields associated with its features, according to a given statistical model. This means that a flow that resembles the ones used to infer the model is likely to be low in energy.

Since EFC is an anomaly-based classifier, the statistical model used for classification is inferred based only on benign flow samples. We would then expect the energies of benign samples to be lower than the energies of malicious samples. In other words, what the energy value of a given flow captures is how dissimilar that flow is to the set of known benign flows used to infer the model in the training phase. In terms of frequencies, this means that, if a given flow presents feature value combinations that are very frequent in benign flow samples, its energy will be low. In this sense, it is possible to classify flow samples as benign or malicious based on a chosen energy threshold: samples with energy smaller than the threshold are classified as benign, and samples with energy greater than or equal to the threshold are classified as malicious. Note that the threshold for classification can be chosen in different ways, and it can be static or dynamic. In this work, we will consider a static threshold.

Algorithm 1 shows the implementation of EFC. In lines 2-5, the statistical model for the sampled flows is inferred, as described by eqs. (4), (5), (8) and (12). Afterward, in lines 6-27, the classifier monitors the network waiting for a captured flow. When a flow is captured, its energy is calculated in lines 9-20, according to the Hamiltonian in eq. (7). The computed flow energy is compared to a known threshold (cutoff) value in line 21. In case the energy falls above the threshold, the flow is classified as malicious and should be forwarded to deep packet inspection (line 23) for assessment. Otherwise, the flow is released, and the classifier waits for another flow.

It is essential to highlight that the time complexity of the training step of EFC is O((M × Q)³ + N × M² × Q²), where N is the number of samples, M is the number of features, and Q is the size of the alphabet. Meanwhile, the complexity of the classification step for each sample is O(M²). This means that, in both steps, the complexity depends more strongly on the number of features, which can be kept small by using a feature selection mechanism, e.g., Principal Component Analysis (PCA). However, we do not currently explore any feature selection mechanisms, because we consider it to be out
of scope of this work, in which the main aim is only to present a first version of our newly proposed classifier for NIDS.

Algorithm 1 Energy-based Flow Classifier
Input: benign_flows(K×N), Q, α, cutoff
1:  import all model inference functions
2:  f_i ← SiteFreq(benign_flows, Q, α)
3:  f_ij ← PairFreq(benign_flows, f_i, Q, α)
4:  e_ij ← Couplings(f_i, f_ij, Q)
5:  h_i ← LocalFields(e_ij, f_i, Q)
6:  while Scanning the Network do
7:      flow ← wait_for_incoming_flow()
8:      e ← 0
9:      for i ← 1 to N − 1 do
10:         a_i ← flow[i]
11:         for j ← i + 1 to N do
12:             a_j ← flow[j]
13:             if a_i ≠ Q and a_j ≠ Q then
14:                 e ← e − e_ij[i, a_i, j, a_j]
15:             end if
16:         end for
17:         if a_i ≠ Q then
18:             e ← e − h_i[i, a_i]
19:         end if
20:     end for
21:     if e ≥ cutoff then
22:         stop_flow()
23:         forward_to_DPI()
24:     else
25:         release_flow()
26:     end if
27: end while

In Table V, it is possible to see that EFC has a low training cost, linear in the number of samples (N), when compared to ML-based classifiers such as Decision Tree (DT), Random Forest (RF) and Support Vector Machine (SVM). EFC's training complexity is considered to be dominated by the term NM², because the number of training samples is expected to be much bigger than both the number of features (M) and the size of the alphabet (Q), which means that the term (MQ)³ is likely not dominant over NM². Considering the implementation shown in this section, we present in the following the results obtained using EFC and ML algorithms in classification experiments.

Table V
TRAINING COMPLEXITY OF DIFFERENT ML ALGORITHMS [39] AND EFC

Algorithm    Time complexity    Notes
ANN          O(EMNK)            E: number of epochs; K: number of neurons
DT           O(MN log(N))
KNN          O(N log(K))        K: number of neighbors
RF           O(TMN log(N))      T: number of trees
SVM          O(N²)
EFC          O(NM²)

V. RESULTS

In this section, we present the results obtained for EFC and ML-based classifiers in different binary classification experiments considering three different datasets, i.e., CIDDS-001, CICIDS17, and CICDDoS19. First, we show that EFC can separate benign from malicious flows based on their energies, a result that is consistent for all considered datasets. Then, we present EFC's classification performance and compare it to the classification performance of ML-based classifiers in different experiments.

It is important to highlight that the classification experiments performed in this work were designed not only to assess the performance of different classifiers, but also to investigate their capability of adaptation to different domains, i.e., data distributions. Hence, we performed two kinds of experiments: training/testing in the same domain, and training/testing in different domains. For training/testing in the same domain, in each experiment, we assessed the average performance of the classifiers over ten different test sets, containing 10,000 benign and 10,000 malicious samples each, randomly selected from the full dataset. Models were inferred based on 80% of the test set and tested on the remaining 20%. The inferred models were then used in each experiment to assess the performance of the classifiers over another ten test sets, composed of 2,000 benign and 2,000 malicious samples from another domain (data distribution).

A. EFC characterization

To assess EFC's ability to correctly separate benign from malicious traffic flow samples, we performed classification experiments considering the datasets CIDDS-001, CICIDS17 and CICDDoS19. First, we inferred models based on benign samples from the OpenStack (simulated) environment within the CIDDS-001 dataset. These models were used to calculate the energy of different benign and malicious flow samples also coming from the simulated traffic. Figure 2A shows the energy values of 40,000 classified flow samples, a merge of the results obtained over ten randomly sampled test sets, as described in the last paragraph of the previous section. The statistical model used to calculate the energies in each test set was inferred based on 8,000 benign flows randomly sampled from the simulated traffic. Flow samples with energy values falling above the energy cutoff, defined as the 95th percentile of the benign traffic training distribution (red dashed line), would be classified as malicious, while the remaining samples would be classified as benign. It is possible to observe that the separation between the two flow classes is clear, i.e., the energy distribution of tested benign flows falls mostly on the left side of the cutoff line, while the energy distribution of tested malicious flows falls mostly on the right side of the cutoff line, as expected.

Figure 2. Energy histograms of benign (n = 20,000 in each plot) and malicious (n = 20,000 in each plot) flow samples obtained in the testing phase of a classification experiment performed over the CIDDS-001 (A), CICIDS17 (B) and CICDDoS19 (C) datasets. (Panels: A) Train/Test simulated (CIDDS-001); B) Train/Test CICIDS17; C) Train/Test CICDDoS19; y-axis: probability.) The energy threshold for classification is shown as a red dashed line and corresponds to the 95th percentile of the energy distribution obtained in the training phase.

Figure 2B-C shows the results of an analogous experiment performed on the remaining two datasets. Similarly to what is observed for CIDDS-001, when trained on CICIDS17, EFC is also capable of clearly separating the two classes (Figure 2B). This means that the benign energy histogram of tested samples falls mostly on the left side of the cutoff
line (95th percentile of the training distribution), while the malicious energy histogram of tested samples falls mainly on the right side of the cutoff line. Again, the same result can be observed for the CICDDoS19 dataset (Figure 2C). It is interesting to observe that, although the benign flow energy histograms look similar in terms of variance across the three datasets, the malicious flow energy histograms vary. In CIDDS-001, the malicious histogram has very low variance, reflecting the fact that this dataset contains only four classes of attacks and is highly imbalanced, while in CICIDS17 and CICDDoS19, the malicious energy histograms have a broader spread, reflecting the greater variability of malicious flows in those datasets.

[...] which pairs of features are similar to normal traffic (blue squares) and might be confounding the model. It is interesting to note that different kinds of attacks are characterized by different combinations of abnormal feature pairs, as expected. For instance, the most abnormal aspect of DoS attacks is the combination of number of packets and duration, while for port scan attacks it is the combination of source and destination ports. This analysis considered only the couplings and not the local fields. An analogous breakdown can be done for individual flow samples, allowing for the understanding of which features cause a specific sample to be classified as malicious or benign.

In summary, the results presented in this subsection show that EFC can correctly discriminate between the two flow classes considered, i.e., benign and malicious, and the results are consistent for all datasets considered. In addition, it was shown how the total energy of different attack classes can be broken down and analyzed in detail. This is illustrative of the white-box nature of the statistical model inferred by EFC. In the following, classification results are shown for different classifiers and compared with the results obtained for EFC.
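As an illustration of the white-box pipeline discussed above, the inference of eqs. (8) and (12) and the energy scoring of eq. (7) with a static 95th-percentile cutoff can be sketched end to end in NumPy. This is a minimal sketch under our own naming and a 0-based encoding (the last value Q−1 plays the role of the gauge value Q of eq. (10)), not the authors' reference implementation; in particular, the covariance matrix here also includes the diagonal blocks i = j, a common choice in mean-field inference, and the sum in eq. (12) is taken over features j ≠ i.

```python
import numpy as np

def energy(flow, e_ij, h_i, Q):
    """Hamiltonian of a single flow (eq. (7)); gauge values contribute zero."""
    q, N, E = Q - 1, len(flow), 0.0
    for i in range(N):
        if flow[i] == q:
            continue
        E -= h_i[i, flow[i]]
        for j in range(i + 1, N):
            if flow[j] != q:
                E -= e_ij[i, j, flow[i], flow[j]]
    return E

def fit_efc(benign, Q, alpha=0.5, percentile=95):
    """Infer couplings (eq. (8)), local fields (eq. (12)) and a static
    energy cutoff from benign flows only.  `benign` is an (S, N) integer
    array with feature values in {0, ..., Q-1}."""
    S, N = benign.shape
    q = Q - 1
    # single and pairwise frequencies with pseudocounts (eqs. (4)-(5))
    f_i = np.zeros((N, Q))
    for i in range(N):
        f_i[i] = (1 - alpha) * np.bincount(benign[:, i], minlength=Q) / S + alpha / Q
    f_ij = np.zeros((N, N, Q, Q))
    for s in range(S):
        for i in range(N):
            for j in range(N):
                f_ij[i, j, benign[s, i], benign[s, j]] += 1
    f_ij = (1 - alpha) * f_ij / S + alpha / Q ** 2
    # covariance matrix (eq. (9)), restricted to values below the gauge value
    C = np.zeros((N * q, N * q))
    for i in range(N):
        for j in range(N):
            C[i * q:(i + 1) * q, j * q:(j + 1) * q] = (
                f_ij[i, j, :q, :q] - np.outer(f_i[i, :q], f_i[j, :q]))
    # couplings from the inverse covariance matrix (eq. (8))
    e_ij = (-np.linalg.inv(C)).reshape(N, q, N, q).transpose(0, 2, 1, 3)
    # local fields via the mean-field approximation (eq. (12))
    h_i = np.zeros((N, q))
    for i in range(N):
        h_i[i] = np.log(f_i[i, :q] / f_i[i, q])
        for j in range(N):
            if j != i:
                h_i[i] -= e_ij[i, j] @ f_i[j, :q]
    # static threshold: 95th percentile of the benign training energies
    cutoff = np.percentile([energy(s, e_ij, h_i, Q) for s in benign], percentile)
    return e_ij, h_i, cutoff
```

A flow is then flagged as malicious whenever energy(flow, e_ij, h_i, Q) ≥ cutoff, mirroring lines 21-26 of Algorithm 1.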
The first metric considered is the F1-score, computed from the true positives (TP), i.e., malicious traffic correctly classified as malicious, the false positives (FP), i.e., benign traffic classified as malicious, and the false negatives (FN), i.e., malicious traffic classified as benign. The second metric, the area under the ROC curve (AUC), is one of the most widespread evaluation metrics for binary classifiers [48], [49]. The ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The AUC can be interpreted as the probability that a randomly chosen positive example will receive a higher score than a randomly chosen negative one. One of the main advantages of the AUC is that it is invariant to changes in class distribution: the ROC curve will not change if the class distribution of a test set changes, as long as the underlying conditional distributions from which the data are drawn stay the same [49], [50]. Since we are interested in evaluating domain adaptation, this metric is particularly well suited to this work.

Table VII
AVERAGE COMPOSITION OF EACH OF THE TEST SETS IN EXPERIMENT 1

  CIDDS-001 OpenStack            CIDDS-001 real traffic
  Label         Number           Label         Number
  normal        10,000           unknown        2,000
  dos            9,800           suspicious     2,000
  pingScan          20
  portScan         150
  bruteForce        30
  Total         20,000           Total          4,000

Table VIII
AVERAGE COMPOSITION OF EACH OF THE TEST SETS IN CROSS-DATASET EXPERIMENTS 2 AND 3

  CICIDS17                       CICDDoS19
  Label             Number       Label             Number
  benign            10,000       benign            10,000
  FTP Patator          170       DrDoS DNS            890
  SSH Patator           80       DrDoS LDAP           370
  DDoS               2,740       DrDoS MSSQL          890
  PortScan           1,060       DrDoS NetBIOS        880
  Bot                  110       DrDoS NTP            890
  Infiltration          20       DrDoS SNMP           200
  Brute force           50       DrDoS SSDP           890
  SQL injection         10       DrDoS UDP            890
  XSS                   10       Syn                  890
  DoS Hulk           2,730       TFTP                 890
  DoS GoldenEye      2,730       LDAP                 120
  DoS Slowloris        120       NetBIOS              140
  DoS Slowhttptest     170       MSSQL                660
                                 Portmap              400
                                 UDP                  880
                                 UDPLag               120
  Total             20,000       Total             20,000

Table VI shows the classes considered for feature discretization on the CIDDS-001 dataset. Since TCP Flags is the discrete feature with the largest number of possible values (32), the alphabet size Q was set to 32. The values of each continuous feature were clustered into a certain number of classes (or bins), up to Q classes. Classes were determined in such a way that the number of values within each class was similar for all classes. Features within the CICIDS17 and CICDDoS19 datasets were also discretized so that the number of values within each bin was similar for all bins. These discretizations are not shown here because of the high number of features (~80) in these datasets.

Table VI
CLASSES CONSIDERED FOR FEATURE DISCRETIZATION ON CIDDS-001

  Feature          List of class upper limits
  Duration         0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.01, 0.04, 1, 10, 100, ∞
  Protocol         TCP, UDP, GRE, ICMP, IGMP
  Src Port         50, 60, 100, 400, 500, 40000, 60000, ∞
  Dst Port         50, 60, 100, 400, 500, 40000, 60000, ∞
  Num. of Bytes    50, 60, 70, 90, 100, 110, 200, 300, 400, 500, 700, 1000, 5000, ∞
  Num. of Packets  2, 3, 4, 5, 6, 7, 10, 20, ∞
  TCP Flags        {(f1, f2, f3, f4, f5) | fi ∈ {0, 1}}
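The equal-frequency discretization described above (class upper limits chosen so that each bin receives a similar number of values, with an unbounded top class) can be sketched as follows; the function names and the synthetic duration data are illustrative assumptions, not the authors' code:

```python
import numpy as np

def equal_frequency_limits(values, n_bins):
    """Choose class upper limits from quantiles so that each bin
    holds roughly the same number of observed values. The last
    limit is +inf, matching the unbounded top classes in Table VI."""
    interior = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.append(np.unique(interior), np.inf)

def discretize(values, limits):
    """Map each value to the index of the first class whose upper
    limit is greater than or equal to it."""
    return np.searchsorted(limits, values, side="left")

# Synthetic flow durations: many short flows plus a long-flow tail.
rng = np.random.default_rng(1)
durations = np.concatenate([rng.exponential(0.01, size=900),
                            rng.exponential(10.0, size=100)])
limits = equal_frequency_limits(durations, 12)   # up to Q = 12 classes here
symbols = discretize(durations, limits)          # integer symbols for each flow
```

With this scheme the bin populations come out approximately balanced, as described for CIDDS-001 and for the ~80 features of CICIDS17 and CICDDoS19.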
To evaluate EFC's performance compared to other classifiers, we performed three independent experiments. The first experiment was performed on CIDDS-001: training was performed on simulated flow samples, while testing was performed on both simulated and real flow samples captured on an external server. The second and third experiments were cross-dataset experiments performed on CICIDS17 and CICDDoS19. In the former, training was performed on CICIDS17, with testing on both datasets, while in the latter, training was performed on CICDDoS19, with testing on both datasets.
Essentially, in each experiment, we measured the performance of the classifiers when trained and tested in the same domain and when trained in one domain and tested in a different one. The performance was measured as the average over ten different test sets, each composed of 10,000 benign and 10,000 malicious samples randomly selected from the full dataset, with 80% of each test set used for training and 20% for testing. The test sets containing samples from a different domain were not used for training; hence, they were composed of only 2,000 benign and 2,000 malicious samples, randomly selected from the full dataset. EFC's cutoff was defined to be at the 95th percentile of the energy distribution obtained in the training phase based solely on benign samples. This means we used a purely statistical threshold, derived only from the benign training traffic, without adjustments based on malicious samples such as other ML algorithms require. The average composition of the test sets is shown in Tables VII and VIII.
Table IX shows the average performance and standard error (95% confidence interval) of each classifier in the first experiment, considering the CIDDS-001 dataset. When trained and tested in the same simulated environment, DT is the algorithm presenting the best performance, with an F1-score of 0.999 ± 0.000 and an AUC of 0.999 ± 0.000. EFC also performs well, being the second best in terms of AUC (0.997 ± 0.001).
When trained in the simulated environment and tested in the real environment, EFC outperforms the other classifiers (both simple and ensemble methods) in F1-score (0.675 ± 0.009) and AUC (0.720 ± 0.001). It is noteworthy that all algorithms present a considerable degradation in performance when tested in a different domain, showing how sensitive the inferred models are to changes in data distribution.

Table IX
AVERAGE CLASSIFICATION PERFORMANCE AND STANDARD ERROR (95% CI) - TRAINING PERFORMED ON CIDDS-001 SIMULATED TRAFFIC

              Train/Test simulated                Train simulated/Test real
  Classifier  F1 score        AUC                 F1 score        AUC
  NB          0.043 ± 0.016   0.502 ± 0.004       0.057 ± 0.024   0.517 ± 0.016
  KNN         0.988 ± 0.001   0.994 ± 0.000       0.118 ± 0.014   0.524 ± 0.001
  DT          0.999 ± 0.000   0.999 ± 0.000       0.556 ± 0.007   0.619 ± 0.000
  SVM         0.805 ± 0.003   0.951 ± 0.002       0.531 ± 0.005   0.707 ± 0.003
  MLP         0.979 ± 0.002   0.993 ± 0.001       0.151 ± 0.016   0.596 ± 0.002
  EFC         0.975 ± 0.001   0.997 ± 0.001       0.675 ± 0.009   0.720 ± 0.001
  Ensemble
  AB          0.999 ± 0.000   1.000 ± 0.000       0.594 ± 0.022   0.630 ± 0.000
  RF          0.999 ± 0.000   1.000 ± 0.000       0.269 ± 0.018   0.714 ± 0.000

Table X shows the results of experiment two, which was performed on the CICIDS17 and CICDDoS19 datasets. When trained and tested on CICIDS17, DT is again the algorithm presenting the best performance, both in terms of F1-score (0.994 ± 0.001) and AUC (0.994 ± 0.001), though indistinguishable from MLP's AUC (0.993 ± 0.001). Notably, when trained on CICIDS17 and tested on CICDDoS19, EFC outperformed the other simple algorithms in both F1-score (0.787 ± 0.004) and AUC (0.781 ± 0.003). Again, it is possible to see that EFC is the best simple algorithm at adapting to a different data distribution when evaluating both metrics. However, when considering also ensemble methods, RF outperforms EFC, which becomes second best in terms of AUC.

Table X
AVERAGE CLASSIFICATION PERFORMANCE AND STANDARD ERROR (95% CI) - TRAINING PERFORMED ON CICIDS17

              Train/Test CICIDS17                 Train CICIDS17/Test CICDDoS19
  Classifier  F1 score        AUC                 F1 score        AUC
  NB          0.344 ± 0.049   0.413 ± 0.021       0.344 ± 0.049   0.413 ± 0.049
  KNN         0.961 ± 0.001   0.987 ± 0.001       0.457 ± 0.046   0.771 ± 0.001
  DT          0.994 ± 0.001   0.994 ± 0.001       0.168 ± 0.090   0.525 ± 0.001
  SVM         0.930 ± 0.003   0.974 ± 0.001       0.264 ± 0.025   0.664 ± 0.003
  MLP         0.961 ± 0.003   0.993 ± 0.001       0.221 ± 0.033   0.775 ± 0.003
  EFC         0.898 ± 0.003   0.975 ± 0.001       0.787 ± 0.004   0.781 ± 0.003
  Ensemble
  AB          0.991 ± 0.002   1.000 ± 0.000       0.228 ± 0.055   0.698 ± 0.002
  RF          0.997 ± 0.001   1.000 ± 0.000       0.021 ± 0.003   0.867 ± 0.001

Further, Table XI shows the results of experiment three, which was also performed on the CICIDS17 and CICDDoS19 datasets. Once more, DT outperformed the other classifiers when training and testing on the same dataset, with an F1-score of 0.998 ± 0.000 and an AUC of 0.998 ± 0.000. When tested on the CICIDS17 dataset, though, EFC achieved the best F1-score (0.641 ± 0.002), while KNN was the best in terms of AUC (0.670 ± 0.002). EFC's AUC (0.664 ± 0.002) was the second best, which means that EFC's performance was good, taking into consideration both the F1-score and the AUC. Even though this adaptation seems more challenging than the previous ones, EFC's performance was consistent across all the experiments performed.

Table XI
AVERAGE CLASSIFICATION PERFORMANCE AND STANDARD ERROR (95% CI) - TRAINING PERFORMED ON CICDDoS19

              Train/Test CICDDoS19                Train CICDDoS19/Test CICIDS17
  Classifier  F1 score        AUC                 F1 score        AUC
  NB          0.590 ± 0.006   0.428 ± 0.007       0.590 ± 0.006   0.428 ± 0.006
  KNN         0.960 ± 0.002   0.984 ± 0.001       0.397 ± 0.043   0.670 ± 0.002
  DT          0.998 ± 0.000   0.998 ± 0.000       0.259 ± 0.012   0.476 ± 0.000
  SVM         0.933 ± 0.002   0.976 ± 0.002       0.239 ± 0.009   0.538 ± 0.002
  MLP         0.968 ± 0.002   0.993 ± 0.001       0.227 ± 0.011   0.451 ± 0.002
  EFC         0.916 ± 0.002   0.981 ± 0.001       0.641 ± 0.002   0.664 ± 0.002
  Ensemble
  AB          0.995 ± 0.001   1.000 ± 0.000       0.270 ± 0.013   0.660 ± 0.001
  RF          0.997 ± 0.000   1.000 ± 0.000       0.089 ± 0.032   0.623 ± 0.000

Taken as a whole, the results presented in this subsection show that, on average, EFC is better at adapting to other domains than classical ML-based classifiers. In addition, it is possible to see that EFC achieves AUC values similar to those of the best ML algorithms when trained and tested in the same domain, showing that it is capable of performing well even though it is trained with only half of the information (benign data only) compared to the other classifiers (which use both benign and malicious data). Not using malicious samples in the training phase is likely to be the reason why EFC is so good at adapting to other domains. EFC's increased capability for domain adaptation when there is a significant difference in data distribution is a highly desirable trait in network flow-based classifiers, since changes in traffic composition are expected to be very frequent, and new kinds of attacks are generated continuously.
Finally, we believe EFC to be an interesting tool for network managers, given (i) its more realistic requirements for training (only benign traffic, which can be easily captured in the target network), (ii) its adaptability when faced with changes in traffic patterns, and (iii) the possibility of identifying which flow features cause a specific network flow to be classified as benign or malicious. However, there is still great room for improvement. One possibility would be to incorporate EFC as the first step of a two-step NIDS, in which the flow samples detected as malicious by EFC would then be sent to deep packet inspection. Another possibility is to implement a dynamic threshold that would adapt to different network situations, improving classification accuracy. There is also the possibility of performing feature selection prior to model inference, which would greatly reduce the time spent in the model inference phase and possibly also improve classification accuracy. Finally, it would be possible to implement EFC to perform traffic classification at different points in a distributed network. In the following, we present our conclusions and future work directions.
Detection of Imbalanced Network Traffic," IEEE Sensors Letters, vol. 3, no. 1, pp. 1–4, Jan. 2019.
[25] A. Yulianto, P. Sukarno, and N. A. Suwastika, "Improving AdaBoost-based intrusion detection system (IDS) performance on CIC IDS 2017 dataset," in Journal of Physics: Conference Series, vol. 1192, no. 1. IOP Publishing, 2019, p. 012018.
[26] D. Aksu, S. Üstebay, M. A. Aydin, and T. Atmaca, "Intrusion detection with comparative analysis of supervised learning techniques and Fisher score feature selection algorithm," in International Symposium on Computer and Information Sciences. Springer, 2018, pp. 141–149.
[27] J. Li, M. Liu, Z. Xue, X. Fan, and X. He, "RTVD: A real-time volumetric detection scheme for DDoS in the Internet of Things," IEEE Access, vol. 8, pp. 36191–36201, 2020.
[28] Y. Jia, F. Zhong, A. Alrawais, B. Gong, and X. Cheng, "FlowGuard: An intelligent edge defense mechanism against IoT DDoS attacks," IEEE Internet of Things Journal, 2020.
[29] M. P. Novaes, L. F. Carvalho, J. Lloret, and M. L. Proença, "Long short-term memory and fuzzy logic for anomaly detection and mitigation in software-defined network environment," IEEE Access, vol. 8, pp. 83765–83781, 2020.
[30] A. AlEroud and G. Karabatis, "A contextual anomaly detection approach to discover zero-day attacks," in 2012 International Conference on Cyber Security, 2012, pp. 40–45.
[31] D. Plonka, "FlowScan: A network traffic flow reporting and visualization tool," in LISA, 2000, pp. 305–317.
[32] A. H. Lashkari, G. Draper-Gil, M. S. I. Mamun, and A. A. Ghorbani, "Characterization of Tor traffic using time based features," in ICISSP, 2017, pp. 253–262.
[33] S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, and M. Weigt, "Inverse statistical physics of protein sequences: A key issues review," Reports on Progress in Physics, vol. 81, no. 3, p. 032601, Mar. 2018.
[34] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, "Direct-coupling analysis of residue coevolution captures native contacts across many protein families," Proceedings of the National Academy of Sciences of the United States of America, vol. 108, no. 49, pp. E1293–E1301, Dec. 2011.
[35] F. Y. Wu, "The Potts model," Reviews of Modern Physics, vol. 54, no. 1, pp. 235–268, Jan. 1982.
[36] E. T. Jaynes, "Information theory and statistical mechanics. II," Physical Review, vol. 108, no. 2, pp. 171–190, May 1957.
[37] B. Giraud, J. M. Heumann, and A. S. Lapedes, "Superadditive correlation," Physical Review E, vol. 59, no. 5, p. 4983, 1999.
[38] A. Georges and J. S. Yedidia, "How to expand around mean-field theory using high-temperature expansions," Journal of Physics A: Mathematical and General, vol. 24, no. 9, p. 2173, 1991.
[39] A. L. Buczak and E. Guven, "A survey of data mining and machine learning methods for cyber security intrusion detection," IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153–1176, 2015.
[40] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
[41] J. R. Quinlan, "Simplifying decision trees," International Journal of Man-Machine Studies, vol. 27, no. 3, pp. 221–234, 1987.
[42] P. H. Swain and H. Hauska, "The decision tree classifier: Design and potential," IEEE Transactions on Geoscience Electronics, vol. 15, no. 3, pp. 142–147, 1977.
[43] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[44] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in European Conference on Machine Learning. Springer, 1998, pp. 4–15.
[45] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.
[46] Y. Freund and R. E. Schapire, "Experiments with a new boosting algorithm," in ICML, vol. 96. Citeseer, 1996, pp. 148–156.
[47] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston, "Random forest: a classification and regression tool for compound classification and QSAR modeling," Journal of Chemical Information and Computer Sciences, vol. 43, no. 6, pp. 1947–1958, 2003.
[48] N. Japkowicz and M. Shah, Evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press, 2011.
[49] D. Brzezinski and J. Stefanowski, "Prequential AUC: properties of the area under the ROC curve for data streams with concept drift," Knowledge and Information Systems, vol. 52, no. 2, pp. 531–562, 2017.
[50] S. Wu, P. Flach, and C. Ferri, "An improved model selection heuristic for AUC," in European Conference on Machine Learning. Springer, 2007, pp. 478–489.

Camila F. T. Pontes is a student at the University of Brasilia (UnB), Brasilia, DF, Brazil. She received her M.Sc. degree in Molecular Biology in 2016 from UnB and is currently an undergrad student at the Department of Computer Science (CIC/UnB). Her research interests are Computational and Theoretical Biology and Network Security.

Manuela M. C. de Souza is an undergrad Computer Science student at the University of Brasilia (UnB), Brasilia, DF, Brazil. Her research interest is Network Security.

João J. C. Gondim was awarded an M.Sc. in Computing Science at Imperial College, University of London, in 1987 and a Ph.D. in Electrical Engineering at UnB (University of Brasilia, 2017). He is an adjunct professor at the Department of Computing Science (CIC) at UnB, where he is a tenured member of faculty. His research interests are network, information and cyber security.

Matt Bishop received his Ph.D. in computer science from Purdue University, where he specialized in computer security, in 1984. His main research area is the analysis of vulnerabilities in computer systems. The second edition of his textbook, Computer Security: Art and Science, was published in 2002 by Addison-Wesley Professional. He is currently a co-director of the Computer Security Laboratory at the University of California Davis.

Marcelo Antonio Marotta is an adjunct professor at the University of Brasilia, Brasilia, DF, Brazil. He received his Ph.D. degree in Computer Science in 2019 from the Institute of Informatics (INF) of the Federal University of Rio Grande do Sul (UFRGS), Brazil. His research involves Heterogeneous Cloud Radio Access Networks, Internet of Things, Software Defined Radio, Cognitive Radio Networks, and Network Security.