
Hindawi
Wireless Communications and Mobile Computing

Volume 2020, Article ID 8883696, 13 pages
https://doi.org/10.1155/2020/8883696
Research Article
Data-Driven Cybersecurity Knowledge Graph Construction for
Industrial Control System Security
Guowei Shen,1,2,3 Wanling Wang,1 Qilin Mu,2,3 Yanhong Pu,2,3 Ya Qin,1 and Miao Yu4
1 Guizhou Provincial Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
2 Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory, Guiyang 550022, China
3 CETC Big Data Research Institute Co., Ltd., Guiyang 550022, China
4 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
Correspondence should be addressed to Miao Yu; yumiao@iie.ac.cn
Received 14 July 2020; Revised 22 September 2020; Accepted 31 October 2020; Published 28 December 2020
Academic Editor: Ding Wang
Copyright © 2020 Guowei Shen et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Industrial control systems (ICS) underpin many key industries, and a successful attack on them can cause heavy losses. Traditional passive cybersecurity defenses have difficulty dealing effectively with increasingly complex threats, and knowledge graphs offer a new way to analyze and process data in cybersecurity analysis. We propose a novel overall framework for data-driven industrial control network security defense, which integrates fragmented multisource threat data with an industrial network layout through a cybersecurity knowledge graph. To better correlate data when constructing the knowledge graph, we propose a distant supervised relation extraction model, ResPCNN-ATT. It is based on a deep residual convolutional neural network and an attention mechanism, reduces the influence of noisy data in distant supervision, and better extracts deep semantic features in sentences by using deep residuals. We empirically demonstrate the performance of the proposed method in the field of general cybersecurity using the dataset CSER, on which the proposed model achieves higher accuracy than other models. The dataset ICSER is then used to construct a cybersecurity knowledge graph (CSKG) based on an analysis of specific industrial control scenarios, and the knowledge graph is visualized for further security analysis of the industrial control system.
1. Introduction
Industrial control systems (ICS), which involve key industries such as oil and gas production, electricity, chemical processing, transportation, and manufacturing, have seen increasing security problems and cyberattacks, such as Stuxnet, in recent years due to access to the Internet. Stuxnet [1] infected and manipulated programmable logic controllers (PLCs) and caused serious physical damage to equipment, which led to system failure. In 2016, the power system of Ukraine was attacked by a variant of the BlackEnergy malicious code [2], resulting in a large-scale power outage that affected 225,000 citizens. An industrial control network involves a lot of important infrastructure; in the event of a cyberattack, huge losses will be caused, endangering the economy, public safety, human life, and other aspects [3]. With the support of 5G technology, the industrial Internet will be integrated with the development of 5G [4], which promotes industrial development while introducing more security risks, so it is necessary to further improve the guarantee of industrial network security.
Data-driven prediction and analysis of cybersecurity incidents is a hot topic in current cybersecurity research. By mining correlations among industrial control network data, the asset equipment information of the industrial control system can be associated with corresponding vulnerabilities, in order to identify potential internal and external threat relationships at fine granularity and to construct an asset threat graph based on a specific industrial control network structure. It is more explicit to see threat situation
in security analysis of ICS by using visualization technology, which provides accurate support for industrial control network security protection decision-making. Currently, there are numerous open source threat intelligence sources periodically updating threat feeds fed into various analytical solutions. Security news, security forums, and vulnerability information are important data sources for cyberthreat intelligence. However, the above data is fragmented, and it is difficult to correlate such multisource data.
A cybersecurity knowledge graph (CSKG) is a powerful tool for data-driven threat intelligence computing. Through a CSKG, researchers can intuitively see cybersecurity entities and the relations between them, such as the utilization relation between malware and vulnerabilities, the employment relation between attackers and organizations, and the ownership relation between software and vulnerabilities. Relation extraction is a very important task in the construction of a CSKG from unstructured data.
In relation extraction, the lack of labeled data for training is a challenge when constructing a network security knowledge graph. A common technique for coping with this difficulty is distant supervision in natural language processing. Distant supervision is an effective method of automatically labeling training data. However, the assumption in the distant supervision method is too strong, leading to the wrong label problem.
In this paper, we first propose a novel overall framework of data-driven industrial control network security defense. In order to better mine entity relations in cybersecurity data, we propose a novel cybersecurity relation extraction model, ResPCNN-ATT, which combines Residual Learning, Piecewise Convolutional Neural Networks (PCNN), and multi-instance ATTention. The following list details the main contributions of the article:
(i) A novel data-driven industrial network security defense framework is proposed, which structures fragmented multisource data and integrates it with the industrial network layout
(ii) A distant supervised cybersecurity relation extraction model based on ResPCNN-ATT is proposed to reduce the impact of noisy data in open source threat intelligence data sources
(iii) ResPCNN-ATT first uses the pretrained word vector and the position vector between cybersecurity entity pairs as the model input and then uses PCNN to extract the semantic features. Deep residual learning is used to solve the vanishing-gradient problem caused by noisy data. A multi-instance attention mechanism is used to calculate the correlation between each instance and the corresponding relation to reduce the impact of noisy data
(iv) The datasets CSER and ICSER are constructed. We first empirically demonstrate the performance of the proposed method in the field of general cybersecurity by using dataset CSER. We then analyze the asset information and network layout of the Electric Power and Intelligent Control Testbed (EPIC) and use dataset ICSER to construct a cybersecurity knowledge graph for EPIC, visualizing the knowledge graph for further security analysis of the industrial control system
The rest of the paper is organized as follows. We describe related works in Section 2 and propose the overall framework in Section 3. The structure definition of CSKG is analyzed in Section 4. The cybersecurity relation extraction model and its details are shown in Section 5, and performance evaluation of the model is discussed in Section 6. In Section 7, we construct and visualize a cybersecurity knowledge graph based on a specific industrial control scenario. Section 8 draws conclusions.

2. Related Work
Industrial control systems (ICS) consist of integrated hardware and software components for monitoring and controlling various industrial processes, often deployed in critical infrastructure such as water treatment plants, power grids, and gas pipelines [5]. In recent years, more and more components of ICS have been connected to the Internet, exposing more and more security vulnerabilities that may be exploited by attackers [6]. Vulnerabilities in the Internet are important internal causes of network security risks. There are vulnerabilities at all levels and links of the information network; once exploited by malicious actors, they will affect normal operation of the system and its services [7]. Due to the increasing number of attack events, the serious consequences of attacks, and the many threats in the complex industrial network environment [8, 9], it is crucial to study industrial network security. Traditional passive defense measures of cybersecurity have difficulty effectively dealing with the increasingly complex threats; we must strengthen cybersecurity analysis capability based on vulnerabilities, threat intelligence, and other aspects and enhance the active defense capability of industrial network security.
Structuring and organizing data can improve the efficiency and accuracy of cybersecurity analysis. Sadighian et al. [10] proposed ONTIDS, an ontology alarm association framework based on context information. By defining the ontology structure, security alarms are represented and stored, and the association between alarm information is regularized; on this basis, rules are set to filter alarms to reduce the false alarm rate and facilitate network security analysis. In order to further achieve cybersecurity information correlation and semantic analysis, many studies are devoted to improving the interpretation, feature correlation, and data processing of the alarm log, reducing the false alarm rate, and enhancing cybersecurity analysis capability [11–13].
Data-driven cybersecurity event prediction and analysis are hot topics in current cybersecurity research [14]. Shu et al. introduced a new methodology that models threat discovery as a graph computation problem for threat intelligence [15]. Yu et al. proposed a relation extraction method for the construction of a knowledge graph in the food field [16]. As a semantic knowledge base, a knowledge
[Figure 1 diagram, layer labels: data sources (asset equipment, network layout, vulnerability database); data processing (focused crawling, Elasticsearch database); cybersecurity knowledge graph (CSKG) construction (cybersecurity entity extraction, cybersecurity relation extraction, cybersecurity KG storage, Neo4j database); visualization (Cypher); cybersecurity analysis (precise decision-making for asset security, hidden threat correlation prediction, knowledge iteration to assist decision-making).]
Figure 1: The overall framework of data-driven industrial control network security analysis.
graph is a powerful tool for managing large-scale knowledge consisting of entities and relations between them. Using a knowledge graph to analyze and process data provides a new idea for cybersecurity analysis: it integrates fragmented open source data, identifies correlations within it, associates asset equipment in ICS with corresponding vulnerability information, uncovers potential internal and external threat relations, and enables more accurate analysis of industrial control network security. It is crucial to mine the association of data resources efficiently and accurately. Natural language processing technology [17–19] tends to only consider the domain name and IP address when analyzing the relation between malicious entities, both of which have very simple relation definitions. Pingle et al. proposed the RelExt [20] system, which strives to improve various cyberthreat representation schemes, especially cybersecurity knowledge graphs (CSKG), by predicting the relations between cybersecurity entities identified by a cybersecurity named entity recognizer. VIEM [21] analyzed a large number of inconsistencies by extracting software names and software versions in public security vulnerability reports, so the extraction of relations is more complicated.
Relation extraction (RE) is one of the most important topics in NLP. Many relation extraction methods have been proposed [22–24], such as bootstrapping, unsupervised relation discovery, and supervised classification. Most existing supervised RE methods require a large amount of labeled relation-specific training data, which is very time-consuming and labor-intensive to produce. Distant supervision was proposed to automatically generate training data. Under the framework of distant supervised learning, some recent work [25–28] attempts to use deep neural networks in relation prediction. Although distant supervision is an effective strategy to automatically label training data, it always suffers from the wrong label problem.

3. Overall Framework
There are numerous open source threat intelligence sources periodically updating threat feeds fed into various analytical solutions; structuring these data and applying them to specific scenarios is significant for cybersecurity analysis. As shown in Figure 1, we propose a data-driven industrial control network security analysis framework based on a cybersecurity knowledge graph. We combine threat intelligence such as third party attack reports and vulnerability libraries with asset network layouts, so that the internal network layout and the threat information corresponding to assets in networks are integrated with external threat intelligence. A knowledge graph extends the problem of cybersecurity analysis to the study of the graph structure; graph-based analysis is conducive to the development of effective system protection, detection, and response mechanisms.
We first analyze ICS scenarios to identify asset equipment and communication layout. On this basis, we mine external vulnerability information from vulnerability libraries such as Cybersecurity and Infrastructure Security Agency (CISA) (https://www.us-cert.gov/ics), National Vulnerability Database (NVD) (https://nvd.nist.gov/), Common Weakness Enumeration (CWE) (https://cwe.mitre.org/), and Common Vulnerabilities and Exposures (CVE) (http://cve.mitre.org/). We collect data by focused crawling and obtain the key corpus for constructing a knowledge graph after processing. We then utilize cybersecurity entity identification and relation extraction technology to form a cybersecurity knowledge graph (CSKG), offering structured analysis data for specific cybersecurity scenarios. Based on the constructed CSKG, we can use visualization technology to show the connection between assets and threats clearly; it becomes easier to query entities, relations, and paths. We conduct further research on the basis of the knowledge graph,
[Figure 2 diagram: EPIC network with generation, transmission, microgrid, and smart home segments, showing switches, access points, PLCs, and IEDs together with the SCADA workstation and historian; links are labeled GOOSE, MMS, HSR, and wireless. Legend: AP, access point; IED, intelligent electrical device; SW, network switches; CSW3, firewall; PLC, programmable logic control; VSD, variable speed drive.]
Figure 2: The communication layout of EPIC.
utilizing knowledge reasoning technology to forecast the correlation of threats and assets, to more comprehensively analyze industrial control network security.
We have done a lot of research on the key technologies of the knowledge graph. Information extraction, as a key technology of CSKG, is of great significance in the entire architecture. Cybersecurity entities are characterized by mixed Chinese and English, confusing classification, and unclear features, and there are very few existing related datasets, leading to difficulties in cybersecurity entity relation extraction.
To address the lack of related datasets, we construct dataset CSER for general cybersecurity relation extraction and dataset ICSER for industrial control network relation extraction. First, the cybersecurity entity recognition model based on FT-CNN-BiLSTM-CRF proposed by Qin et al. [29] is used to extract cybersecurity entity pairs. This method uses artificial feature templates to extract local context features and further uses a neural network to automatically extract character features and global text features. Cybersecurity entity pairs were used to manually annotate some of the relation extraction corpora and to match entity pairs with text data from vulnerability databases to form the final datasets. Finally, the cybersecurity relation extraction dataset CSER and the industrial control network relation extraction dataset ICSER are constructed.

4. CSKG Structure Definition
4.1. Scenario Analysis. In this paper, we take the Electric Power and Intelligent Control Testbed (EPIC) from iTrust Labs (https://itrust.sutd.edu.sg/itrust-labs-home/itrust-labs_epic/) as a specific industrial control network scenario. We analyze the network layout and list the key asset equipment and resources in EPIC.
EPIC is a power testbed that maps a small smart grid system in real life, including four stages of generation, transmission, microgrid, and smart home; each stage is controlled by its own PLC/controller. There are communication channels between SCADA, distributed control system (DCS), and energy management system (EMS) and each PLC/controller. Attackers can exploit vulnerabilities to enter the communication network, maliciously manipulate the control flow, and launch a DDoS attack on the PLC control flow, after which the system cannot work normally. Attackers can also utilize the communication channel to enter the SCADA workstation and operate on the SMA portal to launch more attacks.
According to [30], the communication layout of EPIC is shown in Figure 2. It is composed of a SCADA workstation, historian, programmable logic controllers (PLCs), intelligent electrical devices (IEDs), access points (APs), and switches (SWs), and redundancy in the ring network is achieved using high availability seamless redundancy (HSR) and media redundancy protocol (MRP).
EPIC uses the IEC 61850 standard as the communication protocol for automation systems. There are two main protocols: Manufacturing Message Specification (MMS) and Generic Object-Oriented Substation Event (GOOSE). It allows data communication between IED, PLC, and SCADA workstations. PLC uses MMS to communicate with SCADA
workstations and IEDs and communicate through GOOSE in four stages. The fieldbus communication between physical process and PLC, Master PLC, and SCADA of each stage is achieved through optional wired and wireless channels.
The key asset resources in EPIC [31] mainly include the following: the SCADA system, which uses Pcvue in EPIC and runs on a personal computer equipped with the Windows operating system; the PLCs, which use WAGO's PLC series PFC200 to perform logic control in EPIC, are located on the control and network panel, work based on firmware and control logic programs, and in a few cases use Modbus TCP/IP communication; Codesys (Codesys v3), which is the programming standard of the PLCs; the IEDs, which are SIPROTEC relays from Siemens used for protection and control in EPIC, are located in the control center, use the IEC 61850 standard to communicate with the rest of the system, and maintain the entire process by firmware and control logic; the VSD, for which SEW Eurodrive and the corresponding motor are used in EPIC, located in the motor/generator room; and the network switches and access points, which are located in the network control panel and adopt HIRSCHMANN products.
4.2. Ontology Structure. EPIC-related vulnerabilities are mined to form a knowledge graph that corresponds to the network layout and asset information of EPIC. For the convenience of research, the study mainly considers assets involved in the communication layout of EPIC. In this paper, we use assets as keywords to collect strongly correlated information from vulnerability databases and form a relation extraction corpus with common vulnerabilities in ICS. The communication layout in EPIC is mapped into multiple groups of bidirectional communication relations between nodes and represented by triples. The connection between internal network layout and external threat information is established through the matching between nodes and specific asset information, thus forming the final industrial control network security knowledge graph. The ontology structure we define in this paper is shown in Figure 3.
Figure 3: Ontology structure.
We define 9 relations including model, _have_, version_, AKA, version, _by_, CVSS_score, module, help_out, and conn and additionally define two relations, comm and asset_info, to represent the connection relation in the EPIC communication network and asset information. There are 11 relations in total. We use <head, tail, relation> to identify the head entity, tail entity, and the relation between them. In this paper, the information of the network layout is mapped into triples <asset1, asset2, comm>, such as <MIED1, MIED2, comm>. Furthermore, <asset, Product, asset_info> combines the internal network layout and external threat intelligence by connecting asset nodes with the product information used by them. Through analysis of vulnerability databases, the vulnerability number is associated with the CVSS score, solution, attack vector, and other relevant vulnerability numbers, making vulnerability analysis more multidimensional.

5. The Proposed Model
In this section, we describe the architecture of the proposed cybersecurity entity relation extraction model and then introduce each component of the model in detail.
Under the framework of distant supervised learning, the problem of insufficient labeled data in deep learning can be solved, but at the same time, it also brings some problems, such as low-quality and wrongly labeled data. This would have a great impact on subsequent tasks of entity relation extraction. In view of the above problems, we propose a distant supervised relation extraction model ResPCNN-ATT based on the deep residual neural network and attention mechanism. The framework is shown in Figure 4. The model is mainly composed of a vector representation layer, a deep residual convolutional network layer, and a multi-instance attention layer.
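As an illustration of the triple representation defined in Section 4.2, the following minimal Python sketch maps communication links to bidirectional <asset1, asset2, comm> triples and attaches product information through asset_info triples. The helper names (`Triple`, `comm_triples`, `asset_info_triples`, `neighbors`) and the product label string are hypothetical choices made for this example; only the triple layout, the relation names comm and asset_info, and the <MIED1, MIED2, comm> example come from the text.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Triple(NamedTuple):
    """A <head, tail, relation> triple, in the order used in Section 4.2."""
    head: str
    tail: str
    relation: str


def comm_triples(links: Iterable[Tuple[str, str]]) -> List[Triple]:
    """Map each communication link to two triples, one per direction."""
    triples = []
    for a, b in links:
        triples.append(Triple(a, b, "comm"))
        triples.append(Triple(b, a, "comm"))
    return triples


def asset_info_triples(products: Dict[str, str]) -> List[Triple]:
    """Connect each asset node to the product it uses."""
    return [Triple(asset, product, "asset_info")
            for asset, product in products.items()]


def neighbors(triples: List[Triple], head: str, relation: str) -> List[str]:
    """Return all tails reachable from `head` through `relation`."""
    return [t.tail for t in triples
            if t.head == head and t.relation == relation]


# <MIED1, MIED2, comm> is the example given in the text.
links = [("MIED1", "MIED2")]
# Illustrative product label only; the exact product strings are an assumption.
products = {"MIED1": "SIPROTEC relay"}

graph = comm_triples(links) + asset_info_triples(products)
print(neighbors(graph, "MIED2", "comm"))        # assets MIED2 talks to
print(neighbors(graph, "MIED1", "asset_info"))  # product used by MIED1
```

Storing both directions of each comm link keeps lookups symmetric at the cost of doubling the number of communication triples; a graph database could instead store one edge and query it in both directions.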
[Figure 4 diagram: the example sentence "Alies discover Chrome has Struts2 vulnerability" passes through the vector representation layer (word and position vectors), the ResPCNN layer (convolution, residual convolution block with identity shortcut computing f(x) + x, and piecewise max-pooling), and the multi-instance attention (ATT) layer with weights α1, α2, …, αi.]
Figure 4: Cybersecurity relation extraction model based on ResPCNN-ATT.
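The data flow summarized in Figure 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for pretrained word vectors and trained weights, the distance clipping (`max_dist`), the way segments are split at the entity positions, and all function names are choices made for this example. The dimensions (word vector 50, position vector 5, window size 3, hidden size 230) are the values listed in Table 1, and the relative offsets follow the sign convention of Figure 5 (word index minus entity index).

```python
import numpy as np


def relative_positions(n_words, entity_index):
    """Offset of each word from an entity (word index minus entity index)."""
    return np.arange(n_words) - entity_index


def build_input(word_vecs, pos_table, e1, e2, max_dist):
    """Concatenate word vectors with two position embeddings per word,
    giving rows of size d_s = d_c + 2 * d_p."""
    n = word_vecs.shape[0]
    idx1 = np.clip(relative_positions(n, e1), -max_dist, max_dist) + max_dist
    idx2 = np.clip(relative_positions(n, e2), -max_dist, max_dist) + max_dist
    return np.concatenate([word_vecs, pos_table[idx1], pos_table[idx2]], axis=1)


def conv1d_same(x, kernel, bias):
    """1-D convolution over the word axis with zero padding (length preserved).
    x: (n, d_in), kernel: (w, d_in, d_out), bias: (d_out,)."""
    w = kernel.shape[0]
    left = (w - 1) // 2
    padded = np.pad(x, ((left, w - 1 - left), (0, 0)))
    rows = [np.tensordot(padded[i:i + w], kernel, axes=([0, 1], [0, 1]))
            for i in range(x.shape[0])]
    return np.stack(rows) + bias


def relu(x):
    return np.maximum(x, 0.0)


def residual_block(x, k1, b1, k2, b2):
    """Two convolution + ReLU layers with an identity shortcut: f(x) + x."""
    h = relu(conv1d_same(x, k1, b1))
    h = relu(conv1d_same(h, k2, b2))
    return h + x


def piecewise_max_pool(features, e1, e2):
    """Max-pool three segments split at the two entity positions."""
    first, second = sorted((e1, e2))
    segments = [features[:first + 1],
                features[first + 1:second + 1],
                features[second + 1:]]
    width = features.shape[1]
    return np.concatenate([seg.max(axis=0) if len(seg) else np.zeros(width)
                           for seg in segments])


rng = np.random.default_rng(0)
words = ["Alies", "discover", "Chrome", "has", "XSS", "vulnerability"]
e1, e2 = 2, 4                       # positions of "Chrome" and "XSS"
d_c, d_p, hidden, w = 50, 5, 230, 3
max_dist = 30                       # assumed clipping range for offsets

word_vecs = rng.normal(size=(len(words), d_c))       # stand-in for word2vec
pos_table = rng.normal(size=(2 * max_dist + 1, d_p))
x = build_input(word_vecs, pos_table, e1, e2, max_dist)

k0 = rng.normal(scale=0.1, size=(w, d_c + 2 * d_p, hidden))
k1 = rng.normal(scale=0.1, size=(w, hidden, hidden))
k2 = rng.normal(scale=0.1, size=(w, hidden, hidden))
b0, b1, b2 = np.zeros(hidden), np.zeros(hidden), np.zeros(hidden)

h = relu(conv1d_same(x, k0, b0))            # initial convolution
h = residual_block(h, k1, b1, k2, b2)       # residual convolution block
sentence_vec = piecewise_max_pool(h, e1, e2)
print(x.shape, h.shape, sentence_vec.shape)
```

The identity shortcut requires the block's input and output to have the same shape, which is why the first convolution maps the input to the hidden size before the residual block is applied.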
The model first uses the pretrained word vector and the position vector between entity pairs as input, which can highlight the role of the two entities, and then uses the piecewise convolutional neural networks to extract semantic features. At the same time, deep residual learning is introduced to solve the vanishing-gradient problem caused by noisy data, so as to extract more effective semantic features. Finally, in order to better capture the more important semantic features in sentences, the multi-instance attention mechanism is used to calculate the correlation between instances and the corresponding relation, so as to reduce the impact of noisy data and improve the performance of relation extraction.
5.1. Vector Representation. The vector representation layer in the model mainly includes word embedding and position embedding.
5.1.1. Word Embedding. Before training the relation extraction model, the text data needs to be vectorized so that the model can read the data. Compared with traditional one-hot coding, word vector mapping can represent more semantic and syntactic information. Word vector mapping maps each word in the text to a k-dimensional real-valued vector. It is a distributed representation of words. When training a neural network model, the most common method is to randomly initialize all parameters and then use an optimization algorithm to optimize the parameters. Research shows that when a neural network is initialized with a pretrained word vector, the parameters can converge to a better local minimum.
For a given sentence $X = \{x_1, x_2, \cdots, x_n\}$ consisting of n words, we use word2vec to map each word to a low-dimensional real-valued vector space, then perform word vector processing on the sentence, and finally get a vector representation of each word in the sentence, to form a word vector query matrix $D_c$. Each input training sequence can be mapped by the word vector query matrix $D_c$ to obtain the corresponding real-valued vector $x_t = \{w_1, w_2, \cdots, w_n\}$.
5.1.2. Position Embedding. In the relation extraction task, we focus on finding the relation of entity pairs. Words that are often close to the entity are more able to highlight the relation between the two entities, such as some verbs: attack, use, etc. Therefore, in order to make full use of the information in the sentence, the position of each word in the sentence relative to the two entities is an important feature in the relation extraction task. This paper uses the position vector (position embeddings (PE)) mapping representation method proposed by Zeng et al.; that is, the relative distances between the current word and entities $e_1$ and $e_2$ are stitched together and converted into a vector representation through embedding. In sentence position vectorization, if the dimension of the word vector is $d_c$ and the dimension of the position vector is $d_p$, then the dimension of the sentence vector is
$$d_s = d_c + d_p \times 2. \tag{1}$$

Word            d_c     d_p × 2
Alies           ……      -2   -4
discover        ……      -1   -3
Chrome          ……       0   -2
has             ……       1   -1
XSS             ……       2    0
vulnerability   ……       3    1
Figure 5: Position embedding.

For example, the vectorized representation of “Alies discover Chrome has XSS vulnerabilities” is shown in Figure 5, where “Chrome” and “XSS” in the sentence correspond to entities $e_1$ and $e_2$, respectively. Then, the distance
from “Alies” to “Chrome” is 2, the distance from “Alies” to “XSS” is 4, the distance from “vulnerability” to “Chrome” is -3, and the distance from “vulnerability” to “XSS” is -1.
5.2. Deep Residual Neural Network. In cybersecurity relation extraction tasks, the main challenge is that the length of the input sentence is variable, and important feature information may appear in any area of the sentence. Therefore, in order to be able to use all local features and predict relations globally, this paper uses a piecewise convolutional neural network (PCNN) model to extract semantic features in sentences.
In this paper, a residual convolution block is designed for residual learning. Each residual convolution block is a sequence composed of two convolution layers. After each convolution layer, the activation function ReLU is used for nonlinear mapping, and features are then extracted using local max pooling. The kernel size of all convolution operations in the residual convolution block is w, and the newly generated features are guaranteed to be the same size as the original ones through the border padding operation. The convolution kernels of the two convolution layers are $W_1, W_2 \in \mathbb{R}^{w \times 1}$. The first layer of the residual convolution block is
$$c_{i,1} = f(W_1 \cdot c_{i,i+w-1} + b_1). \tag{2}$$
The second layer is
$$c_{i,2} = f(W_2 \cdot c_{i,i+w-1} + b_2), \tag{3}$$
where $b_1, b_2$ are bias vectors. In this paper, we optimize the residual learning to get the output vector c of the residual convolution block [32, 33].
After the semantic feature is acquired by the convolution layer, the most representative local feature is further extracted by the pooling layer. In order to capture characteristic information of different sentence structures, a piecewise max pooling process is used.
5.3. Multi-Instance Attention. In the relation extraction model, sentence-level attention is built on multiple instances, dynamically reducing the weight of noisy instances and making full use of semantic information in these sentences to obtain the final sentence vector representation.
For the instance set $S = (g_1, g_2, g_3, \cdots, g_n)$ describing the same entity pair $\langle e_i, e_j \rangle$, $g_i$ is the instance vector output by the convolution layer and n is the number of instances contained in the set S. This paper calculates the correlation degree between the instance vector $g_i$ and the relation r. In order to reduce the impact of noisy data and make full use of the semantic information contained in each instance in the set, the calculation of the instance set vector S depends on each instance $g_i$ in the set:
$$S = \sum_i \alpha_i g_i, \tag{4}$$
where $\alpha_i$ is the weight of the input instance vector $g_i$, which measures its correlation with the corresponding relation r. The calculation formula of $\alpha_i$ is as follows:
$$\alpha_i = \frac{\exp(e_i)}{\sum_k \exp(e_k)}. \tag{5}$$
$e_i$ is a query-based function, which indicates the degree of matching between the input instance vector $g_i$ and the prediction relation r.
The conditional probability of the prediction relation $p(R \mid S)$ is calculated by the softmax function:
$$p(R \mid S) = \mathrm{softmax}(\tilde{r} S + b), \tag{6}$$
where $\tilde{r}$ is the relation matrix and b represents the bias vector. $p(R \mid S)$ is used to predict the relation between pairs of cybersecurity entities:
$$\tilde{R} = \arg\max p(R \mid S). \tag{7}$$

6. Performance Evaluation
In this section, we empirically demonstrate the performance of the proposed method on datasets CSER and ICSER. The commonly used Precision-Recall (P-R) curve, AUC value, and average accuracy (P@N) are used to evaluate the model. The P-R curve is drawn with the recall rate R as the abscissa and the precision P as the ordinate, using P and R at different confidence levels. The AUC value is the area under the P-R curve. Generally, the larger the AUC value is, the better the model performs. P@N is the accuracy rate calculated over the first N relation instances.
6.1. Datasets and Parameters. In order to verify the performance of our proposed model, we build a cybersecurity entity relation (CSER) dataset. 10 types of relations were labeled. The dataset CSER is crawled from the Freebuf (https://www.freebuf.com/) website and the wooyun vulnerability database, which include network text data such as technology sharing, network security, and vulnerability information.
The set of dimensions of the word vector is {50, 60, ⋯, 300}. The set of dimensions of the position vector is {1, 2, ⋯, 10}. During the training process, the Adam optimizer is used for optimization. The value set of the learning rate is {0.01, 0.001, 0.0001}. The set of batch sizes processed in one iteration is {40, 160, 640, 1280}. In order to prevent the model from overfitting, the dropout method is used in CNN. Other parameters are shown in Table 1.
6.2. Results and Analysis. The experimental comparison in this paper mainly compares two aspects of the models.
On the one hand, it uses CNN algorithms with different performance to encode the training data and extract the semantic features in the sentence, mainly including the traditional models: CNN, PCNN, and ResPCNN.
The second aspect is based on how CNN/PCNN/ResPCNN uses the information in the bag for
Table 1: Parameters. Precision-recall

1.0
Parameters Value 0.9
CNN window size 3 0.8
CNN hidden size 230 0.7
Learning rate 0.01
Precision
0.6
Batch size 160
0.5
Epoch 60
0.4
Dimension of the position vector 5
Dropout rate 0.5 0.3
Dimension of the word vector 50 0.2

0.1
0.00 0.05 0.10 0.15 0.20 0.25 0.30
Recall
experimental comparison. Three different methods were used to process the information in the bag, namely, AVE, ONE, and ATT. AVE assigns the same weight to all the sentences in the bag for the entity pair, that is, $\alpha_i = 1/n$. ONE takes the instance vector with the highest confidence; that is, it selects the highest-scoring sentence from each bag to represent the entire bag. All models in this paper have been trained and tested on the dataset CSER. Figures 6–8 show the P-R curves of the results on different bag models. AVE brings in information from more sentences, but since it weights every sentence equally, it also introduces noise from wrongly labeled data, which reduces the performance of relation extraction, so AVE has the lowest performance of relation extraction among the bag models. The AUC value difference between ONE and ATT on model PCNN is 0.12%, which indicates that their relation extraction performance does not differ much. On models ResPCNN and CNN, the relation extraction performance of ATT is slightly higher than that of ONE; ATT achieves higher precision across the recall range.

Figure 6: The results of different bag methods AVE/ONE/ATT based on CNN (precision-recall curves for CNN_AVE, CNN_ONE, and CNN_ATT).

Figure 7: The results of different bag methods AVE/ONE/ATT based on PCNN (precision-recall curves for PCNN_AVE, PCNN_ONE, and PCNN_ATT).

From Figure 9, the AUC value of the model ResPCNN-ATT is the highest on the dataset CSER, reaching 12.68%. The model ResPCNN-ATT proposed in this paper can better extract the deep semantic information of sentences, indicating that the introduction of the ATT method can effectively reduce the redundant data in distant supervised learning.
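The three bag strategies differ only in how the weights $\alpha_i$ of Eq. (4) are chosen. The following is a minimal sketch of that difference, not the authors' implementation. It assumes each instance already has a scalar relevance score for the relation, and it assumes ATT normalizes those scores with a softmax, as is common for selective attention; the exact scoring function is not given in this section. The names `bag_weights`, `bag_vector`, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bag_weights(scores: Sequence[float], method: str) -> List[float]:
    """Return the weights alpha_i for one bag under AVE, ONE, or ATT."""
    n = len(scores)
    if n == 0:
        raise ValueError("a bag must contain at least one instance")
    if method == "AVE":
        # Every sentence in the bag gets the same weight 1/n.
        return [1.0 / n] * n
    if method == "ONE":
        # Only the highest-scoring sentence represents the bag.
        best = max(range(n), key=lambda i: scores[i])
        return [1.0 if i == best else 0.0 for i in range(n)]
    if method == "ATT":
        # Assumption: attention weights are softmax-normalized scores.
        return softmax(scores)
    raise ValueError(f"unknown bag method: {method}")


def bag_vector(instances: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted sum S = sum_i alpha_i * g_i, as in Eq. (4)."""
    dim = len(instances[0])
    return [sum(w * g[d] for w, g in zip(weights, instances)) for d in range(dim)]


# Hypothetical toy bag: three 2-dimensional instance vectors and their scores.
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.1, -1.0]

for method in ("AVE", "ONE", "ATT"):
    alphas = bag_weights(scores, method)
    rounded_alphas = [round(a, 3) for a in alphas]
    rounded_vector = [round(x, 3) for x in bag_vector(instances, alphas)]
    print(method, rounded_alphas, rounded_vector)
```

In this sketch a low-scoring (presumably noisy) instance still contributes fully under AVE, is discarded entirely under ONE unless it ranks first, and is down-weighted but not dropped under ATT.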
Figure 8: The results of different bag methods AVE/ONE/ATT based on ResPCNN (precision-recall curves for ResPCNN_AVE, ResPCNN_ONE, and ResPCNN_ATT).

Figure 9: The results of different sentence semantic feature extraction models CNN/PCNN/ResPCNN (precision-recall curves for CNN_ATT, PCNN_ATT, and ResPCNN_ATT).

As can be seen from Table 2, which compares the accuracy of the first 100, 200, and 300 relation instances on the dataset CSER, the relation extraction accuracy of ResPCNN-ATT is the highest, reaching 32.67%. However, the accuracy on the CSER dataset is lower than on other datasets. This is because the sentences in the CSER dataset mix Chinese and English: the more complicated the sentence structure, the less obvious the entity relation characteristics. In addition, the corpus is small.

Table 2: Results for the first 100, 200, and 300 extracted relation instances upon manual evaluation.

Models          P@100     P@200     P@300     Mean      AUC
CNN+AVE         0.3267    0.2537    0.2452    0.2743    0.1062
CNN+ONE         0.2971    0.3035    0.2392    0.2799    0.1096
CNN+ATT         0.3267    0.2437    0.2425    0.2710    0.1121
PCNN+AVE        0.2971    0.2587    0.2645    0.2727    0.1096
PCNN+ONE        0.3168    0.2587    0.2358    0.2705    0.1109
PCNN+ATT        0.3267    0.2736    0.2525    0.2842    0.1121
ResPCNN+AVE     0.3267    0.2686    0.2458    0.2804    0.1205
ResPCNN+ONE     0.3564    0.2786    0.2558    0.2969    0.1184
ResPCNN+ATT     0.4158    0.3084    0.2558    0.3267    0.1268

In order to further analyze the relation extraction model proposed in this paper and to verify the effectiveness of introducing residual learning, comparative experiments with convolutional layers of different depths are designed by increasing the depth of the ResPCNN-ATT model. In this paper, the number of convolutional layers is increased by increasing the number of residual convolution blocks, and the experimental comparison is performed on the CSER dataset. Figure 10 shows the P-R curves on models with different depths.

Figure 10: The results on models with different depths (precision-recall curves; legend entries include CNN-5_ATT, PCNN-5_ATT, ResPCNN-3_ATT, ResPCNN-7_ATT, and ResPCNN-9_ATT).

7. CSKG Construction and Visualization for ICS

The proposed model ResPCNN-ATT performs well on the dataset CSER, and further, we apply ResPCNN-ATT to the relation extraction task in the construction of a knowledge graph for EPIC.

7.1. Relation Extraction. We analyze key assets and the communication relations between the assets in EPIC and obtain datasets through labeling with distant supervision. Due to the need for strong data correlation, after filtering and cleaning, 19,838 examples of industrial control network security entity relations were finally formed. 15,937 sentences were randomly selected as training data, which included 3838 entity pairs, and 4001 sentences were selected as test data, which included 876 entity pairs.

In this paper, experiments are carried out on dataset ICSER with the depth of the ResPCNN-ATT model set to 3 and 5, respectively, corresponding to different numbers of convolutional layers. Figure 11 shows the P-R curves at different depths. The P-R curves show the effectiveness of introducing residual learning when the model is shallow, such as at depths 3 and 5.

Figure 11: The results of ResPCNN-ATT with different depths on dataset ICSER (precision-recall curves for ResPCNN-3_ATT and ResPCNN-5_ATT).

Table 3 shows the prediction accuracy and AUC values on the test set for the first 100, 200, and 300 relation instances of the model at the two depths. On the complex industrial control network security dataset, the model has performed well.

Table 3: Results for the first 100, 200, and 300 extracted relation instances.

Models            P@100     P@200     P@300     Mean      AUC
ResPCNN-3_ATT     0.6237    0.4726    0.4252    0.5072    0.2277
ResPCNN-5_ATT     0.6435    0.5174    0.4850    0.5486    0.2343
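The P@N columns in Tables 2 and 3 report the precision among the N highest-confidence extracted relation instances, and the Mean column averages the three P@N values. The sketch below shows one common way to compute these quantities; it assumes predictions are ranked by a confidence score and each carries a correctness flag. The function names and the toy predictions are hypothetical, and only the last example uses figures taken from Table 2.

```python
from typing import List, Sequence, Tuple


def rank_by_confidence(predictions: Sequence[Tuple[float, bool]]) -> List[bool]:
    """Sort (confidence, is_correct) pairs by descending confidence."""
    ordered = sorted(predictions, key=lambda p: p[0], reverse=True)
    return [is_correct for _, is_correct in ordered]


def precision_at_n(ranked_correct: Sequence[bool], n: int) -> float:
    """Fraction of correct predictions among the n top-ranked ones."""
    if not 1 <= n <= len(ranked_correct):
        raise ValueError("n must be between 1 and the number of predictions")
    return sum(ranked_correct[:n]) / n


def mean_precision(values: Sequence[float]) -> float:
    """Arithmetic mean of several P@N values."""
    return sum(values) / len(values)


# Hypothetical toy predictions: (confidence, whether the relation is correct).
toy_predictions = [(0.9, True), (0.2, False), (0.8, False), (0.7, True), (0.4, True)]
ranked = rank_by_confidence(toy_predictions)
print([precision_at_n(ranked, n) for n in (1, 2, 4)])  # [1.0, 0.5, 0.75]

# P@100, P@200, P@300 of ResPCNN+ATT as reported in Table 2.
reported = [0.4158, 0.3084, 0.2558]
print(round(mean_precision(reported), 4))  # 0.3267
```

The last line reproduces the Mean value reported for ResPCNN+ATT in Table 2.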
Figure 12: Part of relations of asset node SCADA workstation.
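Relations such as those around the SCADA workstation in Figure 12 can be pictured as (head, relation, tail) triples, with asset-to-vulnerability links found by following a path of relations. The sketch below is only an illustration under stated assumptions: the paper does not specify its storage backend, and the `TripleStore` class, the relation names (`communicates_with`, `uses_product`, `has_vulnerability`, `has_weakness`), and every edge and identifier in the example are hypothetical placeholders rather than data from EPIC.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


class TripleStore:
    """A tiny in-memory store of (head, relation, tail) triples."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, head: str, relation: str, tail: str) -> None:
        self._edges[head].append((relation, tail))

    def neighbors(self, head: str, relation: Optional[str] = None) -> List[str]:
        """Tails reachable from head, optionally restricted to one relation."""
        return [t for r, t in self._edges.get(head, []) if relation is None or r == relation]

    def follow(self, start: str, path: Sequence[str]) -> List[str]:
        """Follow a fixed sequence of relations starting from one node."""
        frontier = [start]
        for relation in path:
            next_frontier: List[str] = []
            for node in frontier:
                for tail in self.neighbors(node, relation):
                    if tail not in next_frontier:
                        next_frontier.append(tail)
            frontier = next_frontier
        return frontier


def build_example_store() -> TripleStore:
    """Build a store with made-up edges (all hypothetical, for illustration)."""
    store = TripleStore()
    store.add("SCADA workstation", "communicates_with", "HMI")
    store.add("SCADA workstation", "uses_product", "EXAMPLE-SOFTWARE")
    store.add("EXAMPLE-SOFTWARE", "has_vulnerability", "CVE-EXAMPLE-0001")
    store.add("EXAMPLE-SOFTWARE", "has_vulnerability", "CVE-EXAMPLE-0002")
    store.add("CVE-EXAMPLE-0001", "has_weakness", "CWE-EXAMPLE")
    return store


store = build_example_store()
# Vulnerabilities linked to the asset through the products it uses.
print(store.follow("SCADA workstation", ["uses_product", "has_vulnerability"]))
```

A production knowledge graph would more likely live in a graph database; the adjacency-list form here is chosen only to keep the path-following idea visible.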
7.2. Visualization and Analysis. Finally, 3878 relationships are extracted and stored. Each asset, as an entity, has communication relations with other assets in the network layout. Each specific asset node matches at least one piece of asset equipment; through the brands, models, or components used by the asset equipment, the corresponding vulnerability information can be connected with the asset. A part of the relations of asset node SCADA workstation is shown in Figure 12.

The versions, components, and vulnerabilities of the WAGO PFC200 series of products used by the PLC in EPIC can be seen in Figure 13. The correlation between different vulnerabilities is defined, such as the correlation between vulnerabilities from CVE and CWE, which enables the network analysis to locate the source code faster and more accurately.

As shown in Figure 14, the CVSS score can quantify the vulnerability threat level; information such as vulnerability solutions, patch links, and security recommendations is structurally related to the corresponding vulnerability, which can help to troubleshoot equipment failures and strengthen security status. The asset vulnerability corresponding to the vulnerability, such as the port number used, is associated with the exploit relationship.

Figure 13: Part of relations of WAGO PFC200.

Figure 14: Example of vulnerability information.

The preliminary construction of the EPIC industrial control network security knowledge graph not only facilitates daily management, daily maintenance, and network security analysis but also supports the completion of downstream tasks of the knowledge graph. The knowledge expression form in the knowledge graph is simple, intuitive, flexible, and rich. Based on the existing knowledge graph structure, we can strengthen the industrial control network security defense at a deeper level and make network security defense research more diversified. Further, through knowledge reasoning, we can link to hidden entities and predict new relationships. This helps discover new attack behaviors and improve the richness and accuracy of the knowledge graph. The mining of entities and relationships offers constant
supplement for the existing knowledge graph and supports decision-making, enhancing the active defense capability of industrial control network security.

8. Conclusions

In this paper, we propose a novel data-driven industrial network security defense framework, which structures fragmented multisource data and integrates these threat data with the industrial network structure. In order to better mine entity relations in cybersecurity data, we introduce a novel distant supervised cybersecurity relation extraction model ResPCNN-ATT. The experimental results show that the model proposed in this paper has the highest accuracy of relation extraction compared with other model methods on cybersecurity datasets. Further, based on specific industrial control network security scenarios, we constructed an ICS security knowledge graph by applying ResPCNN-ATT, which strengthens the cybersecurity analysis capabilities. In the future, we intend to introduce reinforcement learning to the model to further reduce the impact of noise and study the downstream application tasks of the industrial control network security knowledge graph to strengthen the industrial control network security defense capabilities.

Data Availability

All the data used to support this study were supplied by Guowei Shen under license and so cannot be made freely available. Requests for access to these data should be made to Guowei Shen (gwshen@gzu.edu.cn).

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant 61802081 and Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory Open Fund Project (No.W-2018023).