A Pretrained Language Model For Cyber Threat Intelligence
batch size of 2,048, a learning rate of 5e-4 with learning-rate warm-up over 10,000 steps, and weight decay of 0.01. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 1e-6.
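These pretraining hyper-parameters map directly onto a standard HuggingFace Trainer configuration. The sketch below is purely illustrative and not the authors' training script; the output path, per-device batch size, accumulation factor, and total step count are assumptions chosen only to reproduce the stated effective batch size of 2,048.

```python
# Illustrative pretraining setup mirroring the reported hyper-parameters:
# effective batch size 2,048, peak LR 5e-4, 10,000 warm-up steps, weight decay 0.01,
# Adam with beta1=0.9, beta2=0.98, eps=1e-6. Paths and step counts are placeholders.
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

model = BertForMaskedLM(BertConfig())          # BERT-base architecture, trained from scratch

args = TrainingArguments(
    output_dir="cti-bert-pretrain",            # hypothetical output path
    per_device_train_batch_size=64,
    gradient_accumulation_steps=32,            # 64 * 32 = 2,048 effective batch size
    learning_rate=5e-4,
    warmup_steps=10_000,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    max_steps=100_000,                         # placeholder; not the paper's reported budget
)

# train_dataset would be the tokenized security corpus with an MLM data collator.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```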
4 Cybersecurity Applications

We evaluate CTI-BERT on several security NLP applications and compare its results with both general-domain models and other cybersecurity-domain models. The baseline models are bert-base-uncased and SecBERT (BERT models), and roberta-base, SecRoBERTa and SecureBERT (RoBERTa models). All the baseline models are downloaded from HuggingFace.

The downstream applications can be categorized into sentence-level classification tasks and token-level classification tasks. The goal of the experiments is to compare different pretrained models rather than to optimize the classification models for individual tasks. Thus, we use the same model architecture and hyper-parameters to fine-tune models for all sub-tasks in each application category.

4.1 Masked Word Prediction

First, we conduct the masked-token prediction task to measure how well the models understand the domain knowledge. To ensure that the test sentences are not in the training data, we use five headlines from security news published in January and February 2023^7. Table 2 shows the test sentences and the models' predictions. For each sentence, we conduct the masked-token prediction twice with different masked words; the upper line shows the predictions for <mask>1, and the lower line shows the predictions for <mask>2.

The results clearly show that CTI-BERT performs very well in this test; its predictions are either the same words (boldfaced) or synonyms (italicized). Note that CTI-BERT produces RAT for "PlugX <mask>", which is a more specific term than the masked word ('malware'); RAT (Remote Access Trojan) is the malware family that PlugX belongs to. In contrast, both SecBERT and SecRoBERTa do not perform well on this test, even though they were trained on security text. Interestingly, roberta-base performs better than these models and bert-base-uncased.
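The probe in Table 2 amounts to a standard top-1 fill-mask query. A minimal sketch of how such predictions can be obtained with the HuggingFace pipeline is shown below; it is not the authors' evaluation code, and the model name is just a stand-in for whichever checkpoint is being tested.

```python
# Illustrative top-1 masked-word prediction, as in Table 2.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")   # swap in the checkpoint under test

sentence = ("New Mirai " + fill.tokenizer.mask_token +
            " variant infects Linux devices to build DDoS botnet.")
best = fill(sentence, top_k=1)[0]
print(best["token_str"], round(best["score"], 3))
```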
4.2 Sentence Classification Tasks

For sentence- or document-level classification, we add a classification head on top of the pretrained language models, with one hidden layer and one output projection layer connected by a tanh activation; the head takes the average of the last hidden states of all tokens in the sentence as its input. We fine-tune the pretrained models together with the randomly initialized classification layers, using 1,000 warm-up steps, with the learning rate varied according to the formula in Vaswani et al. (2017). We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and weight decay of 0.01. All models are trained for 50 epochs with a batch size of 16 and a learning rate of 2e-5.
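As a concrete reading of the setup above, the following PyTorch sketch shows one plausible implementation of the described head (mean pooling over the last hidden states, a hidden layer, tanh, and an output projection) together with the learning-rate formula of Vaswani et al. (2017). The hidden size and warm-up constant reflect BERT-base and the 1,000 warm-up steps mentioned above, but the code itself is an assumption, not the authors' implementation.

```python
# One plausible rendering of the described classification head: mean-pool the last
# hidden states over all tokens, then hidden layer -> tanh -> output projection.
# hidden_size=768 assumes a BERT-base encoder; this is not the authors' exact code.
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                              # pretrained BERT/RoBERTa encoder
        self.dense = nn.Linear(hidden_size, hidden_size)    # hidden layer
        self.out_proj = nn.Linear(hidden_size, num_labels)  # output projection

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.out_proj(torch.tanh(self.dense(pooled)))

def vaswani_lr(step, d_model=768, warmup=1000):
    """Learning-rate formula from Vaswani et al. (2017):
    lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```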
For the evaluation, we train five models with five different seeds (42, 142, 242, 342, and 442) for each task and report the mean micro and macro F1 scores (Mean) and the standard deviation (Std.) over the five models.
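This protocol can be summarized in a few lines of scikit-learn; the sketch below is illustrative, and the evaluate_with_seed helper is hypothetical shorthand for fine-tuning and predicting with a given seed.

```python
# Sketch of the evaluation protocol: micro/macro F1 averaged over five runs.
# evaluate_with_seed is a hypothetical helper that fine-tunes with the given seed
# and returns gold labels and predictions on the test split.
import numpy as np
from sklearn.metrics import f1_score

micro, macro = [], []
for seed in (42, 142, 242, 342, 442):
    y_true, y_pred = evaluate_with_seed(seed)
    micro.append(f1_score(y_true, y_pred, average="micro"))
    macro.append(f1_score(y_true, y_pred, average="macro"))

print(f"Micro-F1 {np.mean(micro):.2f} (Std. {np.std(micro):.2f})")
print(f"Macro-F1 {np.mean(macro):.2f} (Std. {np.std(macro):.2f})")
```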
4.2.1 ATT&CK Technique Classification

The key knowledge SOC analysts look for in CTI reports is information about malware behavior and the adversary's tactics and techniques. The MITRE ATT&CK framework^8 offers a knowledge base of these adversary tactics and techniques, which has been used as a foundation for the threat models and methodologies in many security products.

To facilitate research on identifying ATT&CK techniques in prose-based CTI reports, MITRE created TRAM^9, a dataset containing sentences from CTI reports labeled with ATT&CK techniques. We observe that TRAM contains duplicate sentences across the splits. We remove the duplicates and keep only the classes with at least one sentence in each of the train, development and test splits. The cleaned dataset contains 1,491 sentences, 166,284 tokens, and 73 distinct classes; more detailed statistics of the dataset are shown in Table 15 in the Appendix. Note that this dataset is very sparse and imbalanced. Table 3 shows the results of the six models for this task. As we can see, CTI-BERT outperforms all other models by a large margin.
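The TRAM cleanup described above (dropping cross-split duplicates and classes missing from any split) could look roughly like the following pandas sketch; the file name and the column names are hypothetical, not TRAM's actual schema.

```python
# Sketch of the described TRAM cleanup: drop sentences that occur in more than one
# split, then keep only classes with at least one sentence in train, dev and test.
# The file name and the "sentence"/"label"/"split" columns are hypothetical.
import pandas as pd

df = pd.read_csv("tram_all_splits.csv")

splits_per_sentence = df.groupby("sentence")["split"].nunique()
duplicated = splits_per_sentence[splits_per_sentence > 1].index
df = df[~df["sentence"].isin(duplicated)]          # remove cross-split duplicates

splits_per_label = df.groupby("label")["split"].nunique()
kept_labels = splits_per_label[splits_per_label == 3].index
df = df[df["label"].isin(kept_labels)]             # class must appear in all three splits
```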
^7 [Link]
^8 [Link]
^9 [Link] defense/tram
Masked Sentence | bert-base-uncased | SecBERT | CTI-BERT | roberta-base | SecRoBERTa | SecureBERT

"New Mirai <malware>1 variant infects Linux devices to build DDoS <botnet>2."
  <mask>1: linux | . | malware | worm | this | malware
  <mask>2: attacks | attacks | botnets | attacks | commands | botnets

"The <Colonial>1 Pipeline incident is one of the most infamous <ransomware>2 attacks"
  <mask>1: oil | it | colonial | Pegasus | the | Olympic
  <mask>2: pipeline | targeted | ransomware | terrorist | cyber | cyber

"New stealthy Beep malware focuses heavily on <evading>1 <detection>2"
  <mask>1: intrusion | antivirus | evading | stealth | antivirus | sandbox
  <mask>2: . | 2009 | detection | detection | . | detection

"Microsoft Exchange ProxyShell <flaws>1 is <exploited>2 in new crypto-mining attack"
  <mask>1: previously | vulnerability | vulnerability | Key | Service
  <mask>2: resulting | resulting | exploited | exploited | eavesdrop | used

"PlugX <malware>1 hides on USB devices to <infect>2 new Windows hosts"
  <mask>1: also | is | rat | 11 | silently | malware
  <mask>2: create | open | infect | infect | communicate | infect

Table 2: Masked Word Prediction (top-1). The actual words, instead of <mask>, are shown for reference.
Table 3: ATT&CK Technique Classification Results
Table 5: Malware Sentence Classification Results
4.2.2 IoT App Description Classification

IoTSpotter is a tool for automatically identifying mobile-IoT (Internet of Things) apps, IoT-specific libraries, and potential vulnerabilities in the IoT apps (Jin et al., 2022). The authors created a dataset containing the descriptions of 7,237 mobile apps, which are labeled as mobile-IoT apps vs. non-IoT apps with a distribution of approximately 45% and 55%, respectively. They removed stopwords and put all remaining tokens in the description together, ignoring sentence boundaries. We use the datasets^10 without any further processing. The data statistics are shown in Table 16 in the Appendix. The models' classification results are shown in Table 4.

Model | Micro-F1 Mean | Micro-F1 Std. | Macro-F1 Mean | Macro-F1 Std.
bert-base-uncased | 95.78 | 0.04 | 95.70 | 0.05
SecBERT | 94.22 | 0.21 | 94.12 | 0.21
CTI-BERT | 96.40 | 0.26 | 96.33 | 0.26
roberta-base | 95.88 | 0.26 | 95.82 | 0.26
SecRoBERTa | 94.59 | 0.39 | 94.48 | 0.40
SecureBERT | 96.27 | 0.13 | 96.19 | 0.13
Table 4: Performance for IoT App Classification

4.2.3 Malware Sentence Detection

The next two tasks, malware sentence detection and malware attribute classification, are borrowed from SemEval-2018 Task 8, which consisted of four subtasks to measure NLP capabilities for cybersecurity reports (Phandi et al., 2018). The task provided 12,918 annotated sentences extracted from 85 APT reports, based on the MalwareTextDB work (Lim et al., 2017).

The first sub-task is to build models that extract sentences about malware. The dataset is biased, with the ratios of malware and non-malware sentences being 21% and 79%, respectively, as shown in Table 17 in the Appendix. The results are listed in Table 5, which shows that CTI-BERT and SecRoBERTa perform well on this task.

4.2.4 Malware Attribute Classification

This task classifies sentences into the malware attribute categories defined in the MAEC (Malware Attribute Enumeration and Characterization) vocabulary^11. MAEC defines the malware attributes in a 2-level hierarchy with four high-level attribute types (ActionName, Capability, StrategicObjectives and TacticalObjectives) and 444 low-level types. This sub-task was conducted by building models for each of the four high-level attributes. Table 23 in the Appendix shows more details of this dataset for the four high-level attributes. As we can see, the datasets are very sparse with a large number of classes.

^10 [Link] M/IoTSpotter/tree/main/data/dataset
^11 [Link]
Tables 6–9 show the classification results for the four malware attribute types. We can see that CTI-BERT performs well, being the best or second-best model, for all four attribute types.

4.3 Token Classification Tasks

Here, we compare the models' effectiveness for token-level classification using two security-domain NER tasks and a token type detection task. We use the standard sequence-tagging setup and add one dense layer as the classification layer on top of the pretrained language models. The classification layer assigns each token a label using the BIO tagging scheme. Our system is implemented in PyTorch using HuggingFace's transformers (Wolf et al., 2019). The training data is randomly shuffled, and a batch size of 16 is used with post-padding. We set the maximum sequence length to 256 and use cross-entropy loss for model optimization with a learning rate of 2e-5. All other training parameters were set to the default values in transformers. Similarly to the sentence classification tasks, we train five models for each task with the same five seeds for 50 epochs and compare the average mention-level precision, recall and F1-score.
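For concreteness, the sequence-tagging setup described above corresponds to a standard HuggingFace token-classification model with BIO labels and a mention-level evaluation such as seqeval; the sketch below is an assumed rendering with a placeholder label set, not the authors' exact system.

```python
# Assumed rendering of the sequence-tagging setup: a token-classification head on a
# pretrained encoder, BIO labels, max length 256, and mention-level scores via seqeval.
# The label set and example are placeholders, not the NER1/NER2 schemas.
from transformers import AutoModelForTokenClassification, AutoTokenizer
from seqeval.metrics import f1_score, precision_score, recall_score

labels = ["O", "B-Malware", "I-Malware"]           # placeholder BIO label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # swap in CTI-BERT etc.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

enc = tokenizer("PlugX hides on USB devices", truncation=True,
                max_length=256, return_tensors="pt")
logits = model(**enc).logits                       # (1, seq_len, num_labels), trained with cross-entropy

# Mention-level precision/recall/F1 compare gold and predicted BIO spans.
gold = [["B-Malware", "O", "O", "O", "O"]]
pred = [["B-Malware", "O", "O", "O", "O"]]
print(precision_score(gold, pred), recall_score(gold, pred), f1_score(gold, pred))
```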
4.3.1 NER1: Coarse-grained Security Entities

Cybersecurity entities have very distinct characteristics, and many of them are out-of-vocabulary terms. Here, we investigate whether domain-specific language models can alleviate this vocabulary gap. We collected 967 CTI reports on malware and vulnerabilities. The documents are labeled with the 8 entity types defined in STIX (Structured Threat Information Expression)^12, which is a standard framework for cyber intelligence exchange. The 8 types are Campaign (names of cyber campaigns), CourseOfAction (tools or actions to take to deter cyber attacks), ExploitTarget (vulnerabilities targeted for exploitation), Identity (individuals, groups or organizations involved in attacks), Indicator (objects used to detect suspicious or malicious cyber activity), Malware (malicious code used in cyber crimes), Resource (tools used for cyber attacks), and ThreatActor (individuals or groups that commit cyber crimes). The size of the dataset and detailed statistics of the entity types in the corpus are shown in Table 18 and Table 19 in the Appendix. Table 10 shows the NER results using the mention-level micro average scores.

4.3.2 NER2: Fine-grained Security Entities

We note that some STIX entity types (especially Indicator) are very broad, containing many different sub-types, and are thus difficult to use directly in automatic threat investigation applications. We redesigned the type system into 16 types by dividing broad categories into their subcategories and annotated the test dataset from the NER1 task. We then split the dataset into a [Link] ratio for the train, dev and test sets. Table 20 and Table 21 in the Appendix show the statistics of this dataset. The NER results in Table 11 show that most models perform better for the finer-grained types, and especially that CTI-BERT outperforms all other models by a large margin.

4.3.3 Token Type Classification

The token type detection task is sub-task 2 from SemEval-2018 Task 8, which aims to classify tokens into Entity, Action, Modifier, and Other categories. Action refers to an event. Entity refers to the initiator of the Action (i.e., the subject) or the recipient of the Action (i.e., the object). Modifier refers to tokens that provide elaboration on the Action. All other tokens are assigned to Other. More details on the dataset are shown in Table 22 in the Appendix.

Even though the categories are not semantic types as in NER, this task can also be solved as a token sequence tagging problem, and we thus apply the same system used for the NER tasks. The classification results are shown in Table 12. Overall, the models do not perform very well, likely because the mentions are long and semantically heterogeneous. The results show that the BERT-based models perform better than the RoBERTa-based models.

^12 [Link]
Model | Micro-F1 Mean | Micro-F1 Std. | Macro-F1 Mean | Macro-F1 Std.
bert-base-uncased | 38.79 | 19.68 | 30.37 | 15.79
SecBERT | 43.64 | 3.09 | 33.25 | 2.97
CTI-BERT | 55.76 | 4.92 | 43.37 | 4.92
roberta-base | 56.36 | 4.11 | 44.04 | 3.41
SecRoBERTa | 40.00 | 2.27 | 29.03 | 2.39
SecureBERT | 52.12 | 2.97 | 39.97 | 3.32
Table 6: Performance for ActionName attributes

Model | Micro-F1 Mean | Micro-F1 Std. | Macro-F1 Mean | Macro-F1 Std.
bert-base-uncased | 60.68 | 3.91 | 51.51 | 5.41
SecBERT | 53.18 | 1.82 | 43.39 | 1.9
CTI-BERT | 60.91 | 2.34 | 52.23 | 4.39
roberta-base | 59.77 | 3.71 | 50.86 | 3.80
SecRoBERTa | 46.82 | 1.96 | 37.70 | 4.26
SecureBERT | 61.59 | 2.73 | 54.12 | 4.66
Table 7: Performance for Capability attributes

Table 8: Performance for StrategicObjective attributes
Table 9: Performance for TacticalObjective attributes

Model | Precision | Recall | F1
bert-base-uncased | 72.04 | 68.67 | 70.31
SecBERT | 69.74 | 63.98 | 66.73
CTI-BERT | 75.63 | 75.88 | 75.75
roberta-base | 72.52 | 68.99 | 70.70
SecRoBERTa | 68.00 | 59.46 | 63.44
SecureBERT | 73.47 | 72.51 | 72.99
Table 10: NER1 Results (mention-level micro average)

Model | Precision | Recall | F1
bert-base-uncased | 73.44 | 68.23 | 70.73
SecBERT | 68.58 | 60.90 | 64.43
CTI-BERT | 83.35 | 78.62 | 80.91
roberta-base | 72.17 | 73.51 | 72.80
SecRoBERTa | 71.91 | 55.01 | 62.34
SecureBERT | 76.66 | 75.98 | 76.30
Table 11: NER2 Results (mention-level micro average)

Model | Precision | Recall | F1
bert-base-uncased | 22.97 | 44.51 | 30.27
SecBERT | 21.63 | 36.20 | 27.02
CTI-BERT | 22.67 | 47.77 | 30.70
roberta-base | 15.05 | 17.44 | 15.97
SecRoBERTa | 14.18 | 20.71 | 16.81
SecureBERT | 22.58 | 46.97 | 30.46
Table 12: Token Type Classification Results (mention-level micro average)

5 Related Work

Motivated by the successes of large-scale foundation models on many general-domain NLP tasks, several domain-specific language models have been developed (Roy et al., 2017, 2019; Mumtaz et al., 2020). In the scientific and bio-medical domains, there are SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019), ClinicalBERT (Huang et al., 2019), BioBERT (Lee et al., 2020) and PubMedBERT (Gu et al., 2022). In the political and legal domains, there are ConfliBERT (Hu et al., 2022) and LegalBERT (Chalkidis et al., 2020). These domain models have been shown to improve the performance of downstream applications in their domains.

There have been several attempts to construct language models for the cybersecurity domain. Roy et al. (2017, 2019) propose techniques to efficiently learn domain-specific language models with a small in-domain corpus by incorporating external domain knowledge. They train Word2Vec models using malware descriptions. Similarly, Mumtaz et al. (2020) train a Word2Vec model using security vulnerability-related bulletins and Wikipedia pages.

Recently, transformer-based models have been built for the cybersecurity domain: CyBERT (Priyanka Ranade and Finin, 2021), SecBERT (jackaduma, 2022) and SecureBERT (Aghaei et al., 2023). CyBERT is trained with a relatively small corpus consisting of 500 security blogs, 16,000 CVE records, and the APTnotes collection. Further, CyBERT applies continual pretraining and uses the BERT model's vocabulary after adding the 1,000 most frequent words in their corpus that do not exist in the base vocabulary. SecBERT provides both BERT and RoBERTa models trained on a security corpus consisting of APTnotes, the SemEval-2018 Task 8 dataset and Stucco-Data^13, which contains security blogs and reports; however, details about the data and experimental results are not available. SecureBERT trains a RoBERTa model using security reports, white papers, academic books, etc., which are similar to our dataset both in terms of size and document type.

^13 [Link]
However, the model is built using the continual-pretraining method, while CTI-BERT is trained from scratch. We believe that the main difference comes from CTI-BERT being trained from scratch and having a vocabulary specialized to the domain, compared to the extended vocabulary used in CyBERT and SecureBERT. Table 14 compares the different training strategies used for these models.

6 Conclusion

We presented a new pretrained BERT model tailored to the cybersecurity domain. Specifically, we designed the model to improve the accuracy of cyber-threat intelligence extraction and understanding, such as security entity (IoC) extraction and attack technique (TTP) classification. As demonstrated by the experiments in Section 4, our model outperforms existing general-domain and other cybersecurity-domain models with the same base architecture. For future work, we plan to collect more documents to improve the model and also to train other language models to support different security applications.

Limitations

The model is pretrained using only English data. While the majority of cybersecurity-related information is distributed in English, we consider adding support for multiple languages in future work. Further, while we demonstrate that CTI-BERT outperforms other security-specific LMs on a variety of tasks, the benchmark datasets are relatively small. Thus, the findings may not be conclusive, and further evaluations with more data are needed.

Ethical Considerations

To our knowledge, this research poses very low ethical risk. All datasets were collected from reputable sources, which are publicly available. The only personal information in our corpus is the authors' names and their affiliations in the USENIX Security proceedings. However, we do not expose their identities nor use this information in this work.

References

Ehsan Aghaei, Xi Niu, Waseem Shadid, and Ehab Al-Shaer. 2023. SecureBERT: A Domain-Specific Language Model for Cybersecurity, pages 39–56.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The muppets straight out of law school. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904. Association for Computational Linguistics.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2022. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Heal., 3(1):2:1–2:23.

Yibo Hu, MohammadSaleh Hosseini, Erick Skorupa Parolin, Javier Osorio, Latifur Khan, Patrick Brandt, and Vito D'Orazio. 2022. ConfliBERT: A pre-trained language model for political conflict and violence. In The Conference of the North American Chapter of the Association for Computational Linguistics NAACL.

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. CoRR.

jackaduma. 2022. SecBERT. [Link]

Xin Jin, Sunil Manandhar, Kaushal Kafle, Zhiqiang Lin, and Adwait Nadkarni. 2022. Understanding IoT security from a market-scale perspective. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS, pages 1615–1629. ACM.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Swee Kiat Lim, Aldrian Obaja Muis, Wei Lu, and Ong Chen Hui. 2017. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL, pages 1557–1567.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Sara Mumtaz, Carlos Rodriguez, Boualem Benatallah, Mortada Al-Banna, and Shayan Zamanirad. 2020. Learning word representation for the cyber security vulnerability domain. In International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, BioNLP@ACL, pages 58–65.

Peter Phandi, Amila Silva, and Wei Lu. 2018. SemEval-2018 task 8: Semantic extraction from cybersecurity reports using natural language processing (SecureNLP). In Proceedings of The 12th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2018, pages 697–706.
A Details on Model Training

Term | CTI-BERT | bert-base-uncased
apt* | apt, apt1, apt10, apt28, apt29, apt41, apts | apt
backdoor* | backdoor, backdoored, backdoors | –
*bot | abbot, agobot, bot, gaobot, ircbot, ourbot, qakbot, qbot, rbot, robot, sabot, sdbot, spybot, syzbot, trickbot, zbot | abbot, bot, robot, talbot
*crime* | crime, crimes, crimeware, cybercrime | crime, crimea, crimean, crimes
crypto* | crypto, cryptoc, cryptocurr, cryptocurrencies, cryptocurrency, cryptograph, cryptographers, cryptographic, cryptographically, cryptography, cryptojacking, cryptol, cryptolocker, cryptology, cryptom, cryptomining, cryptosystem, cryptosystems, cryptow, cryptowall | –
cyber* | cyber, cyberark, cyberattack, cyberattackers, cyberattacks, cyberb, cybercri, cybercrime, cybercrimes, cybercriminal, cybercriminals, cyberdefense, cybere, cybereason, cyberespionage, cybers, cybersec, cybersecurity, cyberspace, cyberthre, cyberthreat, cyberthreats, cyberwar, cyberwarfare, cyberweap | cyber
dark* | dark, darknet, darkreading, darks, darkside | dark, darkened, darkening, darker, darkest, darkly, darkness
hijack* | hijack, hijacked, hijacker, hijackers, hijacking, hijacks | –
key* | key, keybase, keyboard, keyboards, keychain, keyctl, keyed, keygen, keying, keylog, keylogger, keyloggers, keylogging, keynote, keypad, keyring, keyrings, keys, keyspan, keyst, keystone, keystore, keystream, keystro, keystroke, keystrokes, keytouch, keyword, keywords | key, keyboard, keyboardist, keyboards, keynes, keynote, keys, keystone
*kit | applewebkit, bootkit, kit, rootkit, toolkit, webkit | bukit, kit
malware* | malware, malwarebytes, malwares | –
*net | botnet, cabinet, cnet, darknet, dotnet, ethernet, fortinet, genet, honeynet, inet, internet, intranet, kennet, kinet, kuznet, magnet, monet, net, phonet, planet, stuxnet, subnet, technet, telnet, vnet, x9cinternet, zdnet | barnet, baronet, bonnet, cabinet, clarinet, ethernet, hornet, internet, janet, magnet, net, planet
trojan* | trojan, trojanized, trojans | trojan
*virus* | antivirus, coronavirus, virus, viruses, virusscan, virustotal | virus, viruses
web* | web, webapp, webapps, webassembly, webc, webcam, webcams, webcast, webcasts, webclient, webcore, webd, webdav, webex, webgl, webhook, webin, webinar, webkit, webkitbuild, webkitgtk, weblog, weblogic, webm, webmail, webmaster, webpage, webpages, webresources, webroot, webrtc, webs, websense, webserver, webshell, website, websites, websocket, webspace, websphere, webtools, webview | web, webb, webber, weber, website, websites, webster
*ware | adware, antimalware, aware, beware, coveware, crimeware, delaware, designware, firmware, foxitsoftware, freeware, hardware, malware, middleware, radware, ransomware, scareware, shareware, slackware, software, spyware, unaware, vmware, ware, x9cmalware | aware, delaware, hardware, software, unaware, ware
B Details on Experiment Datasets

 | Train | Dev. | Test | Total
# Sentences | 754 | 355 | 382 | 1,491
# Tokens | 138,721 | 19,578 | 7,985 | 166,284
Table 15: Summary of TRAM Data

 | Train | Dev | Test | Total
# Documents | 5,214 | 1,058 | 965 | 7,237
# Tokens | 635,220 | 133,546 | 106,084 | 874,850
Table 16: Summary of IoTSpotter Data

 | Train | Dev. | Test | Total
# Sentences | 9,424 | 1,213 | 618 | 11,255
# Tokens | 1,020,655 | 146,362 | 56,216 | 1,223,233

Entity Type | Train | Dev | Test | Total
Campaign | 39 | 0 | 4 | 43
SecurityAdvisory | 54 | 12 | 30 | 96
Vulnerability | 401 | 50 | 86 | 537
DomainName | 169 | 3 | 16 | 188
EmailAddress | 6 | 1 | 1 | 8
Endpoint | 3 | 0 | 0 | 3
FileName | 210 | 37 | 24 | 271
Hash | 93 | 5 | 3 | 101
IpAddress | 37 | 0 | 2 | 39
Network | 3 | 0 | 0 | 3
URL | 181 | 20 | 27 | 228
WindowsRegistry | 9 | 0 | 0 | 9
AvSignature | 99 | 13 | 10 | 122
MalwareFamily | 554 | 53 | 47 | 654
Technique | 334 | 39 | 76 | 449
ThreatActor | 89 | 4 | 7 | 100
Table 21: Entity Types and Distributions in the NER2 Dataset