A Pretrained Language Model For Cyber Threat Intelligence
batch size of 2,048, a learning rate of 5e-4 with learning-rate warm-up over 10,000 steps, and weight decay of 0.01. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, ϵ = 1e-6.
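These pretraining hyper-parameters map directly onto a standard HuggingFace Trainer configuration. The sketch below is purely illustrative and not the authors' training script; the output path, per-device batch size, accumulation factor, and total step count are assumptions chosen only to reproduce the stated effective batch size of 2,048.

```python
# Illustrative pretraining setup mirroring the reported hyper-parameters:
# effective batch size 2,048, peak LR 5e-4, 10,000 warm-up steps, weight decay 0.01,
# Adam with beta1=0.9, beta2=0.98, eps=1e-6. Paths and step counts are placeholders.
from transformers import BertConfig, BertForMaskedLM, Trainer, TrainingArguments

model = BertForMaskedLM(BertConfig())          # BERT-base architecture, trained from scratch

args = TrainingArguments(
    output_dir="cti-bert-pretrain",            # hypothetical output path
    per_device_train_batch_size=64,
    gradient_accumulation_steps=32,            # 64 * 32 = 2,048 effective batch size
    learning_rate=5e-4,
    warmup_steps=10_000,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    max_steps=100_000,                         # placeholder; not the paper's reported budget
)

# train_dataset would be the tokenized security corpus with an MLM data collator.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```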
4 Cybersecurity Applications

We evaluate CTI-BERT on several security NLP applications and compare its results with both general-domain models and other cybersecurity-domain models. The baseline models are bert-base-uncased and SecBERT (BERT models), and roberta-base, SecRoBERTa and SecureBERT (RoBERTa models). All the baseline models are downloaded from HuggingFace.

The downstream applications can be categorized into sentence-level classification tasks and token-level classification tasks. The goal of the experiments is to compare different pretrained models rather than to optimize the classification models for individual tasks. Thus, we use the same model architecture and hyper-parameters to fine-tune models for all sub-tasks in each application category.

4.1 Masked Word Prediction

First, we conduct the masked-token prediction task to measure how well the models understand the domain knowledge. To ensure that the test sentences are not in the training data, we use five headlines from security news published in January and February 2023^7. Table 2 shows the test sentences and the models' predictions. For each sentence, we conduct the masked-token prediction twice with different masked words; the upper line shows the predictions for <mask>1, and the lower line shows the predictions for <mask>2.

The results clearly show that CTI-BERT performs very well in this test; its predictions are either the same words (boldfaced) or synonyms (italicized). Note that CTI-BERT produces RAT for "PlugX <mask>", which is a more specific term than the masked word ('malware'); RAT (Remote Access Trojan) is the malware family that PlugX belongs to. In contrast, both SecBERT and SecRoBERTa do not perform well on this test, even though they were trained on security text. Interestingly, roberta-base performs better than these models and bert-base-uncased.
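The probe in Table 2 amounts to a standard top-1 fill-mask query. A minimal sketch of how such predictions can be obtained with the HuggingFace pipeline is shown below; it is not the authors' evaluation code, and the model name is just a stand-in for whichever checkpoint is being tested.

```python
# Illustrative top-1 masked-word prediction, as in Table 2.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")   # swap in the checkpoint under test

sentence = ("New Mirai " + fill.tokenizer.mask_token +
            " variant infects Linux devices to build DDoS botnet.")
best = fill(sentence, top_k=1)[0]
print(best["token_str"], round(best["score"], 3))
```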
4.2 Sentence Classification Tasks

For sentence- or document-level classification, we add a classification head on top of the pretrained language models, with one hidden layer and one output projection layer connected by a tanh activation; the head takes the average of the last hidden states of all tokens in the sentence as its input. We fine-tune the pretrained models together with the randomly initialized classification layers, using 1,000 warm-up steps, with the learning rate varied according to the formula in Vaswani et al. (2017). We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and weight decay of 0.01. All models are trained for 50 epochs with a batch size of 16 and a learning rate of 2e-5.
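As a concrete reading of the setup above, the following PyTorch sketch shows one plausible implementation of the described head (mean pooling over the last hidden states, a hidden layer, tanh, and an output projection) together with the learning-rate formula of Vaswani et al. (2017). The hidden size and warm-up constant reflect BERT-base and the 1,000 warm-up steps mentioned above, but the code itself is an assumption, not the authors' implementation.

```python
# One plausible rendering of the described classification head: mean-pool the last
# hidden states over all tokens, then hidden layer -> tanh -> output projection.
# hidden_size=768 assumes a BERT-base encoder; this is not the authors' exact code.
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                              # pretrained BERT/RoBERTa encoder
        self.dense = nn.Linear(hidden_size, hidden_size)    # hidden layer
        self.out_proj = nn.Linear(hidden_size, num_labels)  # output projection

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.out_proj(torch.tanh(self.dense(pooled)))

def vaswani_lr(step, d_model=768, warmup=1000):
    """Learning-rate formula from Vaswani et al. (2017):
    lr(step) = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```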
For the evaluation, we train five models with five different seeds (42, 142, 242, 342, and 442) for each task and report the mean micro and macro F1 scores (Mean) and the standard deviation (Std.) over the five models.
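This protocol can be summarized in a few lines of scikit-learn; the sketch below is illustrative, and the evaluate_with_seed helper is hypothetical shorthand for fine-tuning and predicting with a given seed.

```python
# Sketch of the evaluation protocol: micro/macro F1 averaged over five runs.
# evaluate_with_seed is a hypothetical helper that fine-tunes with the given seed
# and returns gold labels and predictions on the test split.
import numpy as np
from sklearn.metrics import f1_score

micro, macro = [], []
for seed in (42, 142, 242, 342, 442):
    y_true, y_pred = evaluate_with_seed(seed)
    micro.append(f1_score(y_true, y_pred, average="micro"))
    macro.append(f1_score(y_true, y_pred, average="macro"))

print(f"Micro-F1 {np.mean(micro):.2f} (Std. {np.std(micro):.2f})")
print(f"Macro-F1 {np.mean(macro):.2f} (Std. {np.std(macro):.2f})")
```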
4.2.1 ATT&CK Technique Classification

The key knowledge SOC analysts look for in CTI reports is information about malware behavior and the adversary's tactics and techniques. The MITRE ATT&CK framework^8 offers a knowledge base of these adversary tactics and techniques, which has been used as a foundation for the threat models and methodologies in many security products.

To facilitate research on identifying ATT&CK techniques in prose-based CTI reports, MITRE created TRAM^9, a dataset containing sentences from CTI reports labeled with ATT&CK techniques. We observe that TRAM contains duplicate sentences across the splits. We remove the duplicates and keep only the classes with at least one sentence in each of the train, development and test splits. The cleaned dataset contains 1,491 sentences, 166,284 tokens, and 73 distinct classes; more detailed statistics of the dataset are shown in Table 15 in the Appendix. Note that this dataset is very sparse and imbalanced. Table 3 shows the results of the six models for this task. As we can see, CTI-BERT outperforms all other models by a large margin.
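The TRAM cleanup described above (dropping cross-split duplicates and classes missing from any split) could look roughly like the following pandas sketch; the file name and the column names are hypothetical, not TRAM's actual schema.

```python
# Sketch of the described TRAM cleanup: drop sentences that occur in more than one
# split, then keep only classes with at least one sentence in train, dev and test.
# The file name and the "sentence"/"label"/"split" columns are hypothetical.
import pandas as pd

df = pd.read_csv("tram_all_splits.csv")

splits_per_sentence = df.groupby("sentence")["split"].nunique()
duplicated = splits_per_sentence[splits_per_sentence > 1].index
df = df[~df["sentence"].isin(duplicated)]          # remove cross-split duplicates

splits_per_label = df.groupby("label")["split"].nunique()
kept_labels = splits_per_label[splits_per_label == 3].index
df = df[df["label"].isin(kept_labels)]             # class must appear in all three splits
```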
^7 [Link]
^8 [Link]
^9 [Link] defense/tram
Masked Sentence | bert-base-uncased | SecBERT | CTI-BERT | roberta-base | SecRoBERTa | SecureBERT

"New Mirai <malware>1 variant infects Linux devices to build DDoS <botnet>2."
  <mask>1: linux | . | malware | worm | this | malware
  <mask>2: attacks | attacks | botnets | attacks | commands | botnets

"The <Colonial>1 Pipeline incident is one of the most infamous <ransomware>2 attacks"
  <mask>1: oil | it | colonial | Pegasus | the | Olympic
  <mask>2: pipeline | targeted | ransomware | terrorist | cyber | cyber

"New stealthy Beep malware focuses heavily on <evading>1 <detection>2"
  <mask>1: intrusion | antivirus | evading | stealth | antivirus | sandbox
  <mask>2: . | 2009 | detection | detection | . | detection

"Microsoft Exchange ProxyShell <flaws>1 is <exploited>2 in new crypto-mining attack"
  <mask>1: previously | vulnerability | vulnerability | Key | Service
  <mask>2: resulting | resulting | exploited | exploited | eavesdrop | used

"PlugX <malware>1 hides on USB devices to <infect>2 new Windows hosts"
  <mask>1: also | is | rat | 11 | silently | malware
  <mask>2: create | open | infect | infect | communicate | infect

Table 2: Masked Word Prediction (top-1). The actual words, instead of <mask>, are shown for reference.
Table 3: ATT&CK Technique Classification Results
Table 5: Malware Sentence Classification Results
4.2.2 IoT App Description Classification

IoTSpotter is a tool for automatically identifying mobile-IoT (Internet of Things) apps, IoT-specific libraries, and potential vulnerabilities in the IoT apps (Jin et al., 2022). The authors created a dataset containing the descriptions of 7,237 mobile apps, which are labeled as mobile-IoT apps vs. non-IoT apps with a distribution of approximately 45% and 55%, respectively. They removed stopwords and put all remaining tokens in the description together, ignoring sentence boundaries. We use the datasets^10 without any further processing. The data statistics are shown in Table 16 in the Appendix. The models' classification results are shown in Table 4.

Model | Micro-F1 Mean | Micro-F1 Std. | Macro-F1 Mean | Macro-F1 Std.
bert-base-uncased | 95.78 | 0.04 | 95.70 | 0.05
SecBERT | 94.22 | 0.21 | 94.12 | 0.21
CTI-BERT | 96.40 | 0.26 | 96.33 | 0.26
roberta-base | 95.88 | 0.26 | 95.82 | 0.26
SecRoBERTa | 94.59 | 0.39 | 94.48 | 0.40
SecureBERT | 96.27 | 0.13 | 96.19 | 0.13
Table 4: Performance for IoT App Classification

4.2.3 Malware Sentence Detection

The next two tasks, malware sentence detection and malware attribute classification, are borrowed from SemEval-2018 Task 8, which consisted of four subtasks to measure NLP capabilities for cybersecurity reports (Phandi et al., 2018). The task provided 12,918 annotated sentences extracted from 85 APT reports, based on the MalwareTextDB work (Lim et al., 2017).

The first sub-task is to build models that extract sentences about malware. The dataset is biased, with the ratios of malware and non-malware sentences being 21% and 79%, respectively, as shown in Table 17 in the Appendix. The results are listed in Table 5, which shows that CTI-BERT and SecRoBERTa perform well on this task.

4.2.4 Malware Attribute Classification

This task classifies sentences into the malware attribute categories defined in the MAEC (Malware Attribute Enumeration and Characterization) vocabulary^11. MAEC defines the malware attributes in a 2-level hierarchy with four high-level attribute types (ActionName, Capability, StrategicObjectives and TacticalObjectives) and 444 low-level types. This sub-task was conducted by building models for each of the four high-level attributes. Table 23 in the Appendix shows more details of this dataset for the four high-level attributes. As we can see, the datasets are very sparse with a large number of classes.

^10 [Link] M/IoTSpotter/tree/main/data/dataset
^11 [Link]
Tables 6–9 show the classification results for the four malware attribute types. We can see that CTI-BERT performs well, being the best or second-best model, for all four attribute types.

4.3 Token Classification Tasks

Here, we compare the models' effectiveness for token-level classification using two security-domain NER tasks and a token type detection task. We use the standard sequence-tagging setup and add one dense layer as the classification layer on top of the pretrained language models. The classification layer assigns each token a label using the BIO tagging scheme. Our system is implemented in PyTorch using HuggingFace's transformers (Wolf et al., 2019). The training data is randomly shuffled, and a batch size of 16 is used with post-padding. We set the maximum sequence length to 256 and use cross-entropy loss for model optimization with a learning rate of 2e-5. All other training parameters were set to the default values in transformers. Similarly to the sentence classification tasks, we train five models for each task with the same five seeds for 50 epochs and compare the average mention-level precision, recall and F1-score.
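For concreteness, the sequence-tagging setup described above corresponds to a standard HuggingFace token-classification model with BIO labels and a mention-level evaluation such as seqeval; the sketch below is an assumed rendering with a placeholder label set, not the authors' exact system.

```python
# Assumed rendering of the sequence-tagging setup: a token-classification head on a
# pretrained encoder, BIO labels, max length 256, and mention-level scores via seqeval.
# The label set and example are placeholders, not the NER1/NER2 schemas.
from transformers import AutoModelForTokenClassification, AutoTokenizer
from seqeval.metrics import f1_score, precision_score, recall_score

labels = ["O", "B-Malware", "I-Malware"]           # placeholder BIO label set
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # swap in CTI-BERT etc.
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(labels))

enc = tokenizer("PlugX hides on USB devices", truncation=True,
                max_length=256, return_tensors="pt")
logits = model(**enc).logits                       # (1, seq_len, num_labels), trained with cross-entropy

# Mention-level precision/recall/F1 compare gold and predicted BIO spans.
gold = [["B-Malware", "O", "O", "O", "O"]]
pred = [["B-Malware", "O", "O", "O", "O"]]
print(precision_score(gold, pred), recall_score(gold, pred), f1_score(gold, pred))
```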
4.3.1 NER1: Coarse-grained Security Entities

Cybersecurity entities have very distinct characteristics, and many of them are out-of-vocabulary terms. Here, we investigate whether domain-specific language models can alleviate this vocabulary gap. We collected 967 CTI reports on malware and vulnerabilities. The documents are labeled with the 8 entity types defined in STIX (Structured Threat Information Expression)^12, which is a standard framework for cyber intelligence exchange. The 8 types are Campaign (names of cyber campaigns), CourseOfAction (tools or actions to take to deter cyber attacks), ExploitTarget (vulnerabilities targeted for exploitation), Identity (individuals, groups or organizations involved in attacks), Indicator (objects used to detect suspicious or malicious cyber activity), Malware (malicious code used in cyber crimes), Resource (tools used for cyber attacks), and ThreatActor (individuals or groups that commit cyber crimes). The size of the dataset and detailed statistics of the entity types in the corpus are shown in Table 18 and Table 19 in the Appendix. Table 10 shows the NER results using the mention-level micro average scores.

4.3.2 NER2: Fine-grained Security Entities

We note that some STIX entity types (especially Indicator) are very broad, containing many different sub-types, and are thus difficult to use directly in automatic threat investigation applications. We redesigned the type system into 16 types by dividing broad categories into their subcategories and annotated the test dataset from the NER1 task. We then split the dataset into a [Link] ratio for the train, dev and test sets. Table 20 and Table 21 in the Appendix show the statistics of this dataset. The NER results in Table 11 show that most models perform better for the finer-grained types, and especially that CTI-BERT outperforms all other models by a large margin.

4.3.3 Token Type Classification

The token type detection task is sub-task 2 from SemEval-2018 Task 8, which aims to classify tokens into Entity, Action, Modifier, and Other categories. Action refers to an event. Entity refers to the initiator of the Action (i.e., the subject) or the recipient of the Action (i.e., the object). Modifier refers to tokens that provide elaboration on the Action. All other tokens are assigned to Other. More details on the dataset are shown in Table 22 in the Appendix.

Even though the categories are not semantic types as in NER, this task can also be solved as a token sequence tagging problem, and we thus apply the same system used for the NER tasks. The classification results are shown in Table 12. Overall, the models do not perform very well, likely because the mentions are long and semantically heterogeneous. The results show that the BERT-based models perform better than the RoBERTa-based models.

^12 [Link]
Model | Micro-F1 Mean | Micro-F1 Std. | Macro-F1 Mean | Macro-F1 Std.
bert-base-uncased | 38.79 | 19.68 | 30.37 | 15.79
SecBERT | 43.64 | 3.09 | 33.25 | 2.97
CTI-BERT | 55.76 | 4.92 | 43.37 | 4.92
roberta-base | 56.36 | 4.11 | 44.04 | 3.41
SecRoBERTa | 40.00 | 2.27 | 29.03 | 2.39
SecureBERT | 52.12 | 2.97 | 39.97 | 3.32
Table 6: Performance for ActionName attributes

Model | Micro-F1 Mean | Micro-F1 Std. | Macro-F1 Mean | Macro-F1 Std.
bert-base-uncased | 60.68 | 3.91 | 51.51 | 5.41
SecBERT | 53.18 | 1.82 | 43.39 | 1.9
CTI-BERT | 60.91 | 2.34 | 52.23 | 4.39
roberta-base | 59.77 | 3.71 | 50.86 | 3.80
SecRoBERTa | 46.82 | 1.96 | 37.70 | 4.26
SecureBERT | 61.59 | 2.73 | 54.12 | 4.66
Table 7: Performance for Capability attributes

Table 8: Performance for StrategicObjective attributes
Table 9: Performance for TacticalObjective attributes

Model | Precision | Recall | F1
bert-base-uncased | 72.04 | 68.67 | 70.31
SecBERT | 69.74 | 63.98 | 66.73
CTI-BERT | 75.63 | 75.88 | 75.75
roberta-base | 72.52 | 68.99 | 70.70
SecRoBERTa | 68.00 | 59.46 | 63.44
SecureBERT | 73.47 | 72.51 | 72.99
Table 10: NER1 Results (mention-level micro average)

Model | Precision | Recall | F1
bert-base-uncased | 73.44 | 68.23 | 70.73
SecBERT | 68.58 | 60.90 | 64.43
CTI-BERT | 83.35 | 78.62 | 80.91
roberta-base | 72.17 | 73.51 | 72.80
SecRoBERTa | 71.91 | 55.01 | 62.34
SecureBERT | 76.66 | 75.98 | 76.30
Table 11: NER2 Results (mention-level micro average)

Model | Precision | Recall | F1
bert-base-uncased | 22.97 | 44.51 | 30.27
SecBERT | 21.63 | 36.20 | 27.02
CTI-BERT | 22.67 | 47.77 | 30.70
roberta-base | 15.05 | 17.44 | 15.97
SecRoBERTa | 14.18 | 20.71 | 16.81
SecureBERT | 22.58 | 46.97 | 30.46
Table 12: Token Type Classification Results (mention-level micro average)

5 Related Work

Motivated by the successes of large-scale foundation models on many general-domain NLP tasks, several domain-specific language models have been developed (Roy et al., 2017, 2019; Mumtaz et al., 2020). In the scientific and bio-medical domains, there are SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019), ClinicalBERT (Huang et al., 2019), BioBERT (Lee et al., 2020) and PubMedBERT (Gu et al., 2022). In the political and legal domains, there are ConfliBERT (Hu et al., 2022) and LegalBERT (Chalkidis et al., 2020). These domain models have been shown to improve the performance of downstream applications in their domains.

There have been several attempts to construct language models for the cybersecurity domain. Roy et al. (2017, 2019) propose techniques to efficiently learn domain-specific language models with a small in-domain corpus by incorporating external domain knowledge. They train Word2Vec models using malware descriptions. Similarly, Mumtaz et al. (2020) train a Word2Vec model using security vulnerability-related bulletins and Wikipedia pages.

Recently, transformer-based models have been built for the cybersecurity domain: CyBERT (Priyanka Ranade and Finin, 2021), SecBERT (jackaduma, 2022) and SecureBERT (Aghaei et al., 2023). CyBERT is trained with a relatively small corpus consisting of 500 security blogs, 16,000 CVE records, and the APTnotes collection. Further, CyBERT applies continual pretraining and uses the BERT model's vocabulary after adding the 1,000 most frequent words in their corpus that do not exist in the base vocabulary. SecBERT provides both BERT and RoBERTa models trained on a security corpus consisting of APTnotes, the SemEval-2018 Task 8 dataset and Stucco-Data^13, which contains security blogs and reports; however, details about the data and experimental results are not available. SecureBERT trains a RoBERTa model using security reports, white papers, academic books, etc., which are similar to our dataset both in terms of size and document type.

^13 [Link]
However, the model is built using the continual-pretraining method, while CTI-BERT is trained from scratch. We believe that the main difference comes from CTI-BERT being trained from scratch and having a vocabulary specialized to the domain, compared to the extended vocabulary used in CyBERT and SecureBERT. Table 14 compares the different training strategies used for these models.

6 Conclusion

We presented a new pretrained BERT model tailored to the cybersecurity domain. Specifically, we designed the model to improve the accuracy of cyber-threat intelligence extraction and understanding, such as security entity (IoC) extraction and attack technique (TTP) classification. As demonstrated by the experiments in Section 4, our model outperforms existing general-domain and other cybersecurity-domain models with the same base architecture. For future work, we plan to collect more documents to improve the model and also to train other language models to support different security applications.

Limitations

The model is pretrained using only English data. While the majority of cybersecurity-related information is distributed in English, we consider adding support for multiple languages in future work. Further, while we demonstrate that CTI-BERT outperforms other security-specific LMs on a variety of tasks, the benchmark datasets are relatively small. Thus, the findings may not be conclusive, and further evaluations with more data are needed.

Ethical Considerations

To our knowledge, this research poses very low ethical risk. All datasets were collected from reputable sources, which are publicly available. The only personal information in our corpus is the authors' names and their affiliations in the USENIX Security proceedings. However, we do not expose their identities nor use this information in this work.

References

Ehsan Aghaei, Xi Niu, Waseem Shadid, and Ehab Al-Shaer. 2023. SecureBERT: A Domain-Specific Language Model for Cybersecurity, pages 39–56.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The muppets straight out of law school. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904. Association for Computational Linguistics.

Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. 2022. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans. Comput. Heal., 3(1):2:1–2:23.

Yibo Hu, MohammadSaleh Hosseini, Erick Skorupa Parolin, Javier Osorio, Latifur Khan, Patrick Brandt, and Vito D'Orazio. 2022. ConfliBERT: A pre-trained language model for political conflict and violence. In The Conference of the North American Chapter of the Association for Computational Linguistics NAACL.

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. 2019. ClinicalBERT: Modeling clinical notes and predicting hospital readmission. CoRR.

jackaduma. 2022. SecBERT. [Link]

Xin Jin, Sunil Manandhar, Kaushal Kafle, Zhiqiang Lin, and Adwait Nadkarni. 2022. Understanding IoT security from a market-scale perspective. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, CCS, pages 1615–1629. ACM.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Swee Kiat Lim, Aldrian Obaja Muis, Wei Lu, and Ong Chen Hui. 2017. MalwareTextDB: A database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL, pages 1557–1567.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Sara Mumtaz, Carlos Rodriguez, Boualem Benatallah, Mortada Al-Banna, and Shayan Zamanirad. 2020. Learning word representation for the cyber security vulnerability domain. In International Joint Conference on Neural Networks (IJCNN), pages 1–8.

Yifan Peng, Shankai Yan, and Zhiyong Lu. 2019. Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, BioNLP@ACL, pages 58–65.

Peter Phandi, Amila Silva, and Wei Lu. 2018. SemEval-2018 task 8: Semantic extraction from cybersecurity reports using natural language processing (SecureNLP). In Proceedings of The 12th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2018, pages 697–706.
A Details on Model Training

Term | CTI-BERT | bert-base-uncased
apt* | apt, apt1, apt10, apt28, apt29, apt41, apts | apt
backdoor* | backdoor, backdoored, backdoors | –
*bot | abbot, agobot, bot, gaobot, ircbot, ourbot, qakbot, qbot, rbot, robot, sabot, sdbot, spybot, syzbot, trickbot, zbot | abbot, bot, robot, talbot
*crime* | crime, crimes, crimeware, cybercrime | crime, crimea, crimean, crimes
crypto* | crypto, cryptoc, cryptocurr, cryptocurrencies, cryptocurrency, cryptograph, cryptographers, cryptographic, cryptographically, cryptography, cryptojacking, cryptol, cryptolocker, cryptology, cryptom, cryptomining, cryptosystem, cryptosystems, cryptow, cryptowall | –
cyber* | cyber, cyberark, cyberattack, cyberattackers, cyberattacks, cyberb, cybercri, cybercrime, cybercrimes, cybercriminal, cybercriminals, cyberdefense, cybere, cybereason, cyberespionage, cybers, cybersec, cybersecurity, cyberspace, cyberthre, cyberthreat, cyberthreats, cyberwar, cyberwarfare, cyberweap | cyber
dark* | dark, darknet, darkreading, darks, darkside | dark, darkened, darkening, darker, darkest, darkly, darkness
hijack* | hijack, hijacked, hijacker, hijackers, hijacking, hijacks | –
key* | key, keybase, keyboard, keyboards, keychain, keyctl, keyed, keygen, keying, keylog, keylogger, keyloggers, keylogging, keynote, keypad, keyring, keyrings, keys, keyspan, keyst, keystone, keystore, keystream, keystro, keystroke, keystrokes, keytouch, keyword, keywords | key, keyboard, keyboardist, keyboards, keynes, keynote, keys, keystone
*kit | applewebkit, bootkit, kit, rootkit, toolkit, webkit | bukit, kit
malware* | malware, malwarebytes, malwares | –
*net | botnet, cabinet, cnet, darknet, dotnet, ethernet, fortinet, genet, honeynet, inet, internet, intranet, kennet, kinet, kuznet, magnet, monet, net, phonet, planet, stuxnet, subnet, technet, telnet, vnet, x9cinternet, zdnet | barnet, baronet, bonnet, cabinet, clarinet, ethernet, hornet, internet, janet, magnet, net, planet
trojan* | trojan, trojanized, trojans | trojan
*virus* | antivirus, coronavirus, virus, viruses, virusscan, virustotal | virus, viruses
web* | web, webapp, webapps, webassembly, webc, webcam, webcams, webcast, webcasts, webclient, webcore, webd, webdav, webex, webgl, webhook, webin, webinar, webkit, webkitbuild, webkitgtk, weblog, weblogic, webm, webmail, webmaster, webpage, webpages, webresources, webroot, webrtc, webs, websense, webserver, webshell, website, websites, websocket, webspace, websphere, webtools, webview | web, webb, webber, weber, website, websites, webster
*ware | adware, antimalware, aware, beware, coveware, crimeware, delaware, designware, firmware, foxitsoftware, freeware, hardware, malware, middleware, radware, ransomware, scareware, shareware, slackware, software, spyware, unaware, vmware, ware, x9cmalware | aware, delaware, hardware, software, unaware, ware
B Details on Experiment Datasets

 | Train | Dev. | Test | Total
# Sentences | 754 | 355 | 382 | 1,491
# Tokens | 138,721 | 19,578 | 7,985 | 166,284
Table 15: Summary of TRAM Data

 | Train | Dev | Test | Total
# Documents | 5,214 | 1,058 | 965 | 7,237
# Tokens | 635,220 | 133,546 | 106,084 | 874,850
Table 16: Summary of IoTSpotter Data

 | Train | Dev. | Test | Total
# Sentences | 9,424 | 1,213 | 618 | 11,255
# Tokens | 1,020,655 | 146,362 | 56,216 | 1,223,233

Entity Type | Train | Dev | Test | Total
Campaign | 39 | 0 | 4 | 43
SecurityAdvisory | 54 | 12 | 30 | 96
Vulnerability | 401 | 50 | 86 | 537
DomainName | 169 | 3 | 16 | 188
EmailAddress | 6 | 1 | 1 | 8
Endpoint | 3 | 0 | 0 | 3
FileName | 210 | 37 | 24 | 271
Hash | 93 | 5 | 3 | 101
IpAddress | 37 | 0 | 2 | 39
Network | 3 | 0 | 0 | 3
URL | 181 | 20 | 27 | 228
WindowsRegistry | 9 | 0 | 0 | 9
AvSignature | 99 | 13 | 10 | 122
MalwareFamily | 554 | 53 | 47 | 654
Technique | 334 | 39 | 76 | 449
ThreatActor | 89 | 4 | 7 | 100
Table 21: Entity Types and Distributions in the NER2 Dataset