Advances in NLP with Pre-Trained Models
Bonan Min*1, Hayley Ross*2, Elior Sulem*3, Amir Pouran Ben Veyseh*4,
Thien Huu Nguyen4, Oscar Sainz5, Eneko Agirre5, Ilana Heintz1, and Dan Roth3
1 Raytheon BBN Technologies
{bonan.min, ilana.Heintz}@raytheon.com
2 Harvard University
[email protected]
3 University of Pennsylvania
{eliors, danroth}@seas.upenn.edu
4 University of Oregon
{apouran, thien}@cs.uoregon.edu
5 University of the Basque Country (UPV/EHU)
{oscar.sainz, e.agirre}@ehu.eus
corpus, and then perform a small amount of task-specific fine-tuning for the task of interest.

via PLMs for a broad range of NLP tasks. We discuss limitations and provide directions for future research in Section 6 and conclude in Section 7.
Figure 1: Three types of pre-trained language models. Model architecture illustrations are from Lewis et al.
(2020). For the encoder-decoder model, the corruption strategy of document rotation is shown. Alternatives
include sentence permutation, text infilling, token deletion/masking, etc.
and semantic role labeling (SRL). Collobert and Weston proposed sharing the weights of their deepest convolutional layer – the word embeddings learned by the model – between the multiple training tasks and fine-tuning the weights of the remaining two feed-forward layers for each individual task.

Pre-training and fine-tuning did not gain popularity in NLP until the advent of ELMo (Peters et al., 2018) and ULMFiT (Howard and Ruder, 2018). Both models are based on the Long Short-Term Memory architecture (LSTMs) (Hochreiter and Schmidhuber, 1997), but differ in significant ways. ULMFiT pre-trains a three-layer LSTM on a standard language modeling objective, predicting the next token in a sequence. ELMo uses layers of bidirectional LSTMs that combine two language model tasks in forward and backward directions to capture context from both sides. Both proposed fine-tuning the language model layer by layer for downstream application. Both studies also suggested adding additional classifier layers on top of the language model, which were fine-tuned alongside the language model layers. These changes, combined with the substantially larger model size and pre-training corpus size compared to previous models, allowed the pre-training then fine-tuning paradigm to succeed. Both ELMo and ULMFiT showed competitive or improved performance compared to the then-state-of-the-art for a number of tasks, demonstrating the value of language model pre-training on a large scale.

The pace of this paradigm shift picked up dramatically in late 2018, building on the Transformer architecture that Vaswani et al. (2017) had introduced and that can be used for language model pre-training. The Transformer’s multi-head self-attention mechanism allows every word to attend to all previous words or every word except the target, allowing the model to efficiently capture long-range dependencies without the expensive recurrent computation in LSTMs. Multiple layers of multi-head self-attention allow for increasingly more expressive representations, useful for a range of NLP problems. As a result, nearly all popular language models, including GPT, BERT, BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), are now based on the Transformer architecture. They also differ in a number of important ways, which we discuss in the following sections. For more details about the Transformer architecture, we refer the reader to the original paper or to the excellent tutorials available3,4.

3 http://nlp.seas.harvard.edu/2018/04/03/attention.html
4 http://jalammar.github.io/illustrated-transformer/

2.2 Modern Pre-Trained Language Models

There are three classes of pre-trained language models: autoregressive language models (e.g. GPT), masked language models (e.g. BERT), and encoder-decoder models (e.g. BART, T5). Figure 1 shows the difference in model architecture and training objectives with an example training input for each.

2.2.1 Autoregressive Language Models
Model | Pre-Training Sources | Size of Pre-Training Corpus | # Model parameters
(1) English Monolingual Models
BERT (base) (Devlin et al., 2019) | Wiki, books | 3.3B tokens (13GB data) | 110M
BERT (large) (Devlin et al., 2019) | Wiki, books | 3.3B tokens (13GB data) | 340M
RoBERTa (Liu et al., 2019) | Wiki, books, web crawl | 161GB data | 340M
XLNet (Yang et al., 2019) | Wiki, books, web crawl | 142GB data | 340M
GPT (Radford et al., 2018) | Web crawl | 800M tokens | 117M
GPT-2 (Radford et al., 2019) | Web crawl | 8M documents (40GB data) | 1.5B
GPT-3 (Brown et al., 2020) | Wiki, books, web crawl | 500B tokens | 175B
BART (Lewis et al., 2020) | Wiki, books | 3.3B tokens | ∼370M
T5 (Raffel et al., 2020) | Web crawl | 200B tokens (750GB data) | 11B
(2) Multilingual Models
mBERT (Devlin et al., 2019) | Wiki | 21.9B tokens | 172M
XLM-R (base) (Conneau et al., 2020) | Web crawl | 295B tokens | 270M
XLM-R (large) (Conneau et al., 2020) | Web crawl | 295B tokens | 550M
mT5 (large) (Raffel et al., 2020) | Web crawl | 6.3T tokens | 1.2B
mT5 (XXL) (Raffel et al., 2020) | Web crawl | 6.3T tokens | 13B
Table 1: Training sources, dataset size, and model parameters for popular PLMs. Data sources differ, and are
described in the citations listed in each row.
An autoregressive language model is trained to predict the next word xi given all previous words x1, x2, ..., and xi−1. The training objective is to maximize the log-likelihood Σi log(P(xi | x1, x2, ..., xi−1); θT), in which θT are the model parameters. In a Transformer decoder, these are in multiple layers of multi-head self-attention modules. Typical models include GPT (Radford et al., 2018), GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020)5.

GPT only utilizes the autoregressive decoder portion of the Transformer architecture, stacking multiple transformer decoder layers with masked self-attention. This allows the model to attend to all previous tokens in the sequence when predicting the next token. Each newer version of GPT is trained with increasingly large amounts of text (Table 1).

The GPT paper (Radford et al., 2018) proposed fine-tuning GPT for specific tasks, providing examples for natural language inference, QA (including commonsense reasoning), semantic similarity and paraphrase detection, sentiment analysis, and linguistic acceptability (CoLA, Warstadt et al., 2019), as well as the GLUE benchmark. In particular, GPT achieves a dramatic improvement on CoLA (scoring 45.4 compared to the previous state of the art of 35.0), showcasing the model’s ability to gain a much more sophisticated grasp of language than previous models. Subsequent versions of GPT (GPT-2 and GPT-3, Radford et al., 2019; Brown et al., 2020), however, do not opt for the fine-tuning approach and instead leverage GPT’s generative design to tackle tasks in a prompt-based manner or via outright language generation, as described in Sections 3 and 4.

2.2.2 Masked Language Models

Whereas autoregressive models are unidirectional, masked language models (MLMs) predict a “masked” word conditioned on all other words in the sequence. When training an MLM, words are chosen at random to be masked, using a special token [MASK], or replaced by a random token. This forces the model to collect bidirectional information in making predictions. The training objective is to recover the original tokens at the masked positions: Σi mi log(P(xi | x1, ..., xi−1, xi+1, ..., xn); θT), in which mi ∈ {0, 1} indicates whether xi is masked or not, and θT are the parameters in a Transformer encoder. Note that in BERT and similar models, it is a common practice to mask multiple words from a sequence to allow parallel training. Popular examples of MLMs include BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and XLM-R (Conneau et al., 2020).

Specifically, MLMs such as BERT use the encoder portion of the Transformer architecture. Like autoregressive models, MLMs stack multiple transformer encoder layers to learn increasingly complex and meaningful representations, but they use self-attention over all other tokens in the sequence, in both directions, when learning a representation for a particular token.

5 Open-source re-implementations of GPT are also available, such as GPT-Neo (Black et al., 2021) and GPT-J (Wang, 2021), trained on an 800GB open-source dataset (Gao et al., 2020a), with model sizes similar to GPT-2 (2.7B and 6B parameters respectively).
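As a concrete illustration of the two objectives above, the following minimal sketch computes a causal LM loss and a masked LM loss with the HuggingFace transformers library (assumed available; the gpt2 and bert-base-uncased checkpoints, the example sentence, and the single masked position are purely illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM

text = "Pre-trained language models capture broad linguistic knowledge."

# Autoregressive (causal) LM: the loss is the negative log-likelihood of each
# token given all previous tokens, i.e. -sum_i log P(x_i | x_1, ..., x_{i-1}).
clm_tok = AutoTokenizer.from_pretrained("gpt2")
clm = AutoModelForCausalLM.from_pretrained("gpt2")
clm_inputs = clm_tok(text, return_tensors="pt")
clm_loss = clm(**clm_inputs, labels=clm_inputs["input_ids"]).loss  # mean NLL per token

# Masked LM: hide a token and recover it; unmasked positions get the label -100
# so they are ignored by the loss. (In practice ~15% of positions are masked at
# random; here a single interior wordpiece is masked for clarity.)
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
mlm_inputs = mlm_tok(text, return_tensors="pt")
input_ids = mlm_inputs["input_ids"]
masked_pos = 3
labels = torch.full_like(input_ids, -100)          # everything ignored ...
labels[0, masked_pos] = input_ids[0, masked_pos]   # ... except the masked position
input_ids[0, masked_pos] = mlm_tok.mask_token_id   # replace it with [MASK]
mlm_loss = mlm(**mlm_inputs, labels=labels).loss

print(float(clm_loss), float(mlm_loss))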
The non-autoregressive nature allows the computation to be parallelized, so it is often more efficient at inference time. Dynamic unfolding of all positions in relation to the masked word provides efficiency at training time.

There is a large family of models derived from BERT, including RoBERTa (Liu et al., 2019), which improves BERT’s pre-training, ALBERT (Lan et al., 2020), which is smaller and faster to train, and XLNet (Yang et al., 2019) and Transformer-XL (Dai et al., 2019), which incorporate an autoregressive pre-training approach to better handle long-distance dependencies. There is also a range of derived models trained on specific domains (Table 6 in Appendix A). See Qiu et al. (2020) for a full taxonomy of BERT-derived models.

2.2.3 Encoder-Decoder Language Models

The encoder-decoder model is a more flexible “text in, text out” model that learns to generate a sequence of tokens y1, ..., yn given an input sequence x1, ..., xm. Given a pair of sequences, the training objective is to maximize the log-likelihood log(P(y1, ..., yn | x1, ..., xm); θT), in which θT are the parameters in a full encoder-decoder Transformer model (Vaswani et al., 2017).

To generate adequate data for self-supervised pre-training, researchers experiment with different forms of sequence corruption. The input is a token sequence modified in some particular way, and the output is the reconstructed, original sequence. Forms of sequence corruption include document rotation, shown in Figure 1, sentence permutation, text infilling, token deletion/masking, and others. Representative models include BART (Lewis et al., 2020) and T5 (Raffel et al., 2020).

Given the sequence-to-sequence (seq2seq) nature, it is straightforward to fine-tune the encoder-decoder language model to perform seq2seq tasks such as Machine Translation, style transfer, and text summarization. The seq2seq formulation is also versatile: many tasks can be reformulated as “text in, text out”. We describe those approaches in detail in Section 4.
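To make the “text in, text out” formulation concrete, the sketch below runs one fine-tuning step of an encoder-decoder model on a single input/target pair with HuggingFace transformers (assumed available; the t5-small checkpoint, the summarization-style prefix, and the text pair are illustrative, and exact generation arguments may vary slightly across library versions):

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# "Text in, text out": any task is expressed as an (input text, target text) pair.
source = "summarize: Pre-trained encoder-decoder models map an input sequence to an output sequence."
target = "Encoder-decoder models map text to text."

inputs = tok(source, return_tensors="pt")
labels = tok(target, return_tensors="pt").input_ids

# The model scores log P(y_1..y_n | x_1..x_m); .loss is the mean negative log-likelihood.
loss = model(**inputs, labels=labels).loss
loss.backward()  # an optimizer.step() would complete one fine-tuning step

# Inference: generate the output sequence token by token.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))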
2.3 Pre-Training Corpora

The pre-training corpus is a primary distinguishing factor between language models. Both the size and the quality (source data characteristics) are important considerations. Table 1 presents the sources and the corpus size used for several popular language models. There is a clear trend of increasing the size of the pre-training corpus as well as increasing the diversity of the data. For example, ULMFiT (Howard and Ruder, 2018) is trained on a small, highly pre-processed corpus of ∼29,000 Wikipedia articles (103 million words), and is representative of models of that year. A few years later, models such as XLM-R (Conneau et al., 2020) and GPT-3 (Brown et al., 2020) leveraged billions of words of crawled web data (diverse in nature). Raffel et al. (2020) observe that the primary gains in performance are typically driven by model size and dataset size (“the bigger, the better”), if the quality of the dataset is held constant. They find that quality can play a larger role if there is a genre match to the task, but a larger dataset provides more advantages, eventually overcoming any gain from quality. For a detailed discussion of model performance scaling by model size, dataset size, and other factors, see Kaplan et al. (2020). Despite the advantages of the larger dataset, Raffel et al. (2020) also demonstrate the importance of cleaning large crawled datasets. They show that a model trained on such an unfiltered dataset performs substantially worse than if filtering heuristics are applied. Similarly, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) use heuristics to improve the quality of the training data. However, Hendrycks et al. (2020) noted that larger models do not necessarily perform better out of domain. Lin et al. (2021) also observe that larger language models (trained on these very diverse sources) are more likely to incorrectly answer questions that some humans would answer incorrectly due to false beliefs or misconceptions, thus mimicking the inaccuracies in their training data.

The domain of intended downstream applications is an important consideration for pre-training source data selection. Table 6 (Appendix A) provides a list of domain-specific pre-trained language models that achieved significantly better performance in the intended domain than general-purpose language models. These models are either trained from scratch or trained with domain-specific text using a general-purpose model as the initialization.

2.4 Fine-Tuning: Applying PLMs to NLP Tasks

Having described the various approaches to creating complex, meaningful representations through pre-training, we turn to the fine-tuning step that allows PLMs to perform accurately on disparate NLP tasks.
Figure 2: Typical “pre-train then fine-tune” strategies. We illustrate strategies that fine-tune the full PLM (left), fine-tune the full PLM in a custom model (center), and fine-tune just a small adapter sub-layer for each Transformer layer (right). We show the Transformer blocks that will be fine-tuned for the specific tasks in blue, and the frozen blocks (whose pre-trained weights are kept unchanged) in grey. For brevity, we represent the entire Transformer block (stacked in n layers) by its multi-head self-attention and (if applicable) adapter layers. We refer interested readers to Vaswani et al. (2017) and Pfeiffer et al. (2020a) for more architecture details. “Heads” refers to task-specific prediction functions (Wolf et al., 2020).
Figure 2 illustrates typical pre-training then fine-tuning strategies. We describe each of them below. A more comprehensive list of prior work using different pre-training then fine-tuning strategies is in Table 8 (Appendix B).

2.4.1 Contextual Embeddings

The simplest approach to using large pre-trained language models is to “freeze” the model and use its output as sophisticated, context-sensitive word embeddings for a subsequent architecture, which is trained from scratch for the specific task. In other words, while this still involves a forward pass through the pre-trained language model over the input text, the language model’s weights are not fine-tuned, rendering this approach closer to a feature extraction family of approaches in classic statistical NLP. There are three types of scenarios for using frozen PLMs.
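A minimal sketch of this feature-extraction use, assuming HuggingFace transformers and PyTorch are available (the bert-base-uncased checkpoint, the linear tagger, and the label count are illustrative): the frozen PLM is run once under no_grad and its hidden states are handed to a separate task model trained from scratch.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased")
plm.eval()  # the PLM is frozen: no dropout, no weight updates

sentence = "Frozen contextual embeddings feed a task-specific model."
inputs = tok(sentence, return_tensors="pt")

with torch.no_grad():  # no gradients flow into the PLM
    outputs = plm(**inputs, output_hidden_states=True)

# Use the top layer, or e.g. a concatenation of the top 4 layers (Peters et al., 2018).
top_layer = outputs.last_hidden_state                 # (1, seq_len, 768)
top4 = torch.cat(outputs.hidden_states[-4:], dim=-1)  # (1, seq_len, 4 * 768)

# A small task model on top of the frozen embeddings, e.g. a linear tagger.
tagger = torch.nn.Linear(top4.size(-1), 5)  # 5 = number of task labels (illustrative)
logits = tagger(top4)                        # only the tagger's parameters receive gradients
print(logits.shape)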
In contexts with insufficient labeled data or compute power, “frozen” contextual embeddings are employed. For non-benchmark tasks, the only labeled training datasets are too small to fine-tune even the top layers of BERT-base, let alone larger models. The computational cost of fine-tuning the entire PLM may be prohibitive for some applications or developers, leading to use of the more efficient frozen PLM solution. Other data-efficient and time-efficient approaches to fine-tuning are discussed in Section 2.4.4.

Highly complex or difficult NLP tasks often make use of the frozen PLM technique to help reduce training complexity. Examples are constituency parsing (Zhang et al., 2020c), semantic graph parsing using UCCA6 (Jiang et al., 2019) and AMR7 (Zhang et al., 2019b; Naseem et al., 2019; Zhou et al., 2020b), Aspect-Based Sentiment Analysis (Li et al., 2019b) and Machine Translation (Zhu et al., 2020). For instance, Zhang et al. (2020c) uses frozen BERT embeddings to seed an innovative approach to Conditional Random Field (CRF) modeling (Lafferty et al., 2001) that replaces the inside-outside algorithm with backpropagation, using a two-step process to first bracket then label the parses, and a batched version of the CKY algorithm. For complex tasks like these, there may only be enough data or compute power available to train the secondary model (Zhang et al. (2019b) cited limitations in compute power). While the use of frozen PLM parameters is currently in vogue for these tasks, perhaps due to researcher preference for simplicity as well as computational requirements, we may see a shift to full-model fine-tuning for tasks with sufficient training data.

Unsupervised tasks such as word sense disambiguation (Hadiwinoto et al., 2019) and word sense induction (Amrami and Goldberg, 2019) are not associated with a supervised dataset for fine-tuning.

6 Universal Conceptual Cognitive Annotation (Abend and Rappoport, 2013)
7 Abstract Meaning Representation (Banarescu et al., 2013)
Instead, frozen BERT embeddings are fed through a variety of strategies such as nearest-neighbour matching, affine transformations, gated linear units (GLU, Dauphin et al., 2017) or clustering algorithms to perform these tasks.

2.4.2 Fine-tuning the PLM

This approach fine-tunes some or all the layers of the PLM and then adds one or two simple output layers (known as prediction heads, Wolf et al., 2020). Typically, these are feed-forward layers for classification. The output layers and the PLM are trained together in an end-to-end setup, but the bulk of the computation is applied to fine-tuning the language model to produce the desired representation of the input. The task of the output layers is merely to condense the information provided by the embeddings of each token into the number of desired classes. The word embeddings may come from the top layer, or from a concatenation or a weighted average of the top n (often n = 4) layers (Peters et al., 2018). Figure 2 (left) shows an illustration of this approach.

This approach is most suitable for sequence classification tasks (e.g. sentiment analysis, NLI, semantic similarity), sequence tagging tasks such as NER, and span extraction tasks (e.g. QA) in which the newly trained layers learn the start and end span of an answer.

For sequence classification tasks, Devlin et al. (2019) suggests fine-tuning BERT’s representation of the special [CLS] token, followed by a single feed-forward layer that classifies it as one of the task labels. For token-level or span-level classification tasks, the representations of each token, or alternatively just the representation of the first sub-token of each token or span (as in Devlin et al., 2019), may be passed to the classifier. This fine-tuning approach is used to apply BERT to all 11 tasks in GLUE, as well as QA (SQuAD), NER (CoNLL 2003), and common-sense inference (SWAG). For many additional examples of this highly popular approach, see Table 8 (Appendix B).
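A minimal sketch of this setup, assuming HuggingFace transformers and PyTorch (the checkpoint name, label set, single training step, and learning rate are illustrative): a randomly initialized classification head is placed over the [CLS] representation and trained end to end together with the PLM.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# from_pretrained adds a newly initialized feed-forward head over [CLS].
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tok(["best pizza ever!", "the service was terrible"],
            padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (illustrative labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # low LR: the PLM is already trained

outputs = model(**batch, labels=labels)   # head and PLM are fine-tuned end to end
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(outputs.logits.argmax(dim=-1))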
In this setting, care is needed to choose an appropriate learning rate that works both for the weights of the feed-forward layer(s) and for the PLM. Since the PLM is already largely trained, a low learning rate should be used (between 1e-3 (Raffel et al., 2020) and 1e-5 (Liu et al., 2019)), with a lower learning rate for smaller datasets. However, the randomly initialized feed-forward layer weights still require significant training. As such, it is a common practice to freeze the language model layers temporarily while initially training the feed-forward layers, then unfreeze the language model gradually for additional fine-tuning (Howard and Ruder, 2018; Yang et al., 2019). The degree to which this should be done depends on the size of the feed-forward layers, and whether a token such as BERT’s [CLS] is being used. If the majority of the labour is being done by [CLS], as in all the examples in Devlin et al. (2019), there are fewer benefits to training the feed-forward layer alone. Again, this is a function of the availability of supervised training data.

The next choice is how many layers of the PLM to fine-tune. While the examples in the BERT paper fine-tune the entire model, this is not feasible for NLP tasks with small datasets or in situations where compute power is a limitation. Often, tuning just the top few layers of the language model is sufficient; for example, Ross et al. (2020) only fine-tune the top layer of BERT on their small supervised dataset of 2000 sentences. A range of papers in the growing field of “BERTology” (Tenney et al., 2019; Clark et al., 2019b; Rogers et al., 2020) show that the lower layers of BERT contain word-specific and syntactic information such as part of speech, while the upper layers contain more semantic and increasingly complex information such as semantic roles and coreference information.
learning rate for smaller datasets. However, the Any sequence-to-sequence task that uses a pre-
randomly initialized feed-forward layer weights trained language model as its encoder may em-
still require significant training. As such, it is a ploy this approach. An interesting example is Zhu
common practice to freeze the language model lay- et al. (2020)’s formulation of machine translation.
However, Zhu et al. did not find any significant improvement over using BERT-based frozen word embeddings.

A related and highly successful approach is to fine-tune the entire language model with a small number of feed-forward layers, then layer on an algorithmic approach that provides a substantial amount of task-specific heavy lifting. For example, it might transform the task from a classification problem (as understood by the language model) into the desired target formulation, often a structured form such as a tree or a set of clusters. For coreference resolution, Joshi et al. (2019, 2020) add a substantial algorithm, in their case e2e-coref (Lee et al., 2018), which transforms ratings of pairs of spans into valid mention clusters. Specifically, for each candidate mention span, the algorithm computes a distribution over possible antecedent spans from the mention score (whether it is likely to be a mention) and the compatibility score of the two spans, which itself involves a feed-forward network to compute. Two more structural parsing examples in this vein are temporal dependency parsing (Ross et al., 2020) and modal dependency parsing (Yao et al., 2021). These studies approach tree building algorithmically by first performing a classification problem to identify suitable dependency pairs, then ranking them to construct a valid tree.
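The general pattern, a PLM encoder fine-tuned underneath a substantial task-specific architecture, can be sketched as below (PyTorch and transformers assumed; the pair scorer is a deliberately simplified stand-in for components such as span-pair or dependency-pair scoring, not a faithful reimplementation of any cited system):

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PairScoringModel(nn.Module):
    """A PLM encoder with a task-specific head that scores pairs of tokens,
    e.g. candidate dependency or antecedent pairs (illustrative only)."""

    def __init__(self, plm_name: str = "bert-base-uncased", hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)   # fine-tuned end to end
        dim = self.encoder.config.hidden_size
        self.pair_scorer = nn.Sequential(                    # task-specific architecture
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        n = states.size(1)
        # Score every (head, dependent) token pair; a decoding algorithm (e.g. a
        # maximum spanning tree or greedy clustering) would then run on these scores.
        heads = states.unsqueeze(2).expand(-1, -1, n, -1)   # (batch, n, n, dim)
        deps = states.unsqueeze(1).expand(-1, n, -1, -1)    # (batch, n, n, dim)
        return self.pair_scorer(torch.cat([heads, deps], dim=-1)).squeeze(-1)

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = PairScoringModel()
batch = tok("The model scores token pairs.", return_tensors="pt")
scores = model(batch["input_ids"], batch["attention_mask"])  # (1, n, n)
print(scores.shape)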
2.4.4 Efficient Fine-tuning Approaches

A wide range of approaches, in addition to limiting fine-tuning to the top layers, seek to fine-tune only a small number of model weights. These can be classified into two types: (a) fine-tuning a separate, small network that is tightly coupled with the PLM (but does not change it), and (b) selecting only a small number of the PLM’s weights to fine-tune or keep.

The most prominent approach of the first type uses adapter modules (Houlsby et al., 2019; Bapna and Firat, 2019; Pfeiffer et al., 2020b,a), as illustrated in Figure 2 (right). Adapters add a small set of newly initialized weights at every layer of the transformer. Houlsby et al. (2019) show that a two-layer feed-forward network with a bottleneck works well. The placement and configuration of the adapters within the Transformer blocks varies in the literature (Houlsby et al., 2019; Bapna and Firat, 2019; Stickland and Murray, 2019; Pfeiffer et al., 2020b). During fine-tuning, all weights in the PLM remain frozen except for the few weights in the adapters. One set of adapters is fine-tuned per task of interest. This approach is more efficient in training (typically < 5% of all PLM weights), and allows efficient weight-sharing, both in terms of using the same frozen PLM for each task, and in allowing the weights of adapter modules to be distributed and also re-used. Notably, the weights of adapters independently trained for different tasks can be successfully combined to solve a new task (Pfeiffer et al., 2020b). Finally, catastrophic forgetting of old capabilities when fine-tuning on a new task or language is prevented. AdapterHub (Pfeiffer et al., 2020a) and Trankit (Nguyen et al., 2021) are examples of frameworks promoting an adapter ecosystem; an example of using adapters for Universal Dependency Parsing is Üstün et al. (2020).
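A sketch of the bottleneck adapter idea in plain PyTorch (dimensions, placement, and initialization are illustrative; real implementations such as those distributed via AdapterHub differ in details):

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Two-layer bottleneck feed-forward network with a residual connection,
    inserted inside each Transformer layer while the PLM weights stay frozen."""

    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # project down ...
        self.up = nn.Linear(bottleneck, hidden_size)    # ... and back up
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))  # residual

# Only the adapter parameters (roughly 0.1M per layer here, vs. ~7M per BERT-base
# layer) would be trained; the surrounding Transformer layer is kept frozen.
adapter = Adapter()
x = torch.randn(2, 16, 768)          # (batch, sequence length, hidden size)
print(adapter(x).shape)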
A similar method is side-tuning (Zhang et al., 2020b), which adapts a pre-trained network by training a lightweight “side” network that is fused with the (unchanged) pre-trained network using a simple additive process. Also closely related is diff-pruning (Guo et al., 2021), which adds a sparse, task-specific difference vector to the original (frozen) parameters. These difference vectors are regularized to be sparse, which further decreases the number of weights that need to be stored (around 0.5% of the original model’s parameters).

Moving to the second type of approach, BitFit (Zaken et al., 2021) proposes to limit fine-tuning to the bias terms (or a subset of the bias terms, around 0.1% of the total parameters) of pre-trained BERT models, plus a task-specific classification layer. This is shown to be competitive with, and for some tasks better than, fine-tuning all of BERT. BitFit builds on the intuition that fine-tuning exposes existing capabilities, rather than teaching the model new ones.

Similarly, Radiya-Dixit and Wang (2020) show that it suffices to fine-tune only the “most sensitive” layers, i.e. those which are most distant in parameter space from the rest of the model. In parallel, they sparsify the model substantially by setting 1-4% of pre-trained parameters to zero. This retains performance, as also demonstrated by work like DistilBERT (Sanh et al., 2020) and other pruning studies (Prasanna et al., 2020, inter alia) which show that many parameters in a large PLM are redundant. In fact, Zhao et al. (2020a) propose masking, i.e. setting weights to zero, as a sole alternative to fine-tuning the model weights.
This approach freezes all the weights of the PLM, selects the weights that are relevant for a given task, and masks (discards) the rest. They train one mask per downstream task, with every layer masked except the embedding layer. While in principle this trains as many parameters as the original model, the mask is both binary and sparse and thus much simpler to learn and store. The initial sparsity of the mask is an important hyperparameter in this approach, as is deciding which layers to mask, since the different layers encode various degrees of syntactic and semantic knowledge (Tenney et al., 2019). Zhao et al. show that masking “top-down” (mostly the top layers, which are more task-specific and encode more semantic and long-distance information) is more effective than masking “bottom-up” (which would mask mostly the layers dealing with elementary word meaning and syntax). In particular, performance on CoLA increases as more layers are masked top-down. The authors further show that masking yields entirely comparable performance to fine-tuning on a range of tasks from POS tagging to reading comprehension.

3 Paradigm 2: Prompt-based Learning

We use prompting to refer to the practice of adding natural language text, often short phrases, to the input or output to encourage pre-trained models to perform specific tasks (Yuan et al., 2021). There are several advantages to using prompts. Prompting, especially in-context learning (e.g. Brown et al., 2020), may not require updates to the PLM’s parameters, reducing computational requirements as compared to fine-tuning approaches, or in addition to those described in Section 2.4.4. Prompts also encourage a better alignment of the new task formulation with the pre-training objective, leading to better use of knowledge captured in pre-training. The closer match also enables a few-shot approach (Liu et al., 2021b), especially for tasks with small training datasets; a good prompt can be worth hundreds of labeled data points (Le Scao and Rush, 2021). Finally, prompts allow probing of the PLMs, often in an unsupervised way, in order to assess the knowledge acquired by the PLM for specific tasks of interest (e.g. Petroni et al., 2019).

We discuss three types of prompt-based learning approaches below: learning from instructions and demonstrations, template-based learning, and learning from proxy tasks. Figure 3 shows illustrations for each of the three approaches.

Input | Output
(1) Text pair generation (Schick and Schütze, 2021)
Task: Write two sentences that mean the same thing. Sentence 1: “A man is playing a flute.” Sentence 2: | “He’s playing a flute.”
(2) Mathematical reasoning (Reynolds and McDonell, 2021)
f(x) = x ∗ x. What is f(f(3))? Let’s solve this problem by splitting it into steps. | f(f(3)) = f(3 ∗ 3) = 3 ∗ 3 ∗ 3 = 27. We can see that f(3) = 3 ∗ 3 = 9, so f(f(3)) = 27.

Table 2: Example prompt designs for learning from instructions.

3.1 Learning from Instructions and Demonstrations

First attempts made use of instructions such as “translate X to Y:” (Raffel et al., 2020) to simultaneously teach the model varied tasks in a text-to-text manner. However, this approach required a large amount of labeled data.

With the emergence of large generative PLMs (Radford et al., 2019), the first signs that language models are multi-task learners emerged. For instance, GPT-2 understands that if the instruction “TL;DR” (“too long; didn’t read”) is given, then it should generate a summary of the context following the instruction. More recently, and with even larger generative PLMs (GPT-3), Brown et al. (2020) showed that those models are indeed very good at few-shot learning. Brown et al. showed that GPT-3 can perform few-shot tasks via priming (in-context learning): given instructions and a few input/output pairs, GPT-3 is able to produce the desired outputs for new inputs. No gradient updates are performed (see Figure 3, left box). Caveats include the requirement of a very large LM to work well, and an inability to scale to more than a few examples, because the context window of most LMs is limited to a few hundred tokens. We refer readers to the GPT-3 paper (Brown et al., 2020) for many additional examples of learning from instructions and/or demonstrations.

Schick and Schütze (2021) and Reynolds and McDonell (2021) introduce new tasks based on descriptions. For example, the text pair generation task of Schick and Schütze (2021) consists of generating a continuation sentence (Sentence 2) given an input sentence (Sentence 1) and a description of the relations between the sentences (Table 2 (1)).8

8 A variation of this task consists of generating both Sentence 1 and Sentence 2 given the description (Schick and Schütze, 2021).
Figure 3: The three main prompt-based approaches. In instruction-based learning (left box), the instructions are marked in purple, the in-context examples in blue, and the prompt in cyan. In prompt-based learning (middle box), the text to classify is marked in light cyan and the prompt in dark cyan; the label verbalizations are shown in small boxes. In proxy-task-based learning (right box), prompts are marked in dark cyan, the context in light cyan, and the answers generated by the model in blue.
To address this task, Schick and Schütze (2021) use a generative PLM (GPT2-XL) that generates Sentence 2, replacing the token. Impressively, even mathematical reasoning can be handled (Table 2 (2)): Reynolds and McDonell (2021) show that by inserting a natural language prompt (“Let’s solve . . . steps.”) after the math problem statement, GPT-3 can generate a procedure that solves the math problem.

Recently, Wei et al. (2021) showed that teaching a very large PLM to follow instructions with supervised data improves the zero- and few-shot abilities of these PLMs. They carried out a large-scale multi-task experiment over more than 60 datasets grouped into 12 different tasks, and showed that a PLM trained via natural language instructions on other tasks outperforms a standard language model on the test task. Mishra et al. (2021) fine-tuned BART (Lewis et al., 2020) to perform a similar task using instructions and few-shot examples for a variety of crowd-sourced NLP tasks. The crowdsourcing process of each task consists of several steps that are natural and intuitive for human annotators. The instructions to the PLM match the step-by-step crowdsourcing instructions, decomposed into self-contained, separate tasks, leading to improved performance on unseen tasks, in contrast to an earlier work (Efrat and Levy, 2020) that reported negative performance when using the crowdsourcing instructions as-is.

Scaling limitations may affect the broad applicability of this approach: Wei et al. (2021) show that instruction tuning achieves significant improvements on held-out tasks in the zero-shot setting when using very large PLMs (e.g. with 68B or 137B parameters), but hurts performance when applied to PLMs with 10B parameters or less. In a similar setting, Sanh et al. (2021) showed that it is possible for a model with 11B parameters to benefit from instruction tuning, and identified three key differences compared to Wei et al. (2021). (1) They use an encoder-decoder model trained first with the MLM objective, then as a standard LM, and finally fine-tuned on a multitask objective, rather than a decoder-only autoregressive LM. (2) They argue that their prompts are qualitatively more diverse in terms of length and creativity. (3) They hold out multiple tasks at once, rather than only one at a time.

We note that the descriptions in instruction learning can be very detailed. For example, the crowdsourcing instructions in Mishra et al. (2021) contain the task definition, things to avoid, emphasis and caution (i.e. required properties for the output), and positive and negative examples.
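A minimal sketch of the priming format described above (an instruction, a few demonstrations, and a new input; no gradient updates), assuming a generative checkpoint available through transformers; gpt2 merely stands in for the much larger models this technique actually requires, and the task and examples are illustrative:

from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Instruction + in-context demonstrations + the new input; the model is only
# asked to continue the text, so no parameters are updated.
prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: Best pizza ever! Sentiment: Positive\n"
    "Review: The service was painfully slow. Sentiment: Negative\n"
    "Review: A wonderful, relaxing stay. Sentiment:"
)

inputs = tok(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=2, do_sample=False,
                        pad_token_id=tok.eos_token_id)
completion = tok.decode(output[0][inputs["input_ids"].shape[1]:])
print(completion.strip())  # ideally "Positive" (small models are unreliable at this)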
3.2 Template-based Learning

A more widely used approach, template-based learning, reformulates NLP tasks into tasks that are closer to language models’ pre-training tasks via template-based prompts. This better leverages the knowledge captured in the pre-training tasks, leading to a significant reduction in the number of task-specific training examples required to achieve a similar performance to previous approaches (Le Scao and Rush, 2021), or even eliminating the need for training data. To achieve this goal, template-based learning reformulates various NLP tasks into language modeling tasks via carefully designed templates with open slots.
Input | Output
(1) Topic/sentiment classification (Schick and Schütze, 2021a)
Best pizza ever! It was ___. | great → Positive
(2) Textual entailment (Schick and Schütze, 2021a)
Mia likes pie? ___, Mia hates pie. | No → Contradiction
(3) Event argument extraction (Chen et al., 2020c)
Americans sought to bring calm to Mosul, where U.S. troops killed 17 people in clashes earlier in the week. someone killed someone with something in some place at some time. | U.S. troops killed 17 people with something in Mosul at earlier in the week.
(4) Probing for relations/facts (Petroni et al., 2019)
Dante was born in ___. | Florence
(5) Probing for commonsense (Trinh and Le, 2019)
The trophy doesn’t fit in the suitcase because it is too big | it → trophy: 0.9; it → suitcase: 0.2
(6) Probing for reasoning (Talmor et al., 2020)
The size of an airplane is ___ than the size of a house. A. larger B. smaller | larger

Table 3: Example prompt designs for template-based methods. great → Positive means that the answer great will be converted to the label Positive. For Chen et al. (2020c), each of the underlined words (e.g. someone) will be replaced with the underlined phrase on the output side by a PLM. “it → trophy: 0.9” means that by replacing the underlined pronoun it with trophy, the modified sentence has a likelihood score of 0.9 according to the PLM.

In this way, solving the tasks is reduced to filling the slots with words or phrases using PLMs, and then projecting these outputs into the task-specific labels. Template-based learning differs from instruction learning (Section 3.1) in that templates are less detailed and do not explicitly describe the task.

3.2.1 Template Design

Using a cloze-style prompt design, inputs to a PLM are converted into a format such that solving the NLP task only requires the PLM to predict missing word(s). Table 3 (1) shows the most straightforward example of this approach, as applied in the sentiment detection domain.
sentiment detection domain.
For classification tasks, each predicted word or 3.2.2 Template Construction
phrase is converted into a class label of interest. Templates can be either manually crafted or au-
For example, we can design a cloze-style prompt tomatically generated. We here survey the differ-
for a textual entailment task in which the goal is to ent methods for generating them as well as ways
predict the entail/contradict relation between a pair to combine and manipulate the template-based
of input sentences hX1 , X2 i. Pattern-Exploiting prompts (multi-prompt learning).
Training (PET) (Schick and Schütze, 2021a) (Ta-
ble 3 (2)) converts a pair of inputs hX1 , X2 i into Manually-crafted Templates. Most early work
“X1 ? , X2 ” and asks a masked language model to in prompt-based learning uses some form of manu-
predict the missing word. The prediction (here yes ally crafted templates. For example, manual cloze
templates are used in Petroni et al. (2019) to probe the knowledge of the model, as well as in Schick and Schütze (2020), Schick et al. (2020) and Schick and Schütze (2021a) for text classification in a few-shot setting. Manually designed prefix prompts are leveraged in Brown et al. (2020) for QA, translation, and probing tasks for commonsense reasoning. The quality of the prompts impacts performance. Indeed, Zhao et al. (2021) showed that different prompts can cause accuracy to vary from near chance to near state-of-the-art.

Automatically-generated Discrete Templates. Discrete templates, which usually correspond to natural language phrases, are described in a discrete space. To search for such templates given a set of inputs and outputs, Jiang et al. (2021) proposed a mining-based approach called MINE that aims to find either the middle words or dependency paths between the inputs and outputs. A second approach (Jiang et al., 2021; Yuan et al., 2021) consists of paraphrasing an existing template prompt using back-and-forth machine translation, and then selecting the best prompt among the new paraphrases with guidance from a thesaurus. Prompt paraphrasing is also used by Haviv et al. (2021), who used a neural prompt rewriter that optimizes the accuracy of systems using the prompt. In that case, a different paraphrase is generated for each input. A third approach uses gradient-based search to find short sequences that can serve as a prompt (Wallace et al., 2019; Shin et al., 2020). Gao et al. (2021a) and Ben-David et al. (2021) further generate prompts using standard generation models such as T5 (Raffel et al., 2020). In the latter, the authors proposed a domain adaptation algorithm that trains T5 to generate unique domain-relevant features that can be concatenated with the input to form a template for downstream tasks.

Automatically-generated Continuous Templates. Continuous prompts, which perform prompting directly in the embedding space of the model, allow us to abstract away from natural language prompts (i.e. the prompts do not correspond to actual words) and from the parameters of the LM (Liu et al., 2021b). These continuous prompts often require tuning on task-specific data. Li and Liang (2021) propose prefix tuning, which prepends a sequence of continuous, task-specific vectors to the input while keeping the LM parameters frozen. This allows them to fine-tune just 0.1% of the total model parameters. A similar method is used by Lester et al. (2021), who differ from Li and Liang (2021) by adding special tokens to form a template and tuning the embeddings of these tokens directly, without introducing additional tunable parameters within each network layer. Continuous prefix tuning is also used by Tsimpoukelli et al. (2021) in the context of multimodal learning (language and vision), but in that case the prefix is sample-dependent. Tuning can be initialized with discrete prompts as in Zhong et al. (2021b), Qin and Eisner (2021) and Hambardzumyan et al. (2021). It can also be done by inserting some tunable embeddings into a hard prompt template as in Liu et al. (2021c) and Han et al. (2021b), who propose prompt tuning with rules (PTR). This uses manually crafted sub-templates to compose a complete template using logic rules (see Section 3.2.5 for its application to relation extraction).
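A sketch of the soft/continuous prompt idea (in the spirit of Lester et al., 2021, not their exact implementation): a small matrix of prompt embeddings is prepended to the input embeddings and is the only set of trainable parameters; PyTorch and transformers assumed, and the prompt length and checkpoint are illustrative.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
for p in model.parameters():          # the PLM itself stays frozen
    p.requires_grad = False

n_prompt = 10                                                  # number of soft prompt vectors
dim = model.config.hidden_size
soft_prompt = nn.Parameter(torch.randn(n_prompt, dim) * 0.02)  # the only trained weights

def forward_with_prompt(text: str):
    inputs = tok(text, return_tensors="pt")
    tok_embeds = model.get_input_embeddings()(inputs["input_ids"])      # (1, n, dim)
    embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)   # prepend the prompt
    mask = torch.cat([torch.ones(1, n_prompt, dtype=torch.long),
                      inputs["attention_mask"]], dim=1)
    return model(inputs_embeds=embeds, attention_mask=mask).logits

logits = forward_with_prompt(f"Best pizza ever! It was {tok.mask_token}.")
# Training would backpropagate a task loss into soft_prompt only.
print(logits.shape)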
It is worth noting that Logan IV et al. (2021) showed that fine-tuning PLMs in the few-shot setting can avoid prompt engineering, and that one can use prompts that contain neither task-specific templates nor training examples, and even null prompts that are simple concatenations of the inputs and the [MASK] token, and still achieve competitive accuracy on NLU tasks.

Multi-Prompt Learning. A number of approaches use prompt ensembling, augmentation, and decomposition/composition for a more flexible task design. We describe them below.

First, multiple prompts can be used for an input (dubbed prompt ensembling) at inference time. The prompts can be combined using a uniform average (Jiang et al., 2021; Schick and Schütze, 2021a; Yuan et al., 2021) or a weighted average (Jiang et al., 2021; Qin and Eisner, 2021; Schick and Schütze, 2021a,b). Another way to combine the prompts is majority voting to combine the results of the different prompts, as in Lester et al. (2021) and Hambardzumyan et al. (2021). Knowledge distillation (Allen-Zhu and Li, 2021), where the idea is that the knowledge present in an ensemble of models can be distilled into a single model, has been borrowed into the context of prompt combination by Schick and Schütze (2021a,b); Schick and Schütze (2020) and Gao et al. (2021a), where for each template-answer pair a separate model is trained, before ensembling them to annotate an unlabeled dataset. Then, the authors train a new
model to distill the knowledge from the annotated dataset. In the case of generation tasks, Schick and Schütze (2020) trained a separate model for each prompt. Then the model outputs were scored by averaging their generation probability across all models.
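A sketch of uniform-average prompt ensembling for the cloze-style sentiment setup described earlier (transformers assumed; the templates and the verbalizer are illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

templates = ["{x} It was {m}.", "{x} All in all, it was {m}.", "A {m} experience: {x}"]
verbalizer = {"Positive": "great", "Negative": "terrible"}
label_ids = {lab: tok.convert_tokens_to_ids(w) for lab, w in verbalizer.items()}

def ensemble_classify(x: str) -> str:
    avg = {lab: 0.0 for lab in verbalizer}
    for template in templates:                      # uniform average over prompts
        prompt = template.format(x=x, m=tok.mask_token)
        inputs = tok(prompt, return_tensors="pt")
        pos = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
        with torch.no_grad():
            probs = model(**inputs).logits[0, pos].softmax(dim=-1)
        for lab, tid in label_ids.items():
            avg[lab] += probs[tid].item() / len(templates)
    return max(avg, key=avg.get)

print(ensemble_classify("Best pizza ever!"))  # expected: Positive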
Second, prompts can be decomposed or composed to more effectively solve an NLP task. Decomposition involves finding sub-problems for which prompts can be generated separately. For example, Cui et al. (2021) proposed an approach for named entity recognition, where the different prompts for each candidate span were created and predicted separately.

Third, augmentation methods such as demonstration learning (Gao et al., 2021a) create more descriptive prompts, as in a multiple-choice problem. Lu et al. (2021a) showed that both the choice of examples in the prompts and the order of the prompts can considerably affect the results. To select the examples from which the PLM must choose the correct response (example sampling), Gao et al. (2021a) and Liu et al. (2021a) used sentence embeddings to find examples semantically close to the input. Mishra et al. (2021) used both positive and negative examples, teaching the PLM types of items to avoid in performing new tasks with only instructions. As for the order of the selected examples (sample ordering), Kumar and Talukdar (2021) searched for the best permutation of prompts and also learned a segmentation token to separate between the prompts. They showed the usefulness of this method for few-shot learning on the task of sentiment classification.

3.2.3 Answer Generation

There are two main types of answers to prompts: those that map to a classification label (e.g. Yin et al., 2019; Cui et al., 2021), and those intended as the final answer (e.g. Petroni et al., 2019; Jiang et al., 2020; Radford et al., 2019). For classification tasks, typically addressed with cloze-style prompts, the developers identify a subset of words and phrases from which the PLM may choose, and that choice is easily mapped to the class of interest. For instance, in a sentiment detection task, the PLM may answer a prompt with “good,” “great,” or “excellent,” all of which are mapped to a “positive” sentiment label. The second type of answer, free text, prevails for text generation tasks. Examples of both types are shown in Table 3.

In either case, the definition of the answer space may be optimized to produce ideal prompt responses. Jiang et al. (2021) used paraphrasing to extend the search space with back translation (translating to another language, then back to the original). Another approach, explored by Schick and Schütze (2021a), Schick et al. (2020), Shin et al. (2020) and Gao et al. (2021a), is prune-then-search, a two-step method where the answer space is pruned, for example by only selecting a subset of words according to their zero-shot accuracy on the training data (Gao et al., 2021a), and then an answer is searched for in the pruned space. An approach called label decomposition optimizes the search space by modeling the label names for comparison to the answer tokens; for instance, in Chen et al. (2021d) the decomposed relation labels (their individual tokens) represent the answer space. Lastly, Hambardzumyan et al. (2021) add a virtual token for each class label and optimize its embedding together with the token embeddings of the prompts, using gradient descent. This gradient descent optimization approach allows direct optimization of the answers instead of using a discrete search.

3.2.4 Task-specific Tuning

While prompts can be directly used in a zero-shot, unsupervised setting, prompts have also been used in fully supervised or few-shot settings where either all or part of the task-specific training data is available. Two main approaches currently prevail for tuning a PLM with prompts.

The first approach uses a fixed template-style prompt to perform tuning of the PLM. Here, a fixed template is usually applied to every training and test example as in the PET-TC (Schick and Schütze, 2021a), PET-Gen (Schick and Schütze, 2020) and LM-BFF (Gao et al., 2021a) models. Le Scao and Rush (2021) quantified the benefit of using prompts in classification tasks by fine-tuning in equal conditions across many tasks and data sizes. They showed that prompting consistently improves the results across tasks over just fine-tuning, that it is most robust to the choice of pattern, and that it can be learned without an informative verbalizer (a function that maps each label to a single vocabulary token). Logan IV et al. (2021) showed that only tuning 0.1% of the parameters in the prompt-based few-shot setting can achieve comparable or better accuracy than standard fine-tuning. For this purpose, they explored different ways to perform memory-efficient fine-tuning, including (i) Adapters (Houlsby et al., 2019), which are neural
network layers inserted between the feed-forward portion of the Transformer architecture (see Section 2.4.4); (ii) BitFit (Zaken et al., 2021), where only the bias terms inside the Transformer are updated; (iii) PLM head tuning, where the embeddings in the MLM output layer that are associated with the tokens of the verbalizer are updated; and (iv) Calibration (Zhao et al., 2021), where an affine transformation on top of the logits associated with the verbalizer tokens is learned. They found that the best results are achieved using BitFit.

The second approach is joint tuning of the prompt and the PLM. Here, prompt-relevant parameters are fine-tuned together with all or some of the parameters of the PLM, as in PADA (Ben-David et al., 2021), where the prompts are properties of source domains, generated based on their relatedness to the input example (from a new domain), and P-Tuning (Liu et al., 2021c), which makes use of trainable continuous prompt embeddings when applying GPT models to NLU tasks. Fine-tuning both the model and the prompt-relevant parameters makes this approach very expressive. On the other hand, it requires the storage of all the parameters, which makes it less applicable to small datasets (Liu et al., 2021b).

It is worth noting that task-specific training can also be used earlier, during the construction and validation of the prompts. Indeed, as pointed out by Perez et al. (2021), previous PLM-based few-shot learning approaches used many held-out examples to tune various aspects of learning, such as hyperparameters, training objectives, and natural language templates (“prompts”). Perez et al. (2021) propose instead to evaluate the few-shot ability of PLMs in a true few-shot learning setting, where such held-out examples are unavailable.

3.2.5 Applications of Template-based Methods

Template-based prompting methods are currently applied to a growing list of NLP tasks. We provide a survey of how recent studies have addressed a varied set of NLP applications.

Text Classification. In Puri and Catanzaro (2019), natural language descriptions of classification tasks were given as input. Then, the model was trained to generate the correct answer in natural language via a language modeling objective, aiming to generalize to new classification tasks without task-specific tuning.

Information Extraction (IE). Cui et al. (2021) considered the NER task as a language model ranking problem in a sequence-to-sequence framework where the source sequence corresponds to the original sentence and the target sequence corresponds to the template prompt, filled by candidate spans. For the relation extraction task, Han et al. (2021b) proposed a model called Prompt Tuning with Rules (PTR), which applies logic rules to construct prompts with several sub-prompts. Chen et al. (2021d), instead of using rules, constructed the prompts by leveraging learnable virtual template words and virtual answer words. Their representation is synergistically optimized with knowledge constraints. For the event extraction task in a cross-lingual setting, Fincke et al. (2021) proposed using the event type and an integer representing the argument type as prefixes.

Knowledge Probing. Factual probing has been explored in particular by Petroni et al. (2019) and Jiang et al. (2020) to quantify the amount of factual knowledge already present in the PLMs, providing the LAMA and X-FACTR datasets, respectively. Other works that investigated model knowledge with discrete template search include Petroni et al. (2020), Jiang et al. (2021), Haviv et al. (2021), Shin et al. (2020) and Perez et al. (2021). Continuous template learning was used in Qin and Eisner (2021), Liu et al. (2021c) and Zhong et al. (2021b). Prompt ensemble learning was applied to knowledge probing by Jiang et al. (2021) and Qin and Eisner (2021).

In addition to factual knowledge, additional types of knowledge that have been probed using the cloze test include commonsense (Trinh and Le, 2019), relational knowledge (Petroni et al., 2019), reasoning (Talmor et al., 2020) and understanding rare words (Schick and Schütze, 2019). For commonsense reasoning, Winograd Schemas (Levesque et al., 2012) require the model to identify the antecedent of an ambiguous pronoun within a context, or involve completing a sentence given multiple choices. For commonsense knowledge mining, Feldman et al. (2019) construct a candidate piece of knowledge as a sentence, then use a language model to approximate the likelihood of the text as a proxy for its truthfulness.

Prompts can also be used to explore the linguistic knowledge of PLMs, focusing on different phenomena such as analogies (Brown et al., 2020), negation (Ettinger, 2020) or semantic similarity (Sun
et al., 2021). Linguistic evaluation of language problematic text (self-debiasing), using a textual
models (Linzen et al.; Gulordava et al., 2018; Gold- description of the undesired behavior.
berg, 2019; Tran et al., 2018; Bacon and Regier, Shin et al. (2021) explore the use of PLMs as
2019; McCoy et al., 2020; Linzen, 2020) usually few-shot semantic parsers. The authors use GPT-3
considers minimal pairs of grammatical and non- to convert text into a canonical text (in a controlled
grammatical sentences addressing a specific phe- sub-language) satisfying a grammar, that is then
nomenon that differs in a single place in the sen- automatically mapped to the target structured mean-
tence. To succeed, a model must score the grammat- ing representation.
ical sentence higher than its ungrammatical coun-
terpart. A main resource in this context is BLiMP 3.3 Learning from Proxy Tasks
(Benchmark of Linguistic Minimal Pairs, Warstadt Templates and prompts play a role again in an indi-
et al., 2020a) which provides minimal pairs for var- rect approach to NLP tasks called “proxy tasks”.
ious grammatical phenomena. Recently, the use of Examples for the use of this approach are emo-
this benchmark was adapted for language acquisi- tion classification or event and argument extraction,
tion research (Huebner et al., 2021): the authors both shown in Figure 3 (right box) with prompt-
probe a RoBERTa-based model pre-trained on tran- based proxy tasks. See Table 4 for additional ex-
scriptions of child-directed speech (MacWhinney, amples of proxy tasks and prompt design.
2000) to complete the benchmark task. The pref- The key distinction between learning from proxy
erence score can be calculated either holistically, tasks and previous methods is the use of supervised
summing the cross-entropy errors at each position Natural Language Understanding (NLU) tasks as
in the sentence (Zaczynska et al., 2020; Huebner a proxy instead of self-supervised language mod-
et al., 2021), or in an MLM-based way, where each eling for the target task. Indeed, taking advantage
candidate sentence is masked by a language model of large NLU datasets for extra supervision results
multiple times with the mask changing position. in better zero and few-shot performance in the tar-
The score is computed by summing the log-losses get task with relatively small PLMs (Wang et al.,
at the different masked positions (Salazar et al., 2021b), commonly RoBERTalarge at 345M param-
2020). eters. Knowledge-rich classification tasks in par-
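To make the MLM-based scoring concrete, the sketch below masks each position in turn and sums the log-probabilities of the original tokens, in the spirit of Salazar et al. (2020). It assumes the Hugging Face transformers library and a roberta-base checkpoint; both are illustrative choices rather than the exact setup of the works cited above.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")
    model.eval()

    def pseudo_log_likelihood(sentence: str) -> float:
        enc = tokenizer(sentence, return_tensors="pt")
        input_ids = enc["input_ids"][0]
        total = 0.0
        # Skip the special tokens at the start and end of the sequence.
        for i in range(1, input_ids.size(0) - 1):
            masked = input_ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits
            log_probs = torch.log_softmax(logits[0, i], dim=-1)
            total += log_probs[input_ids[i]].item()
        return total

    # The grammatical sentence should receive the higher (less negative) score.
    print(pseudo_log_likelihood("The keys to the cabinet are on the table."))
    print(pseudo_log_likelihood("The keys to the cabinet is on the table."))

The sentence of a minimal pair with the higher score is taken as the model's preference, which is how the comparison described above can be carried out in practice.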
ticular benefit from PLM proxy tasks, because the
Other tasks. The PET procedure (Schick and latter can reformulate the class label as a prompt,
Schütze, 2021a) was also applied to the Textual taking advantage of the meaning of class labels
Entailment task. QA is addressed in Khashabi et al. instead of treating them as indices. In this section,
(2020) with appropriate prompts from the context we describe the main proxy-task-based learning
and questions, formulating several QA tasks into approaches using QA (Section 3.3.1) and Textual
a unified text generation problem with encoder- Entailment (Section 3.3.2).
decoder pre-trained models such as T5.
Prompts have also been used for the evalua- 3.3.1 Question Answering as Proxy Task
tion of text generation. Yuan et al. (2021) used In a strong move away from traditional informa-
prompts in the BARTSCORE-PROMPT variant of tion extraction, recent studies replace modeling of
the BARTSCORE measure they propose that treats explicit entity, relation, and event classes with nat-
the evaluation of various text generation tasks as a ural language questions that get at the exact item
generation problem. In BARTSCORE-PROMPT, of interest. Questions can be used to probe for the
prompts are either appended to the source text or required information in the text.
prepended to the target text and are shown to be The choice of using QA as a proxy task is mo-
useful. For example, adding the phrase “such as” to tivated by the relative ease of answering simple
the translated text when using pre-trained models questions, as compared to performing expert anno-
significantly improves the correlation with human tation for complex linguistic phenomena.
evaluation on German-English machine translation In information extraction tasks, question
evaluation. prompts typically address identification and clas-
Schick et al. (2021) showed that PLMs are able sification jointly, by constructing the question to
to recognize the toxicity of the text they produce identify a particular type. For example, the ques-
(self-diagnosis). Then they proposed an algorithm tion “Who bought something?” will produce an
that permits the language model to produce less answer specific to the Buyer argument role in an
Application | Work | Task design | Prompt design

Relation Extraction | Li et al. (2019a) | Use question-answering to identify the most appropriate entity span, given an incomplete text and an indication of the class type. | Input: The armory is north of the music center. Prompt: Find a facility near E1? E1, physical, facility

Relation Extraction | Sainz et al. (2021) | Use textual entailment to determine the likelihood of a candidate relation (such as PlaceOfDeath(X,Y)) given an input sentence. | Input: Gary’s car crash occurred in Houston; Prompt: Gary died in Houston

Event Extraction | Du and Cardie (2020) | Use a series of ordered questions, each leveraging the output of the previous answer, to find event triggers and appropriate arguments. | (1) Input: Donna purchased a new laptop; Prompt: What is the trigger? purchased (2) Prompt: What was purchased? laptop

Topic and Sentiment Classification | Yin et al. (2019) | Use textual entailment to determine whether a topic name T is suitable for a text. | Input: Dinosaurs and humans never coexisted. Prompt: This text is about T.

Topic and Sentiment Classification | Puri and Catanzaro (2019) | Use question answering to probe for a topic or sentiment name from among a closed set of responses. | Input: Dinosaurs and humans never coexisted. Prompt: How is the text best described? T1, T2, or T3

Coreference Resolution | Wu et al. (2020b) | Use question-answering to find a coreferent mention of a marked mention from within the same text. | Input: I arrived at the party with my tux on, and introduced myself as George. I told them that <mention> I </mention> was hired to do some Christmas music; Prompt: Who does “I” refer to?

Table 4: Examples of task design and example prompts for four different applications of prompt-based proxy tasks.
event of type Exchange-Ownership (see Figure 3, where natural questions for argument identifica-
right box). tion (“What plays the role?”) and argument clas-
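As an illustration of this question-as-type-probe idea, the sketch below poses role-specific questions against a sentence using an extractive QA model. It assumes the Hugging Face transformers pipeline API and a SQuAD-style checkpoint; the model name is an assumption for illustration, not the system used in the cited works.

    from transformers import pipeline

    # An off-the-shelf extractive QA model fine-tuned on SQuAD-style data.
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    context = "Donna purchased a new laptop at the electronics store on Friday."
    # Each question targets one argument role of the Exchange-Ownership event.
    print(qa(question="Who bought something?", context=context))   # Buyer role
    print(qa(question="What was purchased?", context=context))     # Artifact role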
Li et al. (2020c) formulates Named Entity sification (“What is the role?”) mutually improve
Recognition (NER) as a QA problem. For ex- each other.
ample, the prompt “which person is mentioned Chen et al. (2020c) reformulated event extrac-
in the text?” will identify a mention classified as a tion as a cloze task with QA model based on BERT
PERSON. The proposed BERT-based system per- and the SQuAD 2.0 dataset (Rajpurkar et al., 2018).
forms detection of multiple spans through the use Question answering is used directly, preserving
of separate binary classifiers identifying start and the QA format, in Du and Cardie (2020), Feng
end tokens. The authors incorporate synonyms and et al. (2020a), Li et al. (2020a), Zhou et al. (2021)
examples into the queries. and Liu et al. (2020a) for argument extraction, in-
Wu et al. (2020b) formulated coreference reso- cluding the argument identification and classifica-
lution as a span prediction task via QA, where a tion sub-tasks. In these cases the event extraction
query is generated for each candidate mention us- training data is converted to the QA format, where
ing its surrounding context, and a span prediction the questions are derived from the ontology. Liu
module uses the query to extract the coreference et al. (2020a) also experimented in a zero-shot set-
spans in the document. ting where no task-specific data is used for train-
Levy et al. (2017) first formulated relation ex- ing, only using prompts for probing. The zero-
traction as a QA task. This approach has been shot setting for the full event extraction pipeline
pursued in the context of PLMs by Li et al. (2019a) has been explored in Lyu et al. (2021) where QA-
and Zhao et al. (2020b). Han et al. (2021b) ad- based prompts are used for argument extraction and
dresses relation extraction with sub-prompts for en- prompts based on Textual Entailment (Dagan et al.,
tity recognition and relation classification, com- 2013) are used for trigger classification (see Sec-
posing them into a complete prompt using logic tion 3.3.1 below). Several ablation experiments an-
rules. Both types of questions are used to probe a alyzed the different components of the system such
QA system in a supervised setting to perform the as the choice of PLM, the choice of QA dataset and
two sub-tasks. Task decomposition is also used in the way to generate the questions (fixed vs. contex-
the work of Zhou et al. (2021) for event extraction tualized). It was shown in particular that RoBERTA
trained on QAMR (Michael et al., 2018) achieved the best results for argument extraction.

Identification-only sub-tasks, such as trigger identification (Du and Cardie, 2020), are addressed by more general questions, e.g. “What is the trigger?”. In contrast, Zhou et al. (2021) use separate questions to address the identification and classification of arguments.

Du et al. (2021a) addressed slot-filling, which aims to extract task-specific slot fillers (for example, a flight date) from user utterances, by formulating it as a QA task. In particular, they addressed the zero-shot slot-filling problem, where the model needs to predict spans and their values given utterances from new, unsupervised domains. Extracting slot-filler spans from utterances with a QA model improved the performance compared to a direct encoding of the slot descriptions.

Lastly, Gao et al. (2019) formulated the dialogue state tracking task, which aims to estimate the current belief state of a dialog given all the preceding conversation, as a QA problem. The proposed system uses a simple attention-based neural network to point to the slot values within the conversation. This direction was pursued by Gao et al. (2020b), who also included a multiple-choice setting, where several candidate values for each slot in the question are given. The latter setting was also investigated by Zhou and Small (2020), who further improved the results. Namazifar et al. (2020) used this approach to address language understanding problems in the dialogue context, experimenting on ATIS (Airline Travel Information Systems, Hemphill et al., 1990) and on the Restaurants-8k dataset (Coope et al., 2020).

QA Task Design. Questions are typically generated via hand-crafted templates derived from the task-specific ontologies. Some of the works introduce contextualization, integrating relevant words from the text into the question. For example, in argument extraction, the question can include the trigger extracted from the text (e.g. Liu et al., 2020a; Lyu et al., 2021) or another argument that was previously identified (Li et al., 2020a) (see the Event Extraction row in Table 4). Neural question generation models can also improve the quality of the question, as in Liu et al. (2020a), where monolingual unsupervised machine translation (Lample et al., 2018) is used to generate the part of the question that does not depend on the template, translating a descriptive statement into a question-style expression.

Other aspects of QA-style proxy tasks are the ability to use multiple questions, and to formulate questions in any style. In addition to sequential questions for determining event arguments, multiple formulations of the same question may be used in a weighted voting scheme to generate an ensemble answer (Zhao et al., 2020b). The input to the QA system need not necessarily include natural questions. It may instead consist of pseudo-questions such as keywords, synonyms, position index of labels, or a single word/type from the ontology or annotation guidelines (e.g. Li et al., 2020c; Du and Cardie, 2020).

PLMs fine-tuned on the SQuAD 2.0 dataset (Rajpurkar et al., 2018) or on QAMR are particularly useful to initialize QA-style prompt-based learning methods.9 With the advent of web-scale QA datasets (Huber et al., 2021), QA-infused PLMs may provide significantly richer representations, enabling a wider range of applications.

9 Fine-tuning a PLM on QAMR corresponds to the p-QuASE representation presented in He et al. (2020).

3.3.2 Textual Entailment as Proxy Task
Textual Entailment is a popular proxy for classification tasks (Yin et al., 2019), as these models have shown a striking ability to perform few-shot learning. Wang et al. (2021b) hypothesize that this phenomenon might be because the entailment task is a true language understanding task; a model that performs entailment well is likely to succeed on similarly-framed tasks. An example of textual entailment as a proxy for emotion classification is shown in Figure 3, while an example of its use for topic detection is shown in Table 4.

For entailment prompting, developers define a template that describes the task, and create a natural language version (“verbalization”) of each potential label. Multiple hypotheses for entailment are produced by inserting the potential labels into the template. The inference is performed by selecting the most probable candidate hypothesis given the input. Some recent works also make use of multiple verbalizations for each label to boost the system performance (Sainz and Rigau, 2021; Sainz et al., 2021).

Sainz et al. (2021) also proposed an approach to guiding the “art” that is prompt crafting more towards a “science”: the authors fine-tune a model on Textual Entailment data and use the model’s probability of a prompt given the template, applied
on the guideline example(s), to measure the quality ers to Section 2 for more detailed discussion of
of manually designed prompts. this “pre-train then fine-tune” approach. In this
Obamuyide and Vlachos (2018) reformulated section, we focus on tasks that are not traditionally
relation extraction as a textual entailment task. text generation tasks.
This approach has been pursued in the context of
PLMs by Sainz et al. (2021). Reformulating NLP Tasks as Text Generation
Roughly equivalent to textual entailment is Problems Pre-trained from large corpora, PLMs
Yes/No Question Answering (Clark et al., 2019a) demonstrate an extraordinary ability to generate
where a model is asked about the veracity of some text. PLMs also capture rich knowledge that could
fact given a passage. It has also been used as a be used for many NLP tasks and show strong per-
proxy task for text classification by Zhong et al. formance on learning new patterns via fine-tuning.
(2021a). These factors lead to the hypothesis that many NLP
PLMs need to be fine-tuned to solve the textual tasks can be reformulated as text generation prob-
entailment task. They are commonly fine-tuned on lems. In particular, given an NLP task with an input
MNLI (Williams et al., 2018), but other datasets text x, this approach first attempts to design an out-
such as SNLI (Bowman et al., 2015), FEVER put sequence y that includes information about the
(Thorne et al., 2018), ANLI (Nie et al., 2020) or desired labels for x (e.g. markers). Then, a PLM
XNLI (Conneau et al., 2018) are also used. In ad- directly generates y, conditioning on the input x,
dition, data from different tasks can be used when modeling P (y|x). In this formulation, the desired
framed properly (Zhong et al., 2021a). labels/outputs for the task on x need to be retrieved
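A minimal sketch of the entailment-prompting procedure described above, assuming the Hugging Face transformers library and the publicly available roberta-large-mnli checkpoint; the template and label set are illustrative, not those of the cited systems.

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "roberta-large-mnli"  # assumed MNLI-fine-tuned checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def classify(premise: str, labels: list, template: str = "This text is about {}.") -> str:
        scores = []
        for label in labels:
            hypothesis = template.format(label)  # verbalize the label
            enc = tokenizer(premise, hypothesis, return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits[0]
            # For this checkpoint the label order is (contradiction, neutral, entailment).
            scores.append(torch.softmax(logits, dim=-1)[2].item())
        return labels[scores.index(max(scores))]

    print(classify("Dinosaurs and humans never coexisted.", ["history", "science", "sports"]))

Fine-tuning the entailment model on a handful of task-specific examples, as in the few-shot settings discussed above, would typically improve over this purely zero-shot use.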
unambiguously from y, requiring y to be generated
4 Paradigm 3: NLP as Text Generation in a valid format by the design of the reformulated
task. In addition to the label information, evidence
The success of generative Transformer-based useful for providing context can also be incorpo-
PLMs10 such as GPT, BART, and T5 has recently rated into the formulation of y to aid the generation
sparked interest in leveraging generative PLMs to process. To train the PLMs, the original training
solve various non-generative NLP tasks. These data of the NLP task is first converted into pairs
tasks include, but are not limited to, traditional dis- (x, y) following the designed format. The PLMs
criminative tasks such as classification and struc- are usually fine-tuned with such pairs using the
ture prediction. For example, Figure 4 illustrates standard maximum likelihood loss.
this “text-to-text” approach as described in Raf- There are a few advantages of this approach.
fel et al. (2020). Instead of using traditional dis- First, in this formulation, a unified text-to-
criminative models for NLP tasks, these tasks are text/seq2seq framework can be used to solve differ-
reformulated as text generation problems so that ent NLP tasks via encoder-decoder architectures,
they can be directly solved with generative PLMs. thus facilitating multi-task learning and transfer
The generated output sequences usually include the learning across tasks of different natures (Raffel
desired labels or other auxiliary information for the et al., 2020). Second, the direct generation of la-
given task, enabling accurate reconstruction of the bels in output sequences allows the PLMs to exploit
expected class labels (i.e. to avoid ambiguities in the semantics of the labels to improve the perfor-
mapping) and facilitating the generation/decoding mance and data efficiency, a benefit that cannot be
process (i.e. to provide sufficient context for pre- achieved in discriminative models (Paolini et al.,
dictions). 2021). Finally, when adapting to structure predic-
It is worth noting that some NLP tasks are al- tion problems, PLM-based models can naturally
ready text generation tasks. Therefore, a straight- capture the inter-dependencies between prediction
forward strategy for those tasks is to fine-tune a steps/tasks in the modeling process to further im-
generative PLM using task-specific training data to prove the performance (Athiwaratkun et al., 2020).
perform the specific tasks of interest. Examples in- As such, the formation of the output sequence
clude Machine Translation (Cooper Stickland et al., y for an input x is critical for the performance
2021), text summarization (Lewis et al., 2020), text of the PLM-based methods. Existing works tend
style transfer (Lai et al., 2021), etc. We refer read- to customize such output sequences for specific
NLP tasks to better capture the nature of the tasks. Therefore, in the rest of this section, we group
10 In this section and the next, we use the term PLM to refer to a generative PLM.
Figure 4: An illustration of the T5 (Raffel et al., 2020) text-to-text generation approach for machine translation,
linguistic acceptability, semantic textual similarity, and summarization tasks. Figure source: Raffel et al. (2020).
prior works according to their strategies in design- second text span might be annotated with both the
ing the output sequences to solve NLP tasks with relation label and an indicator of the first text span.
generative models, and discuss their representative For example, for the joint entity and relation
techniques in each subsection. Table 5 provides a extraction task, the input sentence x can be trans-
brief summary. formed into the label-augmented output sequence
y, where (1) the square brackets indicate token
4.1 Generating Label-Augmented Texts spans for entity mentions; (2) person and book
are the corresponding entity type labels; and
In this strategy, the output sequence y copies (3) author=Tolkien indicates the author re-
the input text x and augments it with additional lation between Tolkien and The Lord of
markers that can be decoded into desired label an- the Rings:
notations for x for a given NLP task. The repetition of the words from the input text aims to provide explicit context to reduce ambiguity for the generation process (Paolini et al., 2021). This strategy is often applied to structure prediction tasks that aim to jointly extract the text spans of interest and their relations or dependencies in an input text.

x = Tolkien’s epic novel The Lord of the Rings was published in 1954-1955.
y = [Tolkien|person]’s epic novel [The Lord of the Rings|book|author=Tolkien] was published in 1954-1955.
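A minimal sketch of how such an (x, y) pair can be used to fine-tune an encoder-decoder PLM with the standard maximum-likelihood objective; it assumes the Hugging Face transformers library and a t5-small checkpoint, both of which are illustrative stand-ins for the training setups of the cited works.

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

    x = "Tolkien's epic novel The Lord of the Rings was published in 1954-1955."
    y = "[ Tolkien | person ]'s epic novel [ The Lord of the Rings | book | author = Tolkien ] was published in 1954-1955."

    inputs = tokenizer(x, return_tensors="pt")
    labels = tokenizer(y, return_tensors="pt").input_ids

    outputs = model(**inputs, labels=labels)  # cross-entropy over the target tokens
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()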
Athiwaratkun et al. (2020) explores the idea of
label-augmented text generation for sequence la- In order to transform the generated label-
beling problems, e.g. slot filling (identifying spans augmented texts into desired annotations, Paolini
that define the left or right “slot” of a relationship) et al. (2021) uses dynamic programming to match
and Named Entity Recognition (NER). Given an the generated output sequence and the input text,
input sentence, the output sequence is formed by searching for the closest entity mention that exactly
marking the token sequences for the slots or entity matches the predicted tail entity and discarding in-
types of interest, for instance with square brack- valid entity/relation types. Similarly, Zhang et al.
ets or another identifier. The corresponding labels (2021a) utilize label-augmented text generation for
are then introduced immediately after the token different variations of aspect-based sentiment anal-
sequences, within the brackets, separated by a ysis (ABSA), including aspect opinion pair extrac-
bar token “|”. The encoder-decoder PLM T5 tion, unified ABSA, aspect sentiment triplet extrac-
is used to generate label-augmented texts. Paolini tion, and target aspect sentiment detection. Zhang
et al. (2021) extends this idea to other structure et al. (2021a), also propose a normalization pre-
prediction tasks, including joint entity and relation diction mechanism: if a generated token does not
extraction, relation classification, semantic role la- belong to the original sentence or set of expected
beling (SRL), event extraction, coreference reso- labels, the closest word from the input sentence
lution, and dialogue state tracking. To encode a using the Levenshtein distance is used instead.
relation between two text spans in the input text, the
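A minimal sketch of the normalization mechanism of Zhang et al. (2021a) described above; the edit-distance implementation and the example tokens are written out here for illustration and are not taken from the cited work.

    def levenshtein(a: str, b: str) -> int:
        # Standard dynamic-programming edit distance between two strings.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def normalize(token: str, input_words: list, labels: list) -> str:
        # Keep tokens that already belong to the input or the label set;
        # otherwise fall back to the closest input word under edit distance.
        if token in input_words or token in labels:
            return token
        return min(input_words, key=lambda w: levenshtein(token, w))

    print(normalize("Unibodi", ["The", "Unibody", "construction", "is", "solid"], ["positive"]))
    # -> "Unibody"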
Work | Task | Example (Input → Output), grouped by output type.

Label-augmented Text:
Paolini et al. (2021) | Joint Entity and Relation Extraction | Input: Tolkien’s epic novel The Lord of the Rings was published in 1954-1955. → Output: [ Tolkien | person ]‘s epic novel [ The Lord of the Rings | book | author = Tolkien ] was published in 1954-1955
Paolini et al. (2021) | Relation Classification | Input: Born in Bologna, Orlandi was a student of the famous Italian [ soprano ] and voice teacher [ Carmen Melis ] in Milan. The relationship between [ Carmen Melis ] and [ soprano ] is → Output: relationship between [ Carmen Melis ] and [ soprano ] = voice type
Paolini et al. (2021) | Semantic Role Labeling | Input: The luxury auto maker last year [ sold ] 1,214 cars in the U.S. → Output: [ The luxury auto maker | subject ] [ last year | temporal ] sold [ 1,214 cars | object ] [ in the U.S. | location ]
Paolini et al. (2021) | Event Extraction | Input: Two soldiers were attacked and injured yesterday → Output: Two soldiers were [ attacked | attack ] and [ injured | injury ] yesterday
Paolini et al. (2021) | Coreference Resolution | Input: Barack Obama nominated Hillary Rodham Clinton as his secretary of state on Monday. → Output: [ Barack Obama ] nominated [ Hillary Rodham Clinton ] as [ his | Barack Obama ] [ secretary of state | Hillary Rodham Clinton ] on Monday
Paolini et al. (2021) | Dialogue State Tracking | Input: [ user ] : I am looking for a cheap place to stay [ agent ] : How long? [ user ] : Two → Output: [ belief ] hotel price range cheap, hotel type hotel, duration two [ belief ]
Athiwaratkun et al. (2020) | Slot Filling | Input: Add Kent James to the Disney soundtrack → Output: (( AddToPlaylist )) Add [ Kent James | artist ] to the [ Disney | playlist ]
Athiwaratkun et al. (2020) | Named Entity Recognition | Input: He is John Wethy from NBC News → Output: He is [ John Wethy | person ] from [ NBC News | org ]
Zhang et al. (2021a) | Aspect Opinion Pair Extraction | Input: Salads were fantastic, our server was also very helpful. → Output: [Salads | fantastic] were fantastic, our [server | helpful] was also very helpful.
Zhang et al. (2021a) | Aspect Sentiment Triplet Extraction | Input: The Unibody construction is solid, sleek and beautiful. → Output: The [Unibody construction | positive | solid, sleek, beautiful] is solid, sleek and beautiful.
Zhang et al. (2021a) | Target Aspect Sentiment Detection | Input: The pizza was cold. → Output: The [pizza | food quality | negative] was cold.

Generating Word Indices:
Yan et al. (2021b) | Named Entity Recognition | Input: have muscle pain and fatigue → Output: 237256
Yan et al. (2021a) | Aspect Term Extraction | Input: The wine list is interesting and has good values , but the service is dreadful → Output: 1, 2, 12, 12
Yan et al. (2021a) | Opinion Term Extraction | (same input) → Output: 4, 4, 7, 8, 14, 14
Yan et al. (2021a) | Aspect-level Sentiment Classification | (same input) → Output: 1, 2, Positive
Yan et al. (2021a) | Aspect-oriented Opinion Extraction | (same input) → Output: 1, 2, 4, 4, 7, 8
Rongali et al. (2020) | Slot Filling | Input: play the song don’t stop believin by journey → Output: PlaySongIntent SongName( @ptr3 @ptr4 @ptr5 )SongName ArtistName( @ptr7 )ArtistName

Generating Answers:
Wang et al. (2021a) | Closed-book QA | Input: What is Southern California often abbreviated as? → Output: Southern California, often abbreviated SoCal, is . . . ANSWER SoCal
Hsu et al. (2021) | Answer Selection | Input: How a water pump works? → Output: A water pump is a device that moves fluids by mechanical action.

Filling Templates:
Du et al. (2021b) | Event Extraction | Input: [CLS] Attack, Bombing, Arson, . . . [SEP T] (Document tokens): Several attacks were carried out in La Paz . . . [SEP] → Output: [CLS] Attack -T1 REEs- [SEP T] Bombing -T2 REEs- [SEP T]
Li et al. (2021c) | Event Argument Extraction | Input: Elliott testified that on April 15, McVeigh came into the body shop and -tgr- reserved -tgr- the truck, to be picked up at 4pm two days later → Output: Elliott bought, sold or traded truck to McVeigh in exchange for $280.32 for the benefit of -arg- at body shop place

Structure-linearized Texts:
Ren et al. (2021) | Joint Entity and Relation Extraction | Input: He was captured in Baghdad late Monday night → Output: “He” Type PER [SEP] “Baghdad” Type GPE PHYS “He”
Lu et al. (2021b) | Event Extraction | Input: The man returned to Los Angeles from Mexico → Output: ((Transport returned (Artifact The man) (Destination Los Angeles) (Origin Mexico))

Ranking Input-Output Pairs:
Nogueira dos Santos et al. (2020) | Answer Selection | Input: <bos>Ice formations in the Titlis glacier cave <boq>How are glacier cave formed <eoq> → Output: 0.5
Nogueira et al. (2020) | Document Retrieval | Input: How are glacier cave formed [Q] A glacier cave is a cave formed within the ice of a glacier [D] → Output: True
De Cao et al. (2021) | Entity Retrieval | Input: Superman saved [START] Metropolis [END] → Output: Metropolis (comics) | Metropolis (1927 film)
Cui et al. (2021) | Named Entity Recognition | Input: ACL will be held in Bangkok → Output: Bangkok is a location

Table 5: A summary of methods reformulating NLP tasks as generation tasks solved by PLMs.
augmented text generation allows multi-task learn- labels within y. A few examples are included in
ing where a single generative model can be trained Table 5 in the “Generating Word Indices” rows.
to simultaneously perform multiple tasks of differ- Yan et al. (2021b) explores an index generation
ent natures. Paolini et al. (2021) and Athiwaratkun idea for NER that can naturally handle different
et al. (2020) show that learning from multiple settings, e.g. flat, nested, and discontinuous NER.
tasks with a single model can improve the perfor- Given the input sequence x = [x1 , x2 , . . . , xn ],
mance on the individual tasks. Furthermore, label- the output sequence y is formed via the indices:
augmented text generation also shows impressive y = [s11 , e11 , . . . , s1k1 , e1k1 , t1 , . . . , si1 , ei1 , . . . ,
performance in few-shot learning settings (Paolini siki , eiki , ti ] where s and e indicates the start
et al., 2021), improving the data efficiency. and end indexes of a span. The spans for the
i-th name in x are represented by the tuple
4.2 Generating Word Indices [si1 , ei1 , . . . , siki , eiki , ti ] where ti is the index of
For many text understanding problems (e.g. span the entity type and ki is the number of text spans for
tagging problems such as NER), the generative the i-th name (a name can have multiple spans due
PLM must not generate words that are not in the to the consideration of discontinuous names). As
input text, other than markers or labels as shown in such, sij and eij should be between 1 and n while
the example in Section 4.1. Restricting the PLMs to the entity types can be indexed from n + 1 (i.e.,
consider only words in the input text as candidates ti > n). To compute the hidden vectors at decod-
at decoding (text generation) time enforces this ing time, the representations for the span indices
constraint. can be obtained from the representations of the cor-
An alternative approach is to directly generate responding words in the input sentence x (i.e., via
indices of the words of interest in the input text. pointer networks (Vinyals et al., 2015)). BART is
Given the input x, the output sequence y provides used as the base model for the index generation for
a sequence of index numbers referring to the po- NER.
sitions of words in x. Label indices encode class Similarly, Yan et al. (2021a) generates indices
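A small sketch of this index-based target construction, under the assumption that type indices are assigned positions n + 1, n + 2, ... after the n input tokens (the tokens, spans, and type names below are illustrative, not from a real dataset).

    def encode_spans(tokens, entities, type_vocab):
        # entities: list of (spans, entity_type), where spans is a list of
        # 1-based (start, end) index pairs; an entity may have several spans.
        n = len(tokens)
        y = []
        for spans, etype in entities:
            for start, end in spans:
                y.extend([start, end])
            y.append(n + 1 + type_vocab.index(etype))  # type index is offset past n
        return y

    tokens = ["have", "muscle", "pain", "and", "fatigue"]
    entities = [([(2, 3)], "disorder"), ([(5, 5)], "disorder")]
    print(encode_spans(tokens, entities, ["disorder"]))  # -> [2, 3, 6, 5, 5, 6]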
of the spans of interest for variations of the aspect- templates in the form of who did what to whom
based sentiment analysis (ABSA) task, including where and when.
aspect term extraction, opinion term extraction, A template defines the appropriate relationship
aspect-level sentiment classification and aspect- and order for the spans and labels for generation,
oriented opinion extraction. Finally, casting a prob- forming the output sequence y. Du et al. (2021b)
lem into an index generation task is also proposed explores the template filling idea for an IE task:
for semantic parsing (i.e. filling slots) (Rongali given a document, a model must identify event
et al., 2020). The output sequence in this work templates/types (via trigger words) and entity men-
starts with the intent, followed by slot names and tion fillers for the argument roles. A sequence-
the index sequences of the words in the input for to-sequence model for template filling takes the
the slots. At decoding time, each step produces possible event types concatenated with words in
a distribution over the word indexes in the input the input document x as the input, and outputs a
sentence (as a pointer network) and the vocabulary sequence of tuples. Each tuple corresponds to a
for slots and intents in the datasets. detected event template, starting with an event type
and followed by the text span fillers for the roles in
4.3 Generating Answers the input document (following an order). The roles
This strategy is designed mainly for the QA task. with no fillers are associated with null. (Zhang
The basic idea is to fine-tune PLMs to generate an- et al., 2021a) also examines a similar approach of
swers for the QA problems of interest. Wang et al. tuple generation for ABSA.
(2021a) use BART for closed-book QA that aims to The template filling methods can also introduce
directly provide answers for input questions. They additional information into the templates to aid the
show that BART struggles on a version of SQuAD label generation process, such as natural descrip-
for closed-book QA where the test and training tions or definitions of the labels. In particular, Li
data do not have much question and answer over- et al. (2021c) pursue a general template filling ap-
lap. It also shows that BART cannot remember proach for document-level event argument extrac-
knowledge from the fine-tuning data if there are tion: given an event trigger in an input document,
many training passages for fine-tuning. Sugges- find entity mentions to fill in the roles for the event.
tions to address those issues include decoupling the A conditional generative model (e.g. BART) is
knowledge memorization and QA fine-tuning, and employed for argument extraction where the input
forcing the model to recall relevant knowledge in (the condition) to the model is created by combin-
the answer generation step. ing an unfilled template and the document context.
The template is essentially a sentence describing the event type, augmented with placeholders for argument role fillers. The output sequence y is a filled template where placeholders are replaced by concrete arguments (entity mentions). To avoid entity type mismatch for arguments, the templates in the inputs are also appended with sentences that indicate entity types for arguments (e.g. arg1 is a person), which can be used to re-rank the output sequences to follow the type constraints. Below is an example input x, in which a template over a list of event arguments arg1, ..., arg6 and the document text DOC TEXT are concatenated, and output y, in which the underlined text spans are fillers from DOC TEXT, from Li et al. (2021c):

x = ⟨s⟩ ⟨arg1⟩ bought, sold, or traded ⟨arg3⟩ to ⟨arg2⟩ in exchange for ⟨arg4⟩ for the benefit of ⟨arg5⟩ at ⟨arg6⟩ place. ⟨s⟩ ⟨/s⟩ DOC TEXT ⟨/s⟩
y = Elliott bought, sold or traded truck to McVeigh in exchange for $280.32 for the benefit of ⟨arg⟩ at body shop place.

Hsu et al. (2021) apply answer generation to the problem of answer selection, in which the system must choose the correct answer from a provided candidate set (the question is also provided). Instead of training an answer selector (Han et al., 2021a), Hsu et al. (2021) use answer generation through fine-tuning PLMs such as T5 and BART, which consume the input question and the top answer candidates, then generate an answer for the question. To prepare training data for fine-tuning, the output answers might come from human annotators or be directly inherited from the provided correct answer (i.e. the correct answer will be removed from the input for the generative models and may be replaced by another answer candidate).

4.4 Filling templates
For many extraction tasks, the output consists of spans organized into one or several templates. For example, event extraction tasks require a system to extract templates in the form of who did what to whom, where, and when.

spans as leaves. The labeled tree is transformed into the output sequence y by depth-first traversal, where T5 is used to perform the conditional generation of y from x. To improve the model, a trie-based constrained decoding procedure (Chen et al., 2020a; De Cao et al., 2021) is introduced to ensure the generation of valid event structures. A trie (prefix-tree) determines possible candidates for the next generation step given the previously generated tokens, to guarantee valid output sequences. Lu et al. (2021b) also report the effectiveness of the generation-based model for extending models to extract new event types.

4.6 Ranking Input-Output Pairs

4.5 Generating Structure-Linearized Texts
Structure prediction problems in NLP typically require multiple prediction outputs for an input text x that are interconnected to form a single structure that represents the input. To cast structure prediction tasks as text generation problems, one approach involves linearizing the output structure to serve as the output sequence y. For example, tak-
directly generates the event structures y: from among many: answer selection in multiple
x = The man returned to Los choice-sytle QA, information retrieval, and certain
Angeles from Mexico following kinds of entity retrieval all provide a set of can-
his capture Tuesday by bounty didate answers to a posed query from which the
hunters. system selects the best one. Typically, a system will
rank the candidates in relation to the input query,
y = ((Transport returned a task at which PLMs can excel. The idea has its
(Artifact The man) roots in the classical literature on probabilistic mod-
(Destination Los Angeles) els for information retrieval that rank documents
(Origin Mexico)) (Arrest-Jail using language models (Ponte and Croft, 1998; Laf-
capture (Person The man) ferty and Zhai, 2001). Given an input query, a can-
(Time Tuesday) (Agent bounty didate document is scored in two steps: (i) training
hunters)) a language model on the candidate document, and
Graph traversal algorithms are often used to ac- (ii) computing the likelihood of generating the in-
complish the linearization in this approach. Ren put query from that language model, which serves
et al. (2021) study structure-linearization for joint as the candidate’s ranking score.
entity and relation extraction (Li et al., 2014; Miwa We now see the use of PLMs to per-
and Bansal, 2016). The main idea is to construct form generation-based ranking for selection.
an information graph for each input sentence to Nogueira dos Santos et al. (2020) apply the idea for
capture entity mentions, their entity types, and answer selection by fine-tuning generative models
relations. Depth or breath first traversal can be (GPT-2 or BART) over hanswer, questioni pairs,
used for graph linearization for y. To solve the thus learning to generate questions given correct
sequence-to-sequence problem for pairs of hx, yi, answer passages. The simplest approach is to
Ren et al. (2021) linearize the information graph fine-tune the models over only the positive pairs.
to an alternating sequence of nodes and edge types Nogueira dos Santos et al. (2020) also explore fine-
(given depth/breath first traversal), and directly gen- tuning with negative pairs using an unlikelihood
erate such sequences via a hybrid span decoder that objective or ranking-based objective (e.g. the hinge
decodes both the spans and the types recurrently. loss). At inference time, the ranking score for an
For event extraction with joint extraction of event input passage is obtained via the likelihood of the
triggers and arguments (Li et al., 2014; Nguyen fine-tuned PLM over the input question condition-
et al., 2016), a structure-linearization and text gen- ing on that passage.
eration approach comes from Lu et al. (2021b). Nogueira et al. (2020) approach the document
The authors first build a labeled tree to capture relevance ranking problem in a similar way. The pa-
the event types and argument roles in the sentence per concatenates the input query and each candidate
(i.e. event schema), with trigger and argument text document and feeds them as an input/condition for
a fine-tuned T5 model. To fine-tune T5, the model The studies presented below discuss, for various
is asked to generate “True” or “False” as the output downstream NLP tasks: approaches for fine-tuning
sequence, indicating the document’s relevance to PLMs to ensure they capture the key characteristics
the query. The probability of generating “True” is of the task when performing data generation; ap-
used as the ranking score for the candidate. propriate reformulation of the original training data
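A minimal sketch of this relevance-scoring recipe, assuming the Hugging Face transformers library and a t5-small checkpoint. In practice the model would first be fine-tuned to emit the relevance tokens for labeled pairs; the prompt format and the lower-case "true"/"false" targets below are assumptions for illustration.

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("t5-small")
    model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
    model.eval()

    def relevance_score(query: str, document: str) -> float:
        prompt = f"Query: {query} Document: {document} Relevant:"
        enc = tokenizer(prompt, return_tensors="pt")
        # Score only the first decoding step, starting from the decoder start token.
        decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
        with torch.no_grad():
            logits = model(**enc, decoder_input_ids=decoder_input_ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # Assumes "true"/"false" each map to a single sentencepiece token.
        true_id = tokenizer("true", add_special_tokens=False).input_ids[0]
        false_id = tokenizer("false", add_special_tokens=False).input_ids[0]
        return (probs[true_id] / (probs[true_id] + probs[false_id])).item()

    print(relevance_score("How are glacier caves formed?",
                          "A glacier cave is a cave formed within the ice of a glacier."))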
De Cao et al. (2021) address the entity retrieval for PLM fine-tuning and generation; and filtering
problem: given a set of Wikipedia articles repre- the new data for noise introduced by the generation
senting entities, return the entity that is most rel- process.
evant to a textual input source x. Each entity is Second, we discuss the use of auxiliary data
represented by its textual representation (e.g. the generated by PLMs to shed light on interesting
title of its Wikipedia article), which will be used aspects of NLP models. This approach plays a role
as the output sequence y for the generative models. in machine learning explainability by providing
BART is fine-tuned to rank the entities using the generations such as counterexamples, clarifying
generation likelihood P (y|x). Cui et al. (2021) ex- questions, context for answers, inference rules, and
plore generation-based ranking for NER, especially other insight-rich sequences.
in few-shot and cross-domain few-shot settings.
Given an input sentence and a text span, a template 5.1 Augmenting NLP Models with
is formed by concatenating the words in the span Automatically Generated Data
and an expression of the form “is a ⟨entity type⟩ entity”. Traditional approaches to data augmentation, in-
The original sentence and the template serve as an cluding generation via semi-supervised learning on
input-output pair in sequence-to-sequence models. large unlabeled data sets and synthesis with back-
BART is then employed to score this pair (using the translation or synonymous word replacement (Feng
probability of the template output produced by the et al., 2021; Chen et al., 2021a) were shown to be ef-
decoder of BART). For each span, the entity type fective for increasing NLP models’ accuracy and/or
corresponding to the template with highest score coverage. Newer studies show that PLMs can be
is selected. Original NER training data is used to also used as an effective method for data augmen-
create gold standard templates to fine-tune BART. tation (Zhang et al., 2020a; Yang et al., 2020; Peng
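A minimal sketch of scoring one candidate (span, type) template with a seq2seq PLM's decoder likelihood, assuming the Hugging Face transformers library and the facebook/bart-base checkpoint. In the approach above, BART is first fine-tuned on gold-standard templates, which this sketch omits.

    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
    model.eval()

    def template_score(sentence: str, span: str, entity_type: str) -> float:
        template = f"{span} is a {entity_type} entity"
        enc = tokenizer(sentence, return_tensors="pt")
        labels = tokenizer(template, return_tensors="pt").input_ids
        with torch.no_grad():
            # The returned loss is the mean token-level cross-entropy of the template,
            # so its negation acts as a length-normalized log-likelihood score.
            loss = model(**enc, labels=labels).loss
        return -loss.item()

    sentence = "ACL will be held in Bangkok"
    for etype in ["location", "person", "organization"]:
        print(etype, template_score(sentence, "Bangkok", etype))

For each span, the entity type whose template receives the highest score is selected, as described above.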
In addition to question answering, other genera- et al., 2020; Kumar et al., 2020; Anaby-Tavor et al.,
tive tasks have been shown to benefit from PLMs. 2020), requiring no significant change to the model
For instance, semantic parsing, generating a struc- architecture. The fluency of PLM text generations
ture representing the semantics of the sentence, is stand in contrast to the outcomes of traditional ap-
explored in a recent work by Shin et al. (2021). proaches that may produce less natural samples. As
Authors show that by reformulating the output of discussed in previous sections, the massive amount
PLMs the generated natural language can be used of linguistic knowledge accumulated by the PLM
to recover the semantic structure of the input text. allows for adaptation to many domains and tasks,
They use GPT-3 in the experiments. including those with very limited labeled data. The
vast knowledge may also produce a greater vari-
5 Data Generation via PLM ety of new examples, further improving the NLP
models trained on them.
In addition to using PLMs to perform NLP tasks We organize the discussion of data augmentation
directly, PLMs can be used to generate data that methods according to the NLP tasks they support.
can be used to enhance the performance of NLP
systems in two ways. Note that these data gener- 5.1.1 Information Extraction (IE)
ation approaches are complementary to the three Prior works explored synthetic data generation with
paradigms of PLM-for-NLP discussed in previous PLMs (Madaan et al., 2020; Bosselut et al., 2019)
sections. for a variety of IE tasks.
First, data generated by PLMs can be combined Veyseh et al. (2021a) and Veyseh et al. (2021b)
with original training data to improve NLP mod- use GPT-2 to produce synthetic labeled data for
els where training data is too sparse. Typically, event detection. Sentences in existing training
this is applied to create new labeled data to in- datasets are augmented with markers to indicate
crease diversity, enrich the models, and otherwise positions of event trigger words. The resulting la-
alleviate common limitations of hand-labeled data. beled sentences are used to fine-tune GPT-2 us-
ing the standard autoregressive next word pre- instance for QA models. To mitigate the noise in
diction (NWP) objective. Veyseh et al. (2021a) the generated data, Alberti et al. (2019) present a
shows that the fine-tuned GPT-2 model can gener- round trip consistency approach where a second
ate label-augmented data for different domains (e.g. generative model is trained to take the input pas-
newswire, cybersecurity); however, the generated sage C and generated question Q from the prior
data might include some noise, for instance, incor- step to produce an answer A′. The tuple (C, Q, A)
rect grammar, meaningless sentences, or incorrect is only retained as new training data if A′ == A.
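A minimal sketch of the round-trip consistency filter just described; qg and qa stand for a question-generation model and a QA model respectively, and both names and call signatures are placeholders rather than the components used by Alberti et al. (2019).

    def round_trip_filter(passage: str, answer: str, qg, qa):
        question = qg(passage=passage, answer=answer)        # generate Q from (C, A)
        predicted = qa(question=question, context=passage)   # answer Q against C -> A'
        if predicted.strip().lower() == answer.strip().lower():
            return question, answer                          # keep (C, Q, A) only if A' == A
        return None                                          # otherwise discard the sample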
annotations. To minimize the impact of the noisy Following a similar principle, Shakeri et al.
generated examples and maximize the benefits of (2020) explore synthetic data generation for cross-
the generated data, Veyseh et al. (2021a) and Vey- domain QA where models trained on a source do-
seh et al. (2021b) present a student-teacher network main (typically SQuAD) are evaluated on datasets
framework: the teacher network is trained on the from a different target domain. The paper aims to
original labeled data to obtain anchor knowledge, generate QA pairs in the target domain and com-
while the student is trained over the combination bine them with the source-domain training data to
of original and synthetic data, with constraints in- train improved QA models. The data generation
troduced to enforce consistency with the teacher’s model is also trained on the source domain dataset
learned anchor knowledge. The framework leads SQuAD using BART and GPT-2. Starting with a
to significant performance improvement over dif- passage as the context, the generative models di-
ferent datasets for event detection. rectly generate QA pairs. Generated QA pairs are
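A minimal sketch of sampling synthetic trigger-marked sentences, assuming the Hugging Face transformers pipeline API; the gpt2 checkpoint and the <trg> markers are placeholders for the fine-tuned model and marker scheme used in the works above.

    from transformers import pipeline

    # Stand-in for a GPT-2 model fine-tuned on trigger-marked training sentences.
    generator = pipeline("text-generation", model="gpt2")

    # During fine-tuning, triggers are wrapped in markers, e.g. "<trg> attacked </trg>";
    # sampling from the fine-tuned model then yields new marker-annotated sentences.
    samples = generator("Two soldiers were <trg>", max_new_tokens=30,
                        num_return_sequences=3, do_sample=True, top_p=0.95)
    for s in samples:
        print(s["generated_text"])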
Guo and Roth (2021) employ GPT-2 to generate filtered by the likelihood scores of the generative
synthetic labeled data for cross-lingual NER fol- models to reduce noise.
lowing the annotation projection approach: training The data generation idea is extended to multi-
data in a source language is translated and projected hop QA that requires combining disjoint pieces of
into a target language to train models. To project an- evidence to answer a question. In particular, Pan
notation, a training sentence in the source language et al. (2021b) aim to generate human-like multi-hop
is first translated into the target language using question–answer pairs to train QA models. The
word-to-word translation (via a dictionary). GPT-2 model consists of three components: operators,
is then fine-tuned to generate complete sentences reasoning graphs, and question filtration. Opera-
from the important words in target languages. A tors are atomic operations that are implemented by
hard-constrained generation mechanism is also en- rules or off-the-shelf pretrained models to retrieve,
coded into the decoding process of GPT-2 to ensure generate, or fuse relevant information from input
the appearance of the named entities in the origi- contexts. Approaches to fusing relevant informa-
nal source sentence in the automatically generated tion from across contexts include: fine-tuning a
sentences. T5 model on SQuAD to generate single-hop ques-
Synthetic data generation with GPT-2 is also ex- tions; generating descriptions of table entities with
plored for relation extraction in Papanikolaou and GPT-TabGen (Chen et al., 2020b); and combining
Pierleoni (2020). This paper fine-tunes GPT-2 over single-hop questions with sentences about the same
labeled examples of the same relation type, where entities to produce multi-hop questions via filling
each sentence in the training data is marked with in masked tokens of designed templates. Reason-
the two entity mentions in the corresponding rela- ing graphs then define different types of reasoning
tion. The fine-tuned model for each relation type chains for multi-hop QA using the operators as
is then leveraged to produce new training instances building blocks. Training QA pairs are generated
for that relation. by executing the reasoning graphs, which generate
output texts. Finally, question filtration removes
5.1.2 Question Answering (QA) irrelevant and unnatural QA pairs to produce the
Given an input paragraph C and a sampled ex- final generated training set for multi-hop QA. The
tractive short answer A in C, Alberti et al. (2019) filtration is done by choosing the samples ranked
attempts to generate a question Q using a sequence- as most fluent by GPT-2, and paraphrasing each
to-sequence Transformer (with BERT as its en- generated question using BART.
coder). The triple, consisting of the input para-
graph, the generated question, and the sampled 5.1.3 Sentiment Analysis (SA)
answer (C, Q, A), can be used as a new training Yu et al. (2021a) applies data augmentation for
aspect-based SA in the unsupervised domain adap- tity is selected in the original evidence in the first
tation setting, aiming to transform labeled datasets step of the process. To produce a “refute” claim,
in a source domain to a new target domain. The the work replaces the original answer with another
main approach involves two steps. In the first step entity in the generation process. Finally, to create a
of domain generalization, domain-specific words “not-enough-evidence” claim, the paper expands the
and phrases in the labeled source data and un- original evidence to include other paragraphs in the
labeled target data are identified and masked in same document and produce claims for some en-
the inputs. Opinion words for the source domain tity in the extended paragraph. Experiments show
and target-specific terms and opinion words are competitive results when the augmented data is
retrieved via sentiment lexicon and bootstrapping combined with few or even no human-labeled ex-
methods using relations in dependency trees. The amples for model training.
target-specific terms in the unlabeled data will be
masked to fine-tune BERT. In the second step of 5.1.5 Document Classification
domain specification, the source-specific terms in A typical approach to generating synthetic data
the source data are masked (thus producing domain- for text classification is to build a conditional gen-
independent texts) and sent into the fine-tuned erative model for each class by fine-tuning with
BERT to produce labeled sentences in the target do- labeled data from that class. While these models
main. Here, some constraints based on dictionaries can be fine-tuned with the next word prediction ob-
are necessary to ensure that the infilled words are jective with generative PLMs such as GPT-2, Liu
terms or opinion words with the same sentiment et al. (2020b) use reinforcement learning to train
polarity. The generated data can be used indepen- generative models to augment text classification
dently or combined with original source training labeled data. The rewards for training are based on
data to train a SA model for the target domain. the similarity between the generated tokens and a
Li et al. (2020b) use PLMs to generate synthetic salient lexicon of the target class computed via top
data for aspect term extraction (cast as a sequence frequency-based salient words, and the divergence
labeling problem). To fine-tune PLMs with the between the conditional and unconditional models.
sequence-to-sequence framework for this purpose, Liu et al. (2020b) demonstrate the effectiveness of
the input includes a masked sentence from a train- using the automatically generated data in multiple
ing dataset and the corresponding label sequence text classification problems and datasets, including
while the output are the masked tokens in the input. sentiment analysis and offense detection.
The fine-tuned PLMs are then exploited to generate
new possibilities for the masked tokens that can be 5.2 Generating Auxiliary Data to Improve
injected into the masked input, using the original Different Aspects of NLP Models
label sequence to obtain synthetic labeled data to The following sections, again arranged by task, dis-
train models. cuss ways of using PLM-generated text to aid in
auxiliary tasks, helping developers or users understand model strengths and weaknesses or decision-making characteristics.

5.1.4 Fact Verification
Fact verification aims to predict whether a given claim is supported, denied, or unresolved based on
each label category. To this end, Pan et al. (2021a) ing models for various NLP tasks, a remaining
employ a two-step approach to generate synthetic challenge to widespread adoption is the lack of
data for fact verification. In the first step of ques- explanations for the models’ decisions. This hin-
tion generation, given the evidence and an answer, ders the development and debugging process, as
a BART model, fine-tuned on the SQuAD dataset well as user trust. This is especially true for appli-
using the similar input-output format, generates a cation domains such as healthcare, security, and
question for that answer. Next, a question-to-claim online education. As such, a considerable number
model is employed to take the question and answer of approaches have been proposed for explaining
as inputs and generate a claim (also using a BART deep learning models’ behavior, including model-
model fine-tuned on SQuAD). To produce hclaim, intrinsic (Ribeiro et al., 2016; Lundberg and Lee,
evidencei pairs with the “support” relation, an en- 2017; Chen et al., 2018) and model-agnostic ap-
proaches (Park et al., 2018; Kim et al., 2018; Ling the generated counterexamples can be helpful to
et al., 2017). While model-intrinsic explanations improve the performance of the downstream mod-
expose internal model state (e.g. feature impor- els, e.g. for natural language inference, duplicate
tance or attention scores), in model-agnostic (post- question detection, and sentiment analysis.
hoc) methods, explanations are generated via the Other research is informing the task of natural
model predictions without inspecting the internal language explanation generation, where the goal is
state. Generative models are often applied for post- to expose the rationale behind the model decisions
hoc explanations, aiming to obtain either counterex- in automatically generated natural language text.
amples (Kim et al., 2016; Wachter et al., 2018; Wu Any approach must critically require that the gen-
et al., 2021a) or natural language texts (Camburu erated response is faithful to the model behavior.
et al., 2018; Kumar and Talukdar, 2020; Chen et al., To this end, Kumar and Talukdar (2020) propose to
2021c) for explaining purposes. first generate the explanations, and then employ the
explanations to obtain the final model predictions.
Generating counterexamples can shed light They use natural language inference as the task re-
on the decision boundaries of the models (i.e. quiring explanations. Label-specific GPT-2 models
explaining when a model changes its decision), are fine-tuned over concatenations of correspond-
thus improving intepretability. To this end, the ing premises, hypotheses, and human-provided ex-
generated counterexamples should be close to the planations, so that at inference, the model generates
decision boundaries so that small modifications an explanation based on premise and hypothesis.
result in changing the model predictions. Tradi- Next, the explanations together with the premise
tionally, heuristic rules applied to the original and the hypothesis are consumed by an explanation
inputs create likely counterexamples (Wachter processor model (e.g. RoBERTa) to select the most
et al., 2018; Ribeiro et al., 2018; Iyyer et al., 2018; likely label. This process obtains a more faithful
Li et al., 2021a). PLMs have been leveraged explanation for the label choice, compared to tradi-
to generate more diverse examples for better tional prediction-first approaches (Camburu et al.,
evaluation (Madaan et al., 2021b; Wu et al., 2018). However, this approach does not provide
2021a; Ross et al., 2021). In particular, Wu et al. explanations that reference non-selected labels. To
(2021a) proposes a method based on GPT-2 to address the question of why other labels are not cho-
generate counterfactuals that are close to the sen, Chen et al. (2021c) exploit counterexamples,
original sentences and entail specific relationships deriving them from original samples with heuristic
with the original, facilitating label induction (e.g. rules. The original samples and counterexamples
negation, insertion, shuffle). Concretely, an input are provided to GPT-2 to generate an explanation
sentence is concatenated with a relation label for the question “Why A not B”.
(e.g. negation) and a template consisting of the
special tokens [BLANK] to form the prompt for 5.2.2 Knowledge Extraction
GPT-2 model. For instance, for the sentence “It is Generative PLMs are pre-trained on massive text
great for kids” and the relation label “negate”, corpora containing a large amount of information
the following prompt is constructed: “It is about entities and commonsense knowledge. As
great for kids. [negation] It is such, PLMs might directly be used to elicit knowl-
[BLANK] great for [BLANK]. [SEP]”. edge required for downstream applications such
Next, the GPT-2 model generates answers as information extraction, sentiment analysis and
for the [BLANK] in the template (e.g. “not question answering. To this end, it is important
[ANSWER] children”, separated by the to properly prompt these models so their outputs
special token [ANSWER]). To fine-tune the GPT-2 contain the required information. Section 3.2 de-
model, non-parallel datasets (e.g. CommonGen, scribes the prompt design for knowledge extrac-
Natural Questions and SQuAD) are automatically tion/probing tasks, and in particular, the “Knowl-
processed to find the relations between pairs of edge Probing” subsection describes applications in
sentences and to construct the templates for each details. Here we focus on the text generation aspect
relation based on the obtained pairs. It is worth of knowledge extraction approaches.
noting that the sentences generated by GPT-2 Prior works can be categorized into two sub-
might have the same label as the original input categories. The first category involves prompting
sentence. In addition, Wu et al. (2021a) show that PLMs with partial knowledge via a prompt and
5.2.2 Knowledge Extraction

Generative PLMs are pre-trained on massive text corpora containing a large amount of information about entities and commonsense knowledge. As such, PLMs might directly be used to elicit knowledge required for downstream applications such as information extraction, sentiment analysis and question answering. To this end, it is important to properly prompt these models so that their outputs contain the required information. Section 3.2 describes the prompt design for knowledge extraction/probing tasks, and in particular, the "Knowledge Probing" subsection describes applications in detail. Here we focus on the text generation aspect of knowledge extraction approaches.

Prior works can be categorized into two sub-categories. The first category involves prompting PLMs with partial knowledge via a prompt and asking the models to complete the prompt. Specifically, pre-defined templates can be designed and filled with partial knowledge (e.g. the two entities involved in a relation), and the generative PLMs can predict the missing words in the templates (e.g. the relation type between the two entities). The templates can be fixed (Goswami et al., 2020) or they can be dynamically constructed by a pre-trained model (Shin et al., 2020) (further details are in Section 3.2).
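A minimal illustration of this first category is given below, here with a masked language model for simplicity (a generative PLM would instead complete a left-to-right prompt). The template and the entity pair are invented for illustration; works such as Goswami et al. (2020) and Shin et al. (2020) use task-specific or automatically constructed templates.

```python
# Sketch of prompting a PLM with partial knowledge: a fixed cloze template is
# filled with an entity pair and the model ranks candidates for the missing slot.
# The template and entities are hypothetical illustrations.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

template = "{head} was {mask} in {tail}."          # hypothetical relation template
prompt = template.format(head="Dante", tail="Florence",
                         mask=fill_mask.tokenizer.mask_token)

for candidate in fill_mask(prompt, top_k=3):
    # each candidate carries the predicted token and its probability
    print(candidate["token_str"], round(candidate["score"], 3))
```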
The second category instead proposes to prompt the PLMs with full knowledge and ask the models to generate a natural language text to describe that knowledge. This task is known as Data-to-Text (Kukich, 1983), and the goal is to obtain a textual description of existing knowledge bases. The generated textual descriptions can be used by downstream applications such as knowledge probing (Petroni et al., 2019) or QA (Agarwal et al., 2021), among others.

Agarwal et al. (2021) introduce a model based on T5 to convert Wikidata knowledge graphs (with triples of relations between two entities) into textual data. The proposed approach consists of three stages. First, create a large but noisy training dataset using distant supervision for relation extraction by aligning knowledge base (KB) triples to Wikipedia texts. Next, fine-tune T5 in stages, starting with the distantly supervised dataset for better coverage, then moving on to a small clean dataset for less hallucination. The model learns to generate descriptive sentences from KB triples. Last, build a filter for the generated texts based on semantic quality with respect to the KB triples, by scoring the concatenation of input and output with BERT.
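The sketch below illustrates the data-to-text direction: a KB triple is linearized into a seq2seq input and a T5-style model generates a sentence. The t5-small checkpoint, the task prefix and the linearization format are illustrative assumptions rather than the released pipeline of Agarwal et al. (2021), which relies on the fine-tuning and filtering stages described above.

```python
# Sketch of the data-to-text direction: a KB triple is linearized into a
# seq2seq input and a T5-style model generates a descriptive sentence.
# The `t5-small` checkpoint, task prefix and linearization format are
# placeholders; Agarwal et al. (2021) fine-tune T5 on aligned Wikidata data.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def linearize(subj: str, rel: str, obj: str) -> str:
    # one simple way to flatten a triple; the exact format is a design choice
    return f"triple to text: <subj> {subj} <rel> {rel} <obj> {obj}"

inputs = tokenizer(
    linearize("Marie Curie", "award received", "Nobel Prize in Physics"),
    return_tensors="pt",
)
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```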
5.2.3 Question Generation

While PLMs can be directly used for generating answers to questions, they might also be helpful to support existing QA systems. Specifically, PLMs can be employed to provide clarification for downstream QA systems. The clarification can be realized in terms of question clarification when the question is ambiguous, or it can be fulfilled by providing more context. For instance, in Gao et al. (2021b) and Min et al. (2020), multi-step question generation approaches are proposed for ambiguous QA in which the BART model is prompted with an ambiguous question and the top similar passages retrieved from the document collection to generate candidate answers. If multiple answers are generated, another BART model is employed to generate a disambiguation question for each answer. The newly generated questions are later used to extract other candidate answers. Finally, the generated answer-question pairs are ranked to select the top one for the ambiguous QA problem. Min et al. (2020) show that the process of generating auxiliary disambiguation questions could further help the models to encode the interactions between the original input question and the candidate answers.
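The following sketch shows the shape of the first step of such a pipeline, prompting a seq2seq model with the question concatenated with retrieved passages to decode several candidate answers. The untuned facebook/bart-base checkpoint and the plain-text concatenation are placeholders; the cited systems fine-tune BART on ambiguous-QA data and use their own input formatting.

```python
# Schematic prompt for ambiguous open-domain QA: the question is concatenated
# with top retrieved passages and a seq2seq model decodes candidate answers.
# `facebook/bart-base` is an untuned placeholder; the cited systems fine-tune
# BART for this task and use their own input formatting.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

question = "Who wrote the novel Dune?"                        # illustrative query
passages = [
    "Dune is a 1965 science fiction novel by Frank Herbert.",
    "Brian Herbert co-wrote several Dune prequels.",
]

source = question + " " + " ".join(passages)                  # question + evidence
inputs = tokenizer(source, return_tensors="pt", truncation=True)
candidates = model.generate(**inputs, num_beams=4,
                            num_return_sequences=4, max_new_tokens=16)
for cand in candidates:
    print(tokenizer.decode(cand, skip_special_tokens=True))
```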
In another line of work, Mao et al. (2021) seek to generate clarification texts for input questions to improve the retrieval quality in open-domain QA (answering factoid questions without a pre-specified domain). The most common approach for this problem involves a retriever-reader architecture (Chen et al., 2017), which first retrieves a small subset of documents in the pool using the input question as the query and then analyzes the retrieved documents to extract (or generate) an answer. To generate augmented texts for the input question in the first retrieval component, Mao et al. (2021) fine-tune BART to consume the input question and attempt to produce the answer and the sentence or title of the paragraph containing the answer. This method demonstrates superior performance for both retrieval and end-to-end QA.

In addition to clarification information, PLMs can also be used to paraphrase questions to support QA models. Mass et al. (2020) explore the problem of FAQ retrieval, retrieving the top QA pair given a user query. Based on the returned QA pairs (q, a) from a retrieval system, this work proposes an unsupervised method to re-rank the pairs to improve the performance. One of the ranking scores is a matching score between the question p in the pair (q, a) with respect to the user question. A triplet network is trained over the tuples (p, q, q′), where q is a paraphrase of the question p while q′ is a randomly selected question from another QA pair. To this end, Mass et al. (2020) fine-tune GPT-2 over the concatenations of the corresponding answers and questions in the FAQ. The fine-tuned GPT-2 is then prompted with the answer a to produce a paraphrase q′ for q in the ranking network.
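The paraphrase step can be sketched as follows, assuming the transformers library, an untuned gpt2 checkpoint and an illustrative [SEP] separator; Mass et al. (2020) fine-tune GPT-2 on the FAQ's answer-question concatenations before prompting it this way.

```python
# Sketch of the paraphrase step of Mass et al. (2020): GPT-2 is fine-tuned on
# "answer [SEP] question" strings built from the FAQ and is then prompted with
# an answer to produce a paraphrased question for the triplet ranker.
# The separator and the untuned `gpt2` checkpoint are placeholders.
from transformers import pipeline

SEP = "[SEP]"
generator = pipeline("text-generation", model="gpt2")

def make_training_example(answer: str, question: str) -> str:
    # format of the fine-tuning examples derived from the FAQ pairs
    return f"{answer} {SEP} {question}"

def paraphrase_question(answer: str) -> str:
    # at inference time, prompt with the answer and read off the generated question
    prompt = f"{answer} {SEP} "
    output = generator(prompt, max_new_tokens=20, do_sample=True)[0]["generated_text"]
    return output[len(prompt):].split("\n")[0].strip()

print(make_training_example("Reset your password from the account settings page.",
                            "How do I reset my password?"))
print(paraphrase_question("Reset your password from the account settings page."))
```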
5.2.4 Inference Rule Generation

For some applications, it is important to understand the process by which the final predictions of the models are obtained. These intermediate inference rules are another form of model explanation and provide insights for improving model performance.

Paul and Frank (2021) exploit GPT-2 to perform narrative story completion: given a few sentences of a story, the goal is to complete the story using sentences that logically follow the narrative in the given incomplete story. In an incremental generation method, each step seeks to generate a contextualized inference rule conditioned on the current incomplete story. To accomplish this, GPT-2 is fine-tuned on human annotations of story-line inferences. Next, given the current story and the generated inference rule, a new sentence for the story is generated (using another fine-tuned GPT-2 model). By interspersing the inference rules, the storyline generations should create a coherent story that follows logical connections and causal relationships between events.
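A schematic version of this incremental loop is sketched below. The generic gpt2 checkpoint stands in for the two fine-tuned models, and the "Inference:"/"Next:" markers are illustrative prompt conventions, not those of Paul and Frank (2021).

```python
# Schematic version of the incremental generation loop: one model proposes a
# contextualized inference rule for the current partial story, and a second
# model generates the next sentence conditioned on the story plus that rule.
# Both pipelines use the generic `gpt2` checkpoint as stand-ins for the
# fine-tuned models of Paul and Frank (2021).
from transformers import pipeline

rule_model = pipeline("text-generation", model="gpt2")
sentence_model = pipeline("text-generation", model="gpt2")

def generate(model, prompt: str, n_tokens: int = 25) -> str:
    output = model(prompt, max_new_tokens=n_tokens)[0]["generated_text"]
    return output[len(prompt):].split("\n")[0].strip()

story = ["Mia found a stray kitten on her porch.", "She decided to take it inside."]
for _ in range(2):                                  # extend the story incrementally
    rule = generate(rule_model, " ".join(story) + " Inference: ")
    next_sentence = generate(sentence_model, " ".join(story) + f" ({rule}) Next: ")
    story.append(next_sentence)
print(" ".join(story))
```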
Madaan et al. (2021a) employ T5 to generate inference graphs for defeasible inference (Rudinger et al., 2020). In this mode of reasoning, given a premise, a hypothesis may be weakened or overturned in light of new evidence. As training for this problem requires a large amount of human-annotated inference graphs, they propose to exploit reasoning graphs in related tasks to fine-tune T5. In particular, this work leverages the influence graphs in the WIQA dataset, which includes a set of procedural passages, each accompanied by a human-curated influence graph. The influence graphs are linearized to fit into the seq2seq framework for fine-tuning T5, which afterward produces inference graphs for defeasible inference. It has been shown that the generated inference graphs can improve human accuracy on defeasible inference (which is originally challenging for humans).
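The linearization step can be illustrated with a small sketch; the node labels and the bracketed edge format below are illustrative choices, not the WIQA or Madaan et al. (2021a) format.

```python
# Sketch of linearizing an influence graph into a single string so that a
# seq2seq model such as T5 can be fine-tuned to produce such graphs.
# The node labels and the "-effect->" edge notation are illustrative choices.
from typing import List, Tuple

Edge = Tuple[str, str, str]   # (source node, effect, target node)

def linearize_graph(premise: str, edges: List[Edge]) -> str:
    # flatten the graph as "premise: ... | node -effect-> node | ..."
    parts = [f"premise: {premise}"]
    parts += [f"{src} -{effect}-> {dst}" for src, effect, dst in edges]
    return " | ".join(parts)

graph = [("heavy rain", "more", "soil erosion"),
         ("planting trees", "less", "soil erosion")]
print(linearize_graph("A hillside is exposed to the weather.", graph))
# The resulting string would serve as the target sequence when fine-tuning T5
# to generate inference graphs from a premise and new evidence.
```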
6 Discussion

Mix of paradigms or PLMs. The three paradigms presented in this paper are by no means mutually exclusive. Instead, it is not rare to see approaches that use two or three paradigms together: fine-tuning techniques are often used as part of prompt-based methods; NLP-as-text-generation approaches often use carefully crafted templates (prompts); and prompt-based learning often leverages the text generation capabilities of PLMs to generate words, phrases, or sentences.

A representative example is Khashabi et al. (2020), which combined three paradigms: appropriate prompts from the context and questions help to formulate several QA tasks into a unified text generation problem with seq2seq-based pre-trained models such as T5, with model fine-tuning to improve performance in several QA tasks.

As independently trained models, PLMs are also by no means mutually exclusive. For example, ACE (Wang et al., 2021c) shows that combining multiple PLMs (e.g., ELMo, BERT, mBERT, XLM-R) yields further improvements over using a single PLM for a range of NLP tasks. Investigation of the complementarity of different PLMs is a future research direction.

From another perspective, the design of the training for MLMs has been driven by the results on the fine-tuning paradigm, but it is not clear whether an exploration of different training objectives could lead to PLMs that are more effective when used with prompting or generation to solve NLP tasks.

How much unlabeled data is needed? While PLMs are usually trained on billions of words, some works have investigated what can be learned with less pre-training data. Zhang et al. (2021b), experimenting on RoBERTa models trained on 1M, 10M, 100M and 1B words (Warstadt et al., 2020b, MiniBERTas), showed that 10M to 100M words are sufficient to acquire many syntactic and semantic features. Huebner et al. (2021) presented BabyBERTa, a RoBERTa-based model trained on language acquisition data that acquires grammatical knowledge comparable to that of pre-trained RoBERTa-base, and does so with approximately 15x fewer parameters and 6,000x fewer words. On the other hand, Zhang et al. (2021b), using the pre-train then fine-tune paradigm for NLU tasks, found that millions of words are not sufficient for key NLU skills, which instead may require billions of words and continue to improve with additional pre-training data.

How much labeled data is still needed? While Le Scao and Rush (2021) present experiments to quantify the impact of prompts, there has been little work in designing rigorous experiments to study how many labeled examples are required by PLMs to achieve various levels of performance for a range of NLP tasks, using each of the three paradigms outlined in this survey. Such studies will provide a better understanding of the pros and cons of each formulation, including cost-benefit analyses weighing the impact of more labeled data, helping developers design NLP systems that achieve the desired goal while minimizing human labeling effort.
Can we reduce the amount and cost of computation? The development of deep learning in general and the use of PLMs in particular have dramatically increased the amount of computation used in NLP, leading to a high environmental footprint. Schwartz et al. (2020) argue for Green AI, suggesting that we should consider efficiency, measured by the number of floating-point operations used to generate a result, as a main evaluation criterion, together with accuracy. Green AI also aims to reduce the financial cost of the computation. In line with this approach, Izsak et al. (2021) propose software optimization and design choices for pre-training BERT in 24 hours using a single low-end deep learning server.

Do PLMs excel at semantic understanding or memorization? Another interesting avenue to explore is separating extraction or text understanding from memorization. To what extent can PLMs memorize facts and extract an answer from a provided passage (understanding a text), for knowledge-intensive tasks such as Question Answering (QA) and Information Retrieval (IR)? This is motivated by the observation by Wang et al. (2021a) that PLMs are terrible at remembering training facts with high precision and that it is also challenging for them to answer closed-book questions even if relevant knowledge is retained.

Is explicit linguistic information needed? A related debate is whether a symbolic annotation covering syntax or semantics should be integrated to improve the performance of a PLM-based system, or whether this information is already present in the model. Below we list some successes in leveraging syntax or semantics, though there is no definite answer yet. In terms of syntax, Xu et al. (2021) utilize automatically produced syntax in both the pre-training and fine-tuning stages, and show improved performance on several benchmark datasets. Nguyen et al. (2020b) and Sachan et al. (2021) inject syntax only in the fine-tuning stage. Regarding semantics, Zhang et al. (2020d) incorporate Semantic Role Labeling predictions into the pre-training procedure of BERT, improving the performance on textual entailment and QA tasks. Wu et al. (2021b) integrate semantic information into the task-specific fine-tuning stage, focusing on the DELPHIN dependencies formalism or "DM" (Ivanova et al., 2012). Experimenting on RoBERTa, they obtained improvements on the GLUE benchmark. Syntax and semantics can also be jointly integrated, as in Zhou et al. (2020a), where multi-task learning was used to combine BERT pre-training with both semantic and syntactic parsing tasks, improving the performance on the GLUE benchmark.

Can we integrate implicit semantic information using QA? Instead of enriching PLMs with symbolic annotations, a possible alternative for a supervision signal is QA data, as it is easier to answer questions relative to a sentence than to annotate linguistic phenomena in it (Roth, 2017; He et al., 2020). In the s-QuASE PLM presented in He et al. (2020), further pre-training of BERT on QA datasets is done while restricting the interaction between the question and context inputs. s-QuASE is particularly useful in single-sentence tasks such as Semantic Role Labeling and NER. A similar direction was pursued by Jia et al. (2021), who leveraged question generation and knowledge distillation to build a QA-based pre-training objective.

Do PLMs need meaningful prompts? The success of prompts in zero- and few-shot learning has been attributed to the prompts serving as instructions that allow the PLM to learn with fewer examples, much the way humans would (Mishra et al., 2021; Schick and Schütze, 2021a; Brown et al., 2020). In fact, the excellent results may instead be attributable to the mere exploitation of patterns in the training data of PLMs, and not to PLMs' perceived ability to interpret and follow meaningful instructions. Webson and Pavlick (2021) show, for instance, that irrelevant templates match the performance of meaningful ones in few-shot entailment experiments, adding that some of the templates discovered by automatic generation of discrete prompts are also unnatural (Shin et al., 2020). In this sense, the results of continuous prompts also show that PLMs do not need meaningful instructions to improve few-shot performance.
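As a toy illustration of this comparison, the snippet below renders the same NLI example with a meaningful instruction-style template and with an irrelevant one that keeps the yes/no verbalizer; both templates are invented here and merely mirror the kind of contrast studied by Webson and Pavlick (2021).

```python
# Toy contrast between a meaningful instruction-style template and an
# irrelevant one for the same NLI example; both templates are invented here
# and only mirror the kind of comparison reported by Webson and Pavlick (2021).
premise = "A man is playing a guitar on stage."
hypothesis = "A musician is performing."

meaningful = (f'Premise: "{premise}" Hypothesis: "{hypothesis}" '
              "Does the premise entail the hypothesis? Answer yes or no:")
irrelevant = (f'"{premise}" "{hypothesis}" '
              "Is the sky blue on a clear day? Answer yes or no:")

for name, template in [("meaningful", meaningful), ("irrelevant", irrelevant)]:
    # either string would be fed to a PLM as the few-shot prompt
    print(f"[{name}] {template}")
```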
Theoretical and empirical analysis. The theoretical understanding of the paradigms presented in this survey is preliminary. Apart from the issues mentioned above, there is a lack of understanding of what actually makes these paradigms so successful, and whether their success can be generalized across models and languages. For instance, prompts may be PLM-dependent, or they may be transferable across models as indicated by Perez et al. (2021). There is very little work on studying the generalization of prompting and generation across languages, in the way that transfer learning has been applied to learning in one language and testing in another (Conneau et al., 2020).

7 Conclusion

In this paper, we present a survey of the three trending paradigms that use pre-trained language models for NLP. We describe each of them in depth, and summarize prior works whose applications have shown promise. In addition, we describe the use of pre-trained language models to automatically generate data that is used to improve performance in NLP tasks. We hope this survey will provide readers with key fundamental concepts and a comprehensive view of the paradigm shift.

Acknowledgments

This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600006 under the IARPA BETTER program and by Contracts FA8750-19-2-0201 and FA8750-19-2-1004 with the US Defense Advanced Research Projects Agency (DARPA). Approved for Public Release, Distribution Unlimited. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, the Department of Defense or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.

We would like to thank Paul Cummer for his insightful comments on this work.

References

Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, and Eneko Agirre. 2020. Give your text representation models some love: the case for Basque. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4781–4788, Marseille, France. European Language Resources Association.

Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, and Michael Collins. 2019. Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Zeyuan Allen-Zhu and Yuanzhi Li. 2021. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv preprint arXiv:2012.09816.

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Asaf Amrami and Yoav Goldberg. 2019. Towards better substitution-based word sense induction. arXiv preprint arXiv:1905.12598.

Ateret Anaby-Tavor, Boaz Carmeli, Esther Goldbraich, Amir Kantor, George Kour, Segev Shlomov, Naama Tepper, and Naama Zwerdling. 2020. Do not have enough data? Deep learning to the rescue! In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI).

Ben Athiwaratkun, Cicero Nogueira dos Santos, Jason Krone, and Bing Xiang. 2020. Augmented natural language for generative sequence labeling. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Geoff Bacon and Terry Regier. 2019. Does BERT agree? Evaluating knowledge of structure dependence through agreement relations. arXiv preprint arXiv:1908.09892.
Ankur Bapna and Orhan Firat. 2019. Simple, Scal- 58th Annual Meeting of the Association for Compu-
able Adaptation for Neural Machine Translation. In tational Linguistics, pages 1290–1301, Online. As-
Proceedings of the 2019 Conference on Empirical sociation for Computational Linguistics.
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan- Oana-Maria Camburu, Tim Rocktäschel, Thomas
guage Processing (EMNLP-IJCNLP), pages 1538– Lukasiewicz, and Phil Blunsom. 2018. e-snli: Nat-
1548, Hong Kong, China. Association for Computa- ural language inference with natural language expla-
tional Linguistics. nations. In Advances in Neural Information Process-
ing Systems.
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciB-
ERT: A Pretrained Language Model for Scientific José Cañete, Gabriel Chaperon, Rodrigo Fuentes, Jou-
Text. In Proceedings of the 2019 Conference on Hui Ho, Hojin Kang, and Jorge Pérez. 2020. Span-
Empirical Methods in Natural Language Processing ish pre-trained BERT model and evaluation data. In
and the 9th International Joint Conference on Natu- PML4DC Workshop at ICLR.
ral Language Processing (EMNLP-IJCNLP), pages
3615–3620, Hong Kong, China. Association for Ilias Chalkidis, Manos Fergadiotis, Prodromos Malaka-
Computational Linguistics. siotis, Nikolaos Aletras, and Ion Androutsopoulos.
2020. LEGAL-BERT: The muppets straight out of
Eyal Ben-David, Nadav Oved, and Roi Reichart. 2021. law school. In Findings of the Association for Com-
Pada: A prompt-based autoregressive approach for putational Linguistics: EMNLP 2020, pages 2898–
adaptation to unseen domains. arXiv preprint 2904, Online. Association for Computational Lin-
arXiv:2102.12206. guistics.
Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Branden Chan, Stefan Schweter, and Timo Möller.
Stella Biderman. 2021. GPT-Neo: Large Scale 2020. German’s next language model. In Proceed-
Autoregressive Language Modeling with Mesh- ings of the 28th International Conference on Com-
Tensorflow. putational Linguistics, pages 6788–6796, Barcelona,
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chai- Spain (Online). International Committee on Compu-
tanya Malaviya, Asli Celikyilmaz, and Yejin Choi. tational Linguistics.
2019. Comet: Commonsense transformers for auto- Wanxiang Che, Longxu Dou, Yang Xu, Yuxuan Wang,
matic knowledge graph construction. arXiv preprint Yijia Liu, and Ting Liu. 2019. HIT-SCIR at MRP
arXiv:1906.05317. 2019: A unified pipeline for meaning representa-
Samuel R. Bowman, Gabor Angeli, Christopher Potts, tion parsing via efficient training and effective en-
and Christopher D. Manning. 2015. A large anno- coding. In Proceedings of the Shared Task on Cross-
tated corpus for learning natural language inference. Framework Meaning Representation Parsing at the
In Proceedings of the 2015 Conference on Empiri- 2019 Conference on Natural Language Learning,
cal Methods in Natural Language Processing, pages pages 76–85, Hong Kong. Association for Compu-
632–642, Lisbon, Portugal. Association for Compu- tational Linguistics.
tational Linguistics.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine
Samuel Broscheit. 2019. Investigating entity knowl- Bordes. 2017. Reading Wikipedia to answer open-
edge in BERT with simple neural end-to-end en- domain questions. In Proceedings of the 55th An-
tity linking. In Proceedings of the 23rd Confer- nual Meeting of the Association for Computational
ence on Computational Natural Language Learning Linguistics (ACL).
(CoNLL), pages 677–685, Hong Kong, China. Asso-
ciation for Computational Linguistics. Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal,
and Diyi Yang. 2021a. An empirical survey of data
Tom Brown, Benjamin Mann, Nick Ryder, Melanie augmentation for limited data learning in nlp. arXiv
Subbiah, Jared D Kaplan, Prafulla Dhariwal, preprint arXiv:2106.07499.
Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert- Jianbo Chen, Le Song, Martin J. Wainwright, and
Voss, Gretchen Krueger, Tom Henighan, Rewon Michael I. Jordan. 2018. Learning to explain: An
Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, information-theoretic perspective on model interpre-
Clemens Winter, Chris Hesse, Mark Chen, Eric tation. In Proceedings of the 35th International Con-
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, ference on Machine Learning (ICML).
Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei. Mark Chen, Jerry Tworek, Heewoo Jun, Qiming
2020. Language models are few-shot learners. In Yuan, Henrique Ponde de Oliveira Pinto, Jared Ka-
Advances in Neural Information Processing Systems, plan, Harri Edwards, Yuri Burda, Nicholas Joseph,
volume 33, pages 1877–1901. Curran Associates, Greg Brockman, Alex Ray, Raul Puri, Gretchen
Inc. Krueger, Michael Petrov, Heidy Khlaaf, Girish Sas-
try, Pamela Mishkin, Brooke Chan, Scott Gray,
Deng Cai and Wai Lam. 2020. AMR parsing via graph- Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
sequence iterative inference. In Proceedings of the Kaiser, Mohammad Bavarian, Clemens Winter,
Philippe Tillet, Felipe Petroski Such, Dave Cum- (Long and Short Papers), pages 2924–2936, Min-
mings, Matthias Plappert, Fotios Chantzis, Eliza- neapolis, Minnesota. Association for Computational
beth Barnes, Ariel Herbert-Voss, William Hebgen Linguistics.
Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie
Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, Kevin Clark, Urvashi Khandelwal, Omer Levy, and
William Saunders, Christopher Hesse, Andrew N. Christopher D. Manning. 2019b. What does BERT
Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan look at? an analysis of BERT’s attention. In Pro-
Morikawa, Alec Radford, Matthew Knight, Miles ceedings of the 2019 ACL Workshop BlackboxNLP:
Brundage, Mira Murati, Katie Mayer, Peter Welin- Analyzing and Interpreting Neural Networks for
der, Bob McGrew, Dario Amodei, Sam McCandlish, NLP, pages 276–286, Florence, Italy. Association
Ilya Sutskever, and Wojciech Zaremba. 2021b. Eval- for Computational Linguistics.
uating Large Language Models Trained on Code. Ronan Collobert and Jason Weston. 2008. A unified
arXiv:2107.03374 [cs]. ArXiv: 2107.03374. architecture for natural language processing: deep
neural networks with multitask learning. In Pro-
Pinzhen Chen, Nikolay Bogoychev, Kenneth Heafield, ceedings of the 25th international conference on Ma-
and Faheem Kirefu. 2020a. Parallel sentence min- chine learning, ICML ’08, pages 160–167, New
ing by constrained decoding. In Proceedings of the York, NY, USA. Association for Computing Machin-
58th Annual Meeting of the Association for Compu- ery.
tational Linguistics (ACL).
Alexis Conneau, Kartikay Khandelwal, Naman Goyal,
Qian Chen, Zhu Zhuo, and Wen Wang. 2019. Bert Vishrav Chaudhary, Guillaume Wenzek, Francisco
for joint intent classification and slot filling. arXiv Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
preprint arXiv:1902.10909. moyer, and Veselin Stoyanov. 2020. Unsupervised
cross-lingual representation learning at scale. arXiv
Qianglong Chen, Feng Ji, Xiangji Zeng, Feng-Lin Li, preprint arXiv:1911.02116.
Ji Zhang, Haiqing Chen, and Yin Zhang. 2021c.
KACE: Generating knowledge aware contrastive ex- Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad-
planations for natural language inference. In Pro- ina Williams, Samuel Bowman, Holger Schwenk,
ceedings of the 59th Annual Meeting of the Associa- and Veselin Stoyanov. 2018. XNLI: Evaluating
tion for Computational Linguistics and the 11th In- cross-lingual sentence representations. In Proceed-
ternational Joint Conference on Natural Language ings of the 2018 Conference on Empirical Methods
Processing (ACL). in Natural Language Processing, pages 2475–2485,
Brussels, Belgium. Association for Computational
Wenhu Chen, Jianshu Chen, Yu Su, Zhiyu Chen, and Linguistics.
William Yang Wang. 2020b. Logical natural lan-
guage generation from open-domain tables. In Pro- Samuel Coope, Tyler Farghly, Daniela Gerz, Ivan Vulić,
ceedings of the 58th Annual Meeting of the Associa- and Matthew Henderson. 2020. Span-ConveRT:
tion for Computational Linguistics (ACL). Few-shot span extraction for dialog with pretrained
conversational representations. In Proceedings of
Xiang Chen, Ningyu Zhang, Xin Xie, Shumin the 58th Annual Meeting of the Association for Com-
Deng, Yunzhi Yao, Chuanqi Tan, Fei Huang, putational Linguistics, pages 107–121, Online. As-
Luo Si, and Huajun Chen. 2021d. Knowprompt: sociation for Computational Linguistics.
Knowledge-aware prompt-tuning with synergistic
optimization for relation extraction. arXiv preprint Asa Cooper Stickland, Xian Li, and Marjan
arXiv:2104.07650. Ghazvininejad. 2021. Recipes for adapting
pre-trained monolingual and multilingual models
Yunmo Chen, Tongfei Chen, Seth Ebner, Aaron Steven to machine translation. In Proceedings of the
White, and Benjamin Van Durme. 2020c. Reading 16th Conference of the European Chapter of the
the manual: Event extraction as definition compre- Association for Computational Linguistics: Main
hension. In Proceedings of the Fourth Workshop on Volume, pages 3440–3453, Online. Association for
Structured Prediction for NLP, pages 74–83, Online. Computational Linguistics.
Association for Computational Linguistics. Corinna Cortes and Vladimir Vapnik. 1995. Support-
Avihay Chriqui and Inbal Yahav. 2021. Hebert & vector networks. Machine learning, 20(3):273–297.
hebemo: a hebrew bert model and a tool for polar- Leyang Cui, Yu Wu, Jian Liu, Sen Yang, and Yue
ity analysis and emotion recognition. arXiv preprint Zhang. 2021. Template-based named entity recog-
arXiv:2102.01909. nition using BART. In Findings of the Association
for Computational Linguistics (ACL-IJCNLP).
Christopher Clark, Kenton Lee, Ming-Wei Chang,
Tom Kwiatkowski, Michael Collins, and Kristina Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shi-
Toutanova. 2019a. BoolQ: Exploring the surprising jin Wang, and Guoping Hu. 2020. Revisiting pre-
difficulty of natural yes/no questions. In Proceed- trained models for Chinese natural language process-
ings of the 2019 Conference of the North American ing. In Findings of the Association for Computa-
Chapter of the Association for Computational Lin- tional Linguistics: EMNLP 2020, pages 657–668,
guistics: Human Language Technologies, Volume 1 Online. Association for Computational Linguistics.
Ido Dagan, Dan Roth, Mark Sammons, and Fabio Mas- Avia Efrat and Omer Levy. 2020. The turking test: Can
simo Zanzotto. 2013. Recognizing textual entail- language models understand instructions? arXiv
ment: Models and applications. Synthesis Lectures preprint arXiv:2010.11982.
on Human Language Technologies, 6(4):1–220.
Dumitru Erhan, Yoshua Bengio, Aaron Courville,
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Car- Pierre-Antoine Manzagol, Pascal Vincent, and Samy
bonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Bengio. 2010. Why Does Unsupervised Pre-training
Transformer-XL: Attentive Language Models be- Help Deep Learning? Journal of Machine Learning
yond a Fixed-Length Context. In Proceedings of Research, 11(19):625–660.
the 57th Annual Meeting of the Association for Com-
putational Linguistics, pages 2978–2988, Florence, Allyson Ettinger. 2020. What BERT is not: Lessons
Italy. Association for Computational Linguistics. from a new suite of psycholinguistic diagnostics for
language models. Transactions of the Association
Yann N Dauphin, Angela Fan, Michael Auli, and David for Computational Linguistics, 8:34–48.
Grangier. 2017. Language modeling with gated con-
volutional networks. In International conference on Mehrdad Farahani, Mohammad Gharachorloo,
machine learning, pages 933–941. PMLR. Marzieh Farahani, and Mohammad Manthouri.
2021. Parsbert: Transformer-based model for
Nicola De Cao, Gautier Izacard, Sebastian Riedel, and persian language understanding. Neural Processing
Fabio Petroni. 2021. Autoregressive entity retrieval. Letters.
In Proceedings of the 9th International Conference
on Learning Representations (ICLR). Joshua Feldman, Joe Davison, and Alexander M. Rush.
2019. Commonsense knowledge mining from pre-
Pieter Delobelle, Thomas Winters, and Bettina Berendt. trained models. arXiv preprint arXiv:1909.00505.
2020. RobBERT: a Dutch RoBERTa-based Lan-
guage Model. In Findings of the Association for Rui Feng, Jie Yuan, and Chao Zhang. 2020a. Prob-
Computational Linguistics: EMNLP 2020, pages ing and fine-tuning reading comprehension mod-
3255–3265, Online. Association for Computational els for few-shot event extraction. arXiv preprint
Linguistics. arXiv:2010.11325.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chan-
Kristina Toutanova. 2019. BERT: Pre-training of dar, Soroush Vosoughi, Teruko Mitamura, and Ed-
deep bidirectional transformers for language under- uard Hovy. 2021. A survey of data augmentation
standing. In Proceedings of the 2019 Conference approaches for NLP. In Findings of the Association
of the North American Chapter of the Association for Computational Linguistics (ACL-IJCNLP).
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan,
pages 4171–4186, Minneapolis, Minnesota. Associ- Xiaocheng Feng, Ming Gong, Linjun Shou, Bing
ation for Computational Linguistics. Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020b.
CodeBERT: A pre-trained model for programming
Xinya Du and Claire Cardie. 2020. Event extrac- and natural languages. In Findings of the Associa-
tion by answering (almost) natural questions. In tion for Computational Linguistics: EMNLP 2020,
Proceedings of the 2020 Conference on Empirical pages 1536–1547, Online. Association for Compu-
Methods in Natural Language Processing (EMNLP), tational Linguistics.
pages 671–683, Online. Association for Computa-
tional Linguistics. Steven Fincke, Shantanu Agarwal, Scott Miller, and
Elizabeth Boschee. 2021. Language model prim-
Xinya Du, Luheng He, Qi Li, Dian Yu, Panupong Pa- ing for cross-lingual event extraction. arXiv preprint
supat, and Yuan Zhang. 2021a. QA-driven zero- arXiv:2109.12383.
shot slot filling with weak supervision pretraining.
In Proceedings of the 59th Annual Meeting of the Leo Gao, Stella Biderman, Sid Black, Laurence Gold-
Association for Computational Linguistics and the ing, Travis Hoppe, Charles Foster, Jason Phang, Ho-
11th International Joint Conference on Natural Lan- race He, Anish Thite, Noa Nabeshima, et al. 2020a.
guage Processing (Volume 2: Short Papers), pages The pile: An 800gb dataset of diverse text for lan-
654–664, Online. Association for Computational guage modeling. arXiv preprint arXiv:2101.00027.
Linguistics.
Shuyang Gao, Sanchit Agarwal, Di Jin, Tagyoung
Xinya Du, Alexander Rush, and Claire Cardie. 2021b. Chung, and Dilek Hakkani-Tur. 2020b. From ma-
Template filling with generative transformers. In chine reading comprehension to dialogue state track-
Proceedings of the 2021 Conference of the North ing: Bridging the gap. In Proceedings of the 2nd
American Chapter of the Association for Computa- Workshop on Natural Language Processing for Con-
tional Linguistics: Human Language Technologies versational AI, pages 79–89, Online. Association for
(NAACL-HLT). Computational Linguistics.
Shuyang Gao, Abhishek Sethi, Sanchit Agarwal, Tagy- Karen Hambardzumyan, Hrant Khachatrian, and
oung Chung, and Dilek Hakkani-Tur. 2019. Dialog Jonathan May. 2021. WARP: Word-level Adversar-
state tracking: A neural reading comprehension ap- ial ReProgramming. In Proceedings of the 59th An-
proach. In Proceedings of the 20th Annual SIGdial nual Meeting of the Association for Computational
Meeting on Discourse and Dialogue, pages 264–273, Linguistics and the 11th International Joint Confer-
Stockholm, Sweden. Association for Computational ence on Natural Language Processing (Volume 1:
Linguistics. Long Papers), pages 4921–4933, Online. Associa-
tion for Computational Linguistics.
Tianyu Gao, Adam Fisch, and Danqi Chen. 2021a.
Making pre-trained language models better few-shot Rujun Han, Luca Soldaini, and Alessandro Moschitti.
learners. In Proceedings of the 59th Annual Meet- 2021a. Modeling context in answer sentence selec-
ing of the Association for Computational Linguistics tion systems on a latency budget. In Proceedings of
and the 11th International Joint Conference on Nat- the 16th Conference of the European Chapter of the
ural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics (EACL).
pages 3816–3830, Online. Association for Computa-
tional Linguistics. Xu Han, Weilin Zhao, Ning Ding, Zhiyuan Liu,
and Maosong Sun. 2021b. Ptr: Prompt tuning
Yifan Gao, Henghui Zhu, Patrick Ng, Cicero with rules for text classification. arXiv preprint
Nogueira dos Santos, Zhiguo Wang, Feng Nan, De- arXiv:2105.11259.
jiao Zhang, Ramesh Nallapati, Andrew O Arnold,
and Bing Xiang. 2021b. Answering ambiguous Adi Haviv, Jonathan Berant, and Amir Globerson.
questions through generative evidence fusion and 2021. BERTese: Learning to speak to BERT. In
round-trip prediction. In Proceedings of the 59th An- Proceedings of the 16th Conference of the European
nual Meeting of the Association for Computational Chapter of the Association for Computational Lin-
Linguistics (ACL). guistics: Main Volume, pages 3618–3623, Online.
Association for Computational Linguistics.
Yoav Goldberg. 2019. Assessing bert’s syntactic abili-
ties. arXiv preprint arXiv:1901.05287. Hangfeng He, Qiang Ning, and Dan Roth. 2020.
QuASE: Question-answer driven sentence encoding.
Ankur Goswami, Akshata Bhat, Hadar Ohana, and In Proceedings of the 58th Annual Meeting of the
Theodoros Rekatsinas. 2020. Unsupervised relation Association for Computational Linguistics, pages
extraction from language models using constrained 8743–8758, Online. Association for Computational
cloze completion. In Findings of the Association for Linguistics.
Computational Linguistics (EMNLP).
Luheng He, Kenton Lee, Omer Levy, and Luke Zettle-
Kristina Gulordava, Piotr Bojanowski, Edouard Grave, moyer. 2018. Jointly predicting predicates and argu-
Tal Linzen, and Marco Baroni. 2018. Colorless ments in neural semantic role labeling. In Proceed-
green recurrent networks dream hierarchically. In ings of the 56th Annual Meeting of the Association
Proceedings of the 2018 Conference of the North for Computational Linguistics (Volume 2: Short Pa-
American Chapter of the Association for Computa- pers), pages 364–369, Melbourne, Australia. Asso-
tional Linguistics: Human Language Technologies, ciation for Computational Linguistics.
Volume 1 (Long Papers), pages 1195–1205, New
Orleans, Louisiana. Association for Computational Charles T. Hemphill, John J. Godfrey, and George R.
Linguistics. Doddington. 1990. The ATIS spoken language sys-
tems pilot corpus. In Speech and Natural Language:
Demi Guo, Alexander M. Rush, and Yoon Kim. Proceedings of a Workshop Held at Hidden Valley,
2021. Parameter-Efficient Transfer Learning with Pennsylvania, June 24-27,1990.
Diff Pruning. arXiv:2012.07463 [cs]. ArXiv:
2012.07463. Dan Hendrycks, Xiaoyuan Liu, Eric Wallace, Adam
Dziedzic, Rishabh Krishnan, and Dawn Song. 2020.
Ruohao Guo and Dan Roth. 2021. Constrained labeled Pretrained transformers improve out-of-distribution
data generation for low-resource named entity recog- robustness. In Proceedings of the 58th Annual Meet-
nition. In Findings of the Association for Computa- ing of the Association for Computational Linguistics,
tional Linguistics: ACL-IJCNLP 2021. pages 2744–2751, Online. Association for Computa-
tional Linguistics.
Christian Hadiwinoto, Hwee Tou Ng, and Wee Chung
Gan. 2019. Improved word sense disambiguation us- Sepp Hochreiter and Jürgen Schmidhuber. 1997.
ing pre-trained contextualized word representations. Long short-term memory. Neural computation,
In Proceedings of the 2019 Conference on Empirical 9(8):1735–1780.
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan- Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu,
guage Processing (EMNLP-IJCNLP), pages 5297– Semih Yavuz, and Richard Socher. 2020. A simple
5306, Hong Kong, China. Association for Computa- language model for task-oriented dialogue. arXiv
tional Linguistics. preprint arXiv:2005.00796.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Peter Izsak, Moshe Berchansky, and Omer Levy. 2021.
Bruna Morrone, Quentin De Laroussilhe, Andrea How to train bert with an academic budget. arXiv
Gesmundo, Mona Attariyan, and Sylvain Gelly. preprint arXiv:2104.07705.
2019. Parameter-Efficient Transfer Learning for
NLP. In International Conference on Machine Robin Jia, Mike Lewis, and Luke Zettlemoyer. 2021.
Learning, pages 2790–2799. PMLR. Question answering infused pre-training of general-
purpose contextualized representations. arXiv
Jeremy Howard and Sebastian Ruder. 2018. Universal preprint arXiv:2106.08190.
Language Model Fine-tuning for Text Classification.
In Proceedings of the 56th Annual Meeting of the Wei Jiang, Zhenghua Li, Yu Zhang, and Min Zhang.
Association for Computational Linguistics (Volume 2019. HLT@SUDA at SemEval-2019 task 1:
1: Long Papers), pages 328–339, Melbourne, Aus- UCCA graph parsing as constituent tree parsing.
tralia. Association for Computational Linguistics. In Proceedings of the 13th International Workshop
on Semantic Evaluation, pages 11–15, Minneapo-
Chao-Chun Hsu, Eric Lind, Luca Soldaini, and lis, Minnesota, USA. Association for Computational
Alessandro Moschitti. 2021. Answer generation for Linguistics.
retrieval-based question answering systems. In Find-
ings of the Association for Computational Linguis- Zhengbao Jiang, Antonios Anastasopoulos, Jun Araki,
tics (ACL-IJCNLP). Haibo Ding, and Graham Neubig. 2020. X-FACTR:
Multilingual factual knowledge retrieval from pre-
Kexin Huang, Jaan Altosaar, and Rajesh Ran- trained language models. In Proceedings of the
ganath. 2020. ClinicalBERT: Modeling Clini- 2020 Conference on Empirical Methods in Natural
cal Notes and Predicting Hospital Readmission. Language Processing (EMNLP), pages 5943–5959,
arXiv:1904.05342 [cs]. ArXiv: 1904.05342. Online. Association for Computational Linguistics.
Patrick Huber, Armen Aghajanyan, Barlas Oğuz, Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham
Dmytro Okhonko, Wen tau Yih, Sonal Gupta, and Neubig. 2021. How can we know when language
Xilun Chen. 2021. Ccqa: A new web-scale ques- models know? on the calibration of language mod-
tion answering dataset for model pre-training. arXiv els for question answering. Transactions of the As-
preprint arXiv:2110.07731. sociation of Computational Linguistics, 8:423–438.
criticism for interpretability. In Advances in Neu- Zhenzhong Lan, Mingda Chen, Sebastian Goodman,
ral Information Processing Systems 29: Annual Con- Kevin Gimpel, Piyush Sharma, and Radu Sori-
ference on Neural Information Processing Systems cut. 2020. ALBERT: A Lite BERT for Self-
(NIPS). supervised Learning of Language Representations.
arXiv:1909.11942 [cs].
Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John F.
Canny, and Zeynep Akata. 2018. Textual expla- Hang Le, Loı̈c Vial, Jibril Frej, Vincent Segonne, Max-
nations for self-driving vehicles. In In Proceed- imin Coavoux, Benjamin Lecouteux, Alexandre Al-
ings of the European conference on computer vision lauzen, Benoit Crabbé, Laurent Besacier, and Didier
(ECCV). Schwab. 2020. FlauBERT: Unsupervised language
model pre-training for French. In Proceedings of
Karen Kukich. 1983. Design of a knowledge-based re- the 12th Language Resources and Evaluation Con-
port generator. In 21st Annual Meeting of the Asso- ference, pages 2479–2490, Marseille, France. Euro-
ciation for Computational Linguistics (ACL). pean Language Resources Association.
Sawan Kumar and Partha Talukdar. 2021. Reorder-
ing examples helps during priming-based few-shot Teven Le Scao and Alexander Rush. 2021. How many
learning. In Findings of the Association for Com- data points is a prompt worth? In Proceedings of the
putational Linguistics: ACL-IJCNLP 2021, pages 2021 Conference of the North American Chapter of
4507–4518, Online. Association for Computational the Association for Computational Linguistics: Hu-
Linguistics. man Language Technologies, pages 2627–2636, On-
line. Association for Computational Linguistics.
Sawan Kumar and Partha P. Talukdar. 2020. NILE :
Natural language inference with faithful natural lan- Yann LeCun, Yoshua Bengio, and Geoffrey Hinton.
guage explanations. In Proceedings of the 58th An- 2015. Deep learning. nature, 521(7553):436–444.
nual Meeting of the Association for Computational
Linguistics, (ACL). Jinhyuk Lee, Wonjin Yoon, Sungdong Kim,
Donghyeon Kim, Sunkyu Kim, Chan Ho So,
Varun Kumar, Ashutosh Choudhary, and Eunah Cho. and Jaewoo Kang. 2019. BioBERT: a pre-
2020. Data augmentation using pre-trained trans- trained biomedical language representation model
former models. In arXiv. for biomedical text mining. Bioinformatics,
36(4):1234–1240.
Yuri Kuratov and Mikhail Arkhipov. 2019. Adaptation
of deep bidirectional multilingual transformers for Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018.
russian language. arXiv preprint arXiv:1905.07213. Higher-order coreference resolution with coarse-to-
fine inference. In Proceedings of the 2018 Confer-
John Lafferty and Chengxiang Zhai. 2001. Document ence of the North American Chapter of the Associ-
language models, query models, and risk minimiza- ation for Computational Linguistics: Human Lan-
tion for information retrieval. In Proceedings of the guage Technologies, Volume 2 (Short Papers), pages
24th Annual International ACM SIGIR Conference 687–692, New Orleans, Louisiana. Association for
on Research and Development in Information Re- Computational Linguistics.
trieval.
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021.
John D. Lafferty, Andrew McCallum, and Fernando
The power of scale for parameter-efficient prompt
C. N. Pereira. 2001. Conditional random fields:
tuning. In EMNLP.
Probabilistic models for segmenting and labeling se-
quence data. In Proceedings of the Eighteenth Inter- Hector Levesque, Ernest Davis, and Leora Morgen-
national Conference on Machine Learning, ICML stern. 2012. The winograd schema challenge. In
’01, page 282–289, San Francisco, CA, USA. Mor- Thirteenth International Conference on the Princi-
gan Kaufmann Publishers Inc. ples of Knowledge Representation and Reasoning.
Huiyuan Lai, Antonio Toral, and Malvina Nissim.
2021. Thank you bart! rewarding pre-trained mod- Omer Levy, Minjoon Seo, Eunsol Choi, and Luke
els improves formality style transfer. arXiv preprint Zettlemoyer. 2017. Zero-shot relation extraction via
arXiv:2105.06947. reading comprehension. In Proceedings of the 21st
Conference on Computational Natural Language
Guillaume Lample and Alexis Conneau. 2019. Cross- Learning (CoNLL 2017), pages 333–342, Vancou-
lingual language model pretraining. arXiv preprint ver, Canada. Association for Computational Linguis-
arXiv:. tics.
Guillaume Lample, Myle Ott, Alexis Conneau, Lu- Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
dovic Denoyer, and Marc’Aurelio Ranzato. 2018. jan Ghazvininejad, Abdelrahman Mohamed, Omer
Phrase-based & neural unsupervised machine trans- Levy, Veselin Stoyanov, and Luke Zettlemoyer.
lation. In Proceedings of the 2018 Conference on 2020. BART: Denoising sequence-to-sequence pre-
Empirical Methods in Natural Language Processing, training for natural language generation, translation,
pages 5039–5049, Brussels, Belgium. Association and comprehension. In Proceedings of the 58th An-
for Computational Linguistics. nual Meeting of the Association for Computational
Linguistics, pages 7871–7880, Online. Association Xin Li, Lidong Bing, Wenxuan Zhang, and Wai Lam.
for Computational Linguistics. 2019b. Exploiting BERT for End-to-End Aspect-
based Sentiment Analysis. In Proceedings of the 5th
Dianqi Li, Yizhe Zhang, Hao Peng, Liqun Chen, Chris Workshop on Noisy User-generated Text (W-NUT
Brockett, Ming-Ting Sun, and Bill Dolan. 2021a. 2019), pages 34–41, Hong Kong, China. Association
Contextualized perturbation for textual adversarial for Computational Linguistics.
attack. In Proceedings of the 2021 Conference of
the North American Chapter of the Association for Stephanie Lin, Jacob Hilton, and Owain Evans. 2021.
Computational Linguistics: Human Language Tech- TruthfulQA: Measuring How Models Mimic Hu-
nologies (NAACL-HLT). man Falsehoods. arXiv:2109.07958 [cs]. ArXiv:
2109.07958.
Fayuan Li, Weihua Peng, Yuguang Chen, Quan Wang,
Lu Pan, Yajuan Lyu, and Yong Zhu. 2020a. Event Jeffrey Ling, Nicholas FitzGerald, Zifei Shan,
extraction as multi-turn question answering. In Find- Livio Baldini Soares, Thibault Févry, David Weiss,
ings of the Association for Computational Linguis- and Tom Kwiatkowski. 2020. Learning cross-
tics: EMNLP 2020, pages 829–838, Online. Associ- context entity representations from text. arXiv
ation for Computational Linguistics. preprint arXiv:2001.03765.
Junyi Li, Tianyi Tang, Wayne Xin Zhao, and Ji- Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blun-
Rong Wen. 2021b. Pretrained language models som. 2017. Program induction by rationale genera-
for text generation: A survey. arXiv preprint tion: Learning to solve and explain algebraic word
arXiv:2105.10311. problems. In Proceedings of the 55th Annual Meet-
ing of the Association for Computational Linguistics
Kun Li, Chengbo Chen, Xiaojun Quan, Qing Ling, (ACL).
and Yan Song. 2020b. Conditional augmentation
for aspect term extraction via masked sequence-to- Tal Linzen. 2020. How can we accelerate progress to-
sequence generation. In Proceedings of the 58th An- wards human-like linguistic generalization? In Pro-
nual Meeting of the Association for Computational ceedings of the 58th Annual Meeting of the Asso-
Linguistics (ACL). ciation for Computational Linguistics, pages 5210–
5217, Online. Association for Computational Lin-
Qi Li, Heng Ji, Yu Hong, and Sujian Li. 2014. Con- guistics.
structing information networks using one single
Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg.
model. In Proceedings of the 2014 Conference on
Assessing the ability of lstms to learn syntax-
Empirical Methods in Natural Language Processing
sensitive dependencies. Transactions of the Associa-
(EMNLP). Association for Computational Linguis-
tion for Computational Linguistics, 4:521–535.
tics.
Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan,
Sha Li, Heng Ji, and Jiawei Han. 2021c. Document- Lawrence Carin, and Weizhu Chen. 2021a. What
level event argument extraction by conditional gener- makes good in-context examples for gpt-3? arXiv
ation. In Proceedings of the 2021 Conference of the preprint arXiv:2101.06804.
North American Chapter of the Association for Com-
putational Linguistics: Human Language Technolo- Jian Liu, Yubo Chen, Kang Liu, Wei Bi, and Xiaojiang
gies, pages 894–908, Online. Association for Com- Liu. 2020a. Event extraction as machine reading
putational Linguistics. comprehension. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language
Xiang Lisa Li and Percy Liang. 2021. Prefix- Processing (EMNLP), pages 1641–1651, Online. As-
tuning: Optimizing continuous prompts for genera- sociation for Computational Linguistics.
tion. arXiv preprint arXiv:2101.00190.
Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang,
Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Hiroaki Hayashi, and Graham Neubig. 2021b. Pre-
Han, Fei Wu, and Jiwei Li. 2020c. A unified MRC train, Prompt, and Predict: A Systematic Survey of
framework for named entity recognition. In Pro- Prompting Methods in Natural Language Processing.
ceedings of the 58th Annual Meeting of the Asso- arXiv:2107.13586 [cs]. ArXiv: 2107.13586.
ciation for Computational Linguistics, pages 5849–
5859, Online. Association for Computational Lin- Ruibo Liu, Guangxuan Xu, Chenyan Jia, Weicheng
guistics. Ma, Lili Wang, and Soroush Vosoughi. 2020b. Data
boost: Text data augmentation through reinforce-
Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna ment learning guided conditional generation. In Pro-
Yuan, Duo Chai, Mingxin Zhou, and Jiwei Li. 2019a. ceedings of the 2020 Conference on Empirical Meth-
Entity-relation extraction as multi-turn question an- ods in Natural Language Processing (EMNLP).
swering. In Proceedings of the 57th Annual Meet-
ing of the Association for Computational Linguis- Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding,
tics, pages 1340–1350, Florence, Italy. Association Yujie Qian, Zhilin Yang, and Jie Tang. 2021c. Gpt
for Computational Linguistics. understands, too. arXiv preprint arXiv:2103.10385.
Yang Liu and Mirella Lapata. 2019. Text summa- me a hint? generating inference graphs for defea-
rization with pretrained encoders. arXiv preprint sible reasoning. In Findings of the Association for
arXiv:1908.08345. Computational Linguistics: ACL-IJCNLP 2021.
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Aman Madaan, Dheeraj Rajagopal, Yiming Yang, Ab-
Edunov, Marjan Ghazvininejad, Mike Lewis, and hilasha Ravichander, Eduard Hovy, and Shrimai
Luke Zettlemoyer. 2020c. Multilingual denoising Prabhumoye. 2020. EIGEN: event influence genera-
pre-training for neural machine translation. arXiv tion using pre-trained language models. In arXiv.
preprint arXiv:2001.08210.
Nishtha Madaan, Inkit Padhi, Naveen Panwar, and Dip-
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- tikalyan Saha. 2021b. Generate your counterfactu-
dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, als: Towards controlled counterfactual generation
Luke Zettlemoyer, and Veselin Stoyanov. 2019. for text. In Thirty-Fifth AAAI Conference on Arti-
RoBERTa: A Robustly Optimized BERT Pretrain- ficial Intelligence, (AAAI).
ing Approach. arXiv:1907.11692 [cs]. Eric Malmi, Sebastian Krause, Sascha Rothe, Daniil
Robert L. Logan IV, Ivana Balažević, Eric Wallace, Mirylenka, and Aliaksei Severyn. 2019. Encode,
Fabio Petroni, Sameer Singh, and Sebastian Riedel. tag, realize: High-precision text editing. In Proceed-
2021. Cutting down on prompts and parameters: ings of the 2019 Conference on Empirical Methods
Simple few-shot learning with language models. in Natural Language Processing and the 9th Inter-
arXiv preprint arXiv:2106.13353. national Joint Conference on Natural Language Pro-
cessing (EMNLP-IJCNLP), pages 5054–5065, Hong
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Kong, China. Association for Computational Lin-
Riedel, and Pontus Stenetorp. 2021a. Fantastically guistics.
ordered prompts and where to find them: Over- Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong
coming few-shot prompt order sensitivity. arXiv Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen.
preprint arXiv:2104.08786. 2021. Generation-augmented retrieval for open-
domain question answering. In Proceedings of the
Yaojie Lu, Hongyu Lin, Jin Xu, Xianpei Han, Jialong
59th Annual Meeting of the Association for Compu-
Tang, Annan Li, Le Sun, Meng Liao, and Shaoyi
tational Linguistics and the 11th International Joint
Chen. 2021b. Text2Event: Controllable sequence-
Conference on Natural Language Processing (ACL).
to-structure generation for end-to-end event extrac-
tion. In Proceedings of the 59th Annual Meeting of Louis Martin, Angela Fan, Éric de la Clergerie, An-
the Association for Computational Linguistics and toine Bordes, and Benoı̂t Sagot. 2021. Muss: Multi-
the 11th International Joint Conference on Natural lingual unsupervised sentence simplification by min-
Language Processing (ACL). ing paraphrases. arXiv preprint arXiv:2005.00352.
Scott M. Lundberg and Su-In Lee. 2017. A unified Louis Martin, Benjamin Muller, Pedro Javier Or-
approach to interpreting model predictions. In Ad- tiz Suárez, Yoann Dupont, Laurent Romary, Éric
vances in Neural Information Processing Systems de la Clergerie, Djamé Seddah, and Benoı̂t Sagot.
(NIPS). 2020. CamemBERT: a tasty French language model.
In Proceedings of the 58th Annual Meeting of the
Qing Lyu, Hongming Zhang, Elior Sulem, and Dan Association for Computational Linguistics, pages
Roth. 2021. Zero-shot event extraction via trans- 7203–7219, Online. Association for Computational
fer learning: Challenges and insights. In Proceed- Linguistics.
ings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th Interna- Yosi Mass, Boaz Carmeli, Haggai Roitman, and David
tional Joint Conference on Natural Language Pro- Konopnicki. 2020. Unsupervised FAQ retrieval with
cessing (Volume 2: Short Papers), pages 322–332, question generation and BERT. In Proceedings of
Online. Association for Computational Linguistics. the 58th Annual Meeting of the Association for Com-
putational Linguistics (ACL).
Shuming Ma, Jian Yang, Haoyang Huang, Zewen Chi,
Li Dong, Dongdong Zhang, Hany Hassan Awadalla, R. Thomas McCoy, Robert Frank, and Tal Linzen.
Alexandre Muzio, Akiko Eriguchi, Saksham Sing- 2020. Does syntax need to grow on trees? sources of
hal, Xia Song, Arul Menezes, and Furu Wei. 2020. hierarchical inductive bias in sequence-to-sequence
Xlm-t: Scaling up multilingual machine translation networks. Transactions of the Association for Com-
with pretrained cross-lingual transformer encoders. putational Linguistics, 8:125–140.
arXiv preprint arXiv:2012.15547.
Julian Michael, Gabriel Stanovsky, Luheng He, Ido Da-
Brian MacWhinney. 2000. The CHILDES Project: gan, and Luke Zettlemoyer. 2018. Crowdsourcing
Tools for analyzing talk. transcription format and question-answer meaning representations. In Pro-
programs, volume 1. Psychology Press. ceedings of the 2018 Conference of the North Amer-
ican Chapter of the Association for Computational
Aman Madaan, Dheeraj Rajagopal, Niket Tandon, Yim- Linguistics: Human Language Technologies, Vol-
ing Yang, and Eduard Hovy. 2021a. Could you give ume 2 (Short Papers), pages 560–568, New Orleans,
Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2020. AmbigQA: Answering ambiguous open-domain questions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.
Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using LSTMs on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics.
Khalil Mrini, Franck Dernoncourt, Quan Hung Tran, Trung Bui, Walter Chang, and Ndapa Nakashole. 2020. Rethinking self-attention: Towards interpretability in neural parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 731–742, Online. Association for Computational Linguistics.
Mahdi Namazifar, Alexandros Papangelis, Gokhan Tur, and Dilek Hakkani-Tür. 2020. Language model is all you need: Natural language understanding as question answering. arXiv preprint arXiv:2011.03023.
Tahira Naseem, Abhishek Shah, Hui Wan, Radu Florian, Salim Roukos, and Miguel Ballesteros. 2019. Rewarding Smatch: Transition-based AMR parsing with reinforcement learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4586–4592, Florence, Italy. Association for Computational Linguistics.
Dat Quoc Nguyen, Thanh Vu, and Anh Tuan Nguyen. 2020a. BERTweet: A pre-trained language model for English tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 9–14, Online. Association for Computational Linguistics.
Minh Van Nguyen, Viet Dac Lai, Amir Pouran Ben Veyseh, and Thien Huu Nguyen. 2021. Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 80–90, Online. Association for Computational Linguistics.
Thien Huu Nguyen, Kyunghyun Cho, and Ralph Grishman. 2016. Joint event extraction via recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Xuan-Phi Nguyen, Shafiq Joty, Steven C. H. Hoi, and Richard Socher. 2020b. Tree-structured attention with hierarchical accumulation. arXiv preprint arXiv:2002.08046.
Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. 2020. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics (EMNLP).
Abiola Obamuyide and Andreas Vlachos. 2018. Zero-shot relation classification as textual entailment. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), pages 72–78, Brussels, Belgium. Association for Computational Linguistics.
Kostiantyn Omelianchuk, Vipul Raheja, and Oleksandr Skurzhanskyi. 2021. Text Simplification by Tagging. In Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pages 11–25, Online. Association for Computational Linguistics.
Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021a. Zero-shot fact verification by claim generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL).
Liangming Pan, Wenhu Chen, Wenhan Xiong, Min-Yen Kan, and William Yang Wang. 2021b. Unsupervised multi-hop question answering by question generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
Giovanni Paolini, Ben Athiwaratkun, Jason Krone, Jie Ma, Alessandro Achille, Rishita Anubhai, Cicero dos Santos Nogueira, Bing Xiang, and Stefano Soatto. 2021. Structured prediction as translation between augmented natural languages. In Proceedings of the 9th International Conference on Learning Representations (ICLR).
Yannis Papanikolaou and Andrea Pierleoni. 2020. DARE: Data augmented relation extraction with GPT-2. arXiv preprint arXiv:2004.13845.
Loreto Parisi, Simone Francia, and Paolo Magnani. 2020. UmBERTo: an Italian language model trained with whole word masking. GitHub.
Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, and Marcus Rohrbach. 2018. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Debjit Paul and Anette Frank. 2021. COINS: Dynamically generating contextualized inference rules for narrative story completion. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL).
Baolin Peng, Chenguang Zhu, Michael Zeng, and Jianfeng Gao. 2020. Data augmentation for spoken language understanding via pretrained models.
Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. arXiv preprint arXiv:2105.11447.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rocktäschel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. How context affects language models’ factual predictions. In AKBC.
Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aishwarya Kamath, Ivan Vulić, Sebastian Ruder, Kyunghyun Cho, and Iryna Gurevych. 2020a. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 46–54, Online. Association for Computational Linguistics.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020b. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online. Association for Computational Linguistics.
Marco Polignano, Pierpaolo Basile, Marco Degemmis, Giovanni Semeraro, and Valerio Basile. 2019. AlBERTo: Italian BERT language understanding model for NLP challenging tasks based on tweets. In CLiC-it.
Jay M. Ponte and W. Bruce Croft. 1998. A language modeling approach to information retrieval. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Jakob Prange, Nathan Schneider, and Vivek Srikumar. 2021. Supertagging the long tail with tree-structured decoding of complex categories. Transactions of the Association for Computational Linguistics, 9(0):243–260.
Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3208–3229, Online. Association for Computational Linguistics.
Raul Puri and Bryan Catanzaro. 2019. Zero-shot text classification with generative language models. arXiv preprint arXiv:1912.10165.
Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212, Online. Association for Computational Linguistics.
XiPeng Qiu, TianXiang Sun, YiGe Xu, YunFan Shao, Ning Dai, and XuanJing Huang. 2020. Pre-trained models for natural language processing: A survey. Science China Technological Sciences, 63(10):1872–1897.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training. OpenAI blog, page 12.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
Evani Radiya-Dixit and Xin Wang. 2020. How fine can fine-tuning be? Learning efficient language models. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 2435–2443. PMLR.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In Journal of Machine Learning Research.
Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
Giulio Ravasio and Leonardo Di Perna. 2020. GilBERTo: An Italian pretrained language model based on RoBERTa. https://github.com/idb-ita/GilBERTo. GitHub.
Liliang Ren, Chenkai Sun, Heng Ji, and Julia Hockenmaier. 2021. HySPA: Hybrid span generation for scalable text-to-graph extraction. In Findings of the Association for Computational Linguistics (ACL-IJCNLP).
Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, CHI EA ’21, New York, NY, USA. Association for Computing Machinery.
Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the Demonstrations Session of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL).
Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.
Subendhu Rongali, Luca Soldaini, Emilio Monti, and Wael Hamza. 2020. Don’t parse, generate! A sequence to sequence architecture for task-oriented semantic parsing. In Proceedings of the International World Wide Web Conference (WWW).
Alexis Ross, Ana Marasovic, and Matthew E. Peters. 2021. Explaining NLP models via minimal contrastive editing (MiCE). In Findings of the Association for Computational Linguistics (ACL-IJCNLP).
Hayley Ross, Jonathon Cai, and Bonan Min. 2020. Exploring Contextualized Neural Language Models for Temporal Dependency Parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8548–8553, Online. Association for Computational Linguistics.
Dan Roth. 2017. Incidental supervision: Moving beyond supervised learning. In Proc. of the Conference on Artificial Intelligence (AAAI).
Rachel Rudinger, Vered Shwartz, Jena D. Hwang, Chandra Bhagavatula, Maxwell Forbes, Ronan Le Bras, Noah A. Smith, and Yejin Choi. 2020. Thinking like a skeptic: Defeasible inference in natural language. In Findings of the Association for Computational Linguistics (EMNLP).
Devendra Sachan, Yuhao Zhang, Peng Qi, and William L. Hamilton. 2021. Do syntax trees help pre-trained transformers extract information? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2647–2661, Online. Association for Computational Linguistics.
Ali Safaya, Moutasem Abdullatif, and Deniz Yuret. 2020. KUISAIL at SemEval-2020 task 12: BERT-CNN for offensive speech identification in social media. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 2054–2059, Barcelona (online). International Committee for Computational Linguistics.
Oscar Sainz, Oier Lopez de Lacalle, Gorka Labaka, Ander Barrena, and Eneko Agirre. 2021. Label verbalization and entailment for effective zero- and few-shot relation extraction. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic. Association for Computational Linguistics.
Oscar Sainz and German Rigau. 2021. Ask2Transformers: Zero-shot domain labelling with pretrained language models. In Proceedings of the 11th Global Wordnet Conference, pages 44–52, University of South Africa (UNISA). Global Wordnet Association.
Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712, Online. Association for Computational Linguistics.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.0820.
Cicero Nogueira dos Santos, Xiaofei Ma, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Beyond [CLS] through ranking by generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Maarten Sap, Ronan LeBras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. arXiv preprint arXiv:1811.00146.
Timo Schick, Helmut Schmid, and Hinrich Schütze. 2020. Automatically identifying words that can serve as labels for few-shot text classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5569–5578, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Timo Schick and Hinrich Schütze. 2020. BERTRAM: Improved word embeddings have big impact on contextualized model performance. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3996–4007, Online. Association for Computational Linguistics.
Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.
Timo Schick and Hinrich Schütze. 2021b. It’s not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.
Timo Schick and Hinrich Schütze. 2019. Rare words: A major problem for contextualized embeddings and how to fix it by attentive mimicking. arXiv preprint arXiv:1904.06707.
Timo Schick and Hinrich Schütze. 2020. Few-shot text generation with pattern-exploiting training. arXiv preprint arXiv:2012.11926.
Timo Schick and Hinrich Schütze. 2021. Generating datasets with pretrained language models. arXiv preprint arXiv:2104.07540.
Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in NLP. arXiv preprint arXiv:2103.00453. To appear in TACL.
Tal Schuster, Ori Ram, Regina Barzilay, and Amir Globerson. 2019. Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1599–1613, Minneapolis, Minnesota. Association for Computational Linguistics.
Roy Schwartz, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2020. Green AI. Communications of the ACM, 63(12):54–63.
Stefan Schweter. 2020. BERTurk - BERT models for Turkish.
Amit Seker, Elron Bandel, Dan Bareket, Idan Brusilovsky, Refael Shaked Greenfeld, and Reut Tsarfaty. 2021. AlephBERT: A Hebrew large pre-trained language model to start-off your Hebrew NLP application with. arXiv preprint arXiv:2104.04052.
Siamak Shakeri, Cicero Nogueira dos Santos, Henghui Zhu, Patrick Ng, Feng Nan, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2020. End-to-end synthetic data generation for domain adaptation of question answering systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. arXiv preprint arXiv:1904.05255.
Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers.
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. 2020a. BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS.
Fábio Souza, Rodrigo Nogueira, and Roberto Lotufo. 2020b. Portuguese named entity recognition using BERT-CRF. arXiv preprint arXiv:1909.10649.
Asa Cooper Stickland and Iain Murray. 2019. BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. In Proceedings of the 36th International Conference on Machine Learning, pages 5986–5995. PMLR. ISSN: 2640-3498.
Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2020. How to fine-tune BERT for text classification? arXiv preprint arXiv:1905.05583.
Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, and Haifeng Wang. 2021. ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv preprint arXiv:2107.02137.
Ehsan Taher, Seyed Abbas Hoseini, and Mehrnoush Shamsfard. 2019. Beheshti-NER: Persian named entity recognition using BERT. In Proceedings of The First International Workshop on NLP Solutions for Under Resourced Languages (NSURL 2019) co-located with ICNLSP 2019 - Short Papers, pages 37–42, Trento, Italy. Association for Computational Linguistics.
Alon Talmor, Yanai Elazar, Yoav Goldberg, and Jonathan Berant. 2020. oLMpics – on what language model pre-training captures. arXiv preprint arXiv:1912.13283.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4731–4736, Brussels, Belgium. Association for Computational Linguistics.
Trieu H. Trinh and Quoc V. Le. 2019. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. arXiv preprint arXiv:2106.13884.
Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and Gertjan van Noord. 2020. UDapter: Language adaptation for truly Universal Dependency parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2302–2315, Online. Association for Computational Linguistics.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Amir Pouran Ben Veyseh, Viet Lai, Franck Dernoncourt, and Thien Huu Nguyen. 2021a. Unleash GPT-2 power for event detection. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6271–6282.
Amir Pouran Ben Veyseh, Minh Van Nguyen, Bonan Min, and Thien Huu Nguyen. 2021b. Augmenting open-domain event detection with synthetic data from GPT-2. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. Advances in Neural Information Processing Systems, 28:2692–2700.
Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076.
Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv preprint arXiv:1912.09582.
Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2018. Counterfactual explanations without opening the black box: Automated decisions and the GDPR.
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.
Ben Wang. 2021. Mesh-Transformer-JAX: Model-Parallel Implementation of Transformer Language Model with JAX. https://github.com/kingoflolz/mesh-transformer-jax.
Cunxiang Wang, Pai Liu, and Yue Zhang. 2021a. Can generative pre-trained language models serve as knowledge bases for closed-book QA? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3241–3251, Online. Association for Computational Linguistics.
Sinong Wang, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. 2021b. Entailment as few-shot learner. arXiv preprint arXiv:2104.14690.
Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021c. Automated concatenation of embeddings for structured prediction. arXiv preprint arXiv:2010.05006.
Xinyu Wang and Kewei Tu. 2020. Second-order neural dependency parsing with message passing and end-to-end training. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 93–99, Suzhou, China. Association for Computational Linguistics.
Alex Warstadt, Alicia Parrish, Haokun Liu, Anhad Mohananey, Wei Peng, Sheng-Fu Wang, and Samuel R. Bowman. 2020a. BLiMP: A benchmark of linguistic minimal pairs for English. In Proceedings of the Society for Computation in Linguistics 2020, pages 409–410, New York, New York. Association for Computational Linguistics.
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. CoLA: The Corpus of Linguistic Acceptability (with added annotations). http://nyu-mll.github.io/cola.
Alex Warstadt, Yian Zhang, Xiaocheng Li, Haokun Liu, and Samuel R. Bowman. 2020b. Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 217–235, Online. Association for Computational Linguistics.
Albert Webson and Ellie Pavlick. 2021. Do prompt-based models really understand the meaning of their prompts? arXiv preprint arXiv:2109.01247.
Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Fine-tuned Language Models Are Zero-Shot Learners. arXiv preprint arXiv:2109.01652.
Rongxiang Weng, Heng Yu, Shujian Huang, Shanbo Cheng, and Weihua Luo. 2020. Acquiring knowledge from pre-trained model to neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9266–9273.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Ledell Wu, Fabio Petroni, Martin Josifoski, Sebastian Riedel, and Luke Zettlemoyer. 2020a. Scalable zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814.
Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification. arXiv preprint arXiv:1905.0828.
Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, and Daniel S. Weld. 2021a. Polyjuice: Generating counterfactuals for explaining, evaluating, and improving models. arXiv preprint arXiv:2101.00288.
Wei Wu, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. 2020b. CorefQA: Coreference resolution as query-based span prediction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6953–6963, Online. Association for Computational Linguistics.
Zhaofeng Wu, Hao Peng, and Noah A. Smith. 2021b. Infusing finetuning with semantic dependencies. Transactions of the Association for Computational Linguistics, 9:226–242.
Dongqin Xu, Junhui Li, Muhua Zhu, Min Zhang, and Guodong Zhou. 2020. Improving AMR parsing with sequence-to-sequence pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2501–2511, Online. Association for Computational Linguistics.
Zenan Xu, Daya Guo, Duyu Tang, Qinliang Su, Linjun Shou, Ming Gong, Wanjun Zhong, Xiaojun Quan, Daxin Jiang, and Nan Duan. 2021. Syntax-enhanced pre-trained model. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5412–5422, Online. Association for Computational Linguistics.
Hang Yan, Junqi Dai, Tuo Ji, Xipeng Qiu, and Zheng Zhang. 2021a. A unified generative framework for aspect-based sentiment analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
Hang Yan, Tao Gui, Junqi Dai, Qipeng Guo, Zheng Zhang, and Xipeng Qiu. 2021b. A unified generative framework for various NER subtasks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL).
Kaiyu Yang and Jia Deng. 2020. Strongly incremental constituency parsing with graph neural networks. arXiv preprint arXiv:2010.14568.
Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. 2020. G-DAUG: Generative data augmentation for commonsense reasoning. In Findings of the Association for Computational Linguistics (EMNLP).
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
Jiarui Yao, Haoling Qiu, Jin Zhao, Bonan Min, and Nianwen Xue. 2021. Factuality assessment as modal dependency parsing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1540–1550, Online. Association for Computational Linguistics.
Wenpeng Yin, Jamaal Hay, and Dan Roth. 2019. Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3914–3923, Hong Kong, China. Association for Computational Linguistics.
Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc.
Jianfei Yu, Chenggong Gong, and Rui Xia. 2021a. Cross-domain review generation for aspect-based sentiment analysis. In Findings of the Association for Computational Linguistics (ACL-IJCNLP).
Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu, Qingyun Wang, Heng Ji, and Meng Jiang. 2021b. A survey of knowledge-enhanced text generation. arXiv preprint arXiv:2010.04389.
Xiaodong Yu, Wenpeng Yin, and Dan Roth. 2020. Paired representation learning for event and entity coreference. arXiv preprint arXiv:2010.12808.
Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. arXiv preprint arXiv:2106.11520.
Karolina Zaczynska, Nils Feldhus, Robert Schwarzenberg, Aleksandra Gabryszak, and Sebastian Möller. 2020. Evaluating German transformer language models with syntactic agreement tests. In Proceedings of the 5th Swiss Text Analytics Conference and the 16th Conference on Natural Language Processing, SwissText/KONVENS 2020, Zurich, Switzerland, June 23-25, 2020, volume abs/2007.03765, Zurich, Switzerland. CEUR Workshop Proceedings.
Elad Ben Zaken, Shauli Ravfogel, and Yoav Goldberg. 2021. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199.
Danqing Zhang, Tao Li, Haiyang Zhang, and Bing Yin. 2020a. On data augmentation for extreme multi-label classification. arXiv preprint arXiv:2009.10778.
Haoyu Zhang, Jingjing Cai, Jianjun Xu, and Ji Wang. 2019a. Pretraining-based natural language generation for text summarization. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 789–797, Hong Kong, China. Association for Computational Linguistics.
Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. 2020b. Side-tuning: A baseline for network adaptation via additive side networks. arXiv preprint arXiv:1912.13503.
Sheng Zhang, Xutai Ma, Kevin Duh, and Benjamin Van Durme. 2019b. AMR parsing as sequence-to-graph transduction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 80–94, Florence, Italy. Association for Computational Linguistics.
Wenxuan Zhang, Xin Li, Yang Deng, Lidong Bing, and Wai Lam. 2021a. Towards generative aspect-based sentiment analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL).
Yian Zhang, Alex Warstadt, Xiaocheng Li, and Samuel R. Bowman. 2021b. When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1112–1125, Online. Association for Computational Linguistics.
Yu Zhang, Houquan Zhou, and Zhenghua Li. 2020c. Fast and accurate neural CRF constituency parsing. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence.
Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2020d. Semantics-aware BERT for language understanding. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI).
Mengjie Zhao, Tao Lin, Fei Mi, Martin Jaggi, and Hinrich Schütze. 2020a. Masking as an efficient alternative to finetuning for pretrained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2226–2241, Online. Association for Computational Linguistics.
the AAAI Conference on Artificial Intelligence, volume 35, pages 14638–14646.
Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2020. Incorporating BERT into neural machine translation. arXiv preprint arXiv:2002.06823.
A PLMs for specialized domains or languages
Table 6 shows PLMs for special domains. Table 7 presents PLMs pre-trained on different languages.
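The checkpoints listed in Tables 6 and 7 are typically used through the Transformers library (Wolf et al., 2020). The sketch below is a minimal illustration rather than part of the original tables; the SciBERT checkpoint identifier is an illustrative assumption and can be swapped for any other model named in the tables.

# Minimal sketch: loading a domain-specific PLM from Table 6 with Hugging Face
# Transformers (Wolf et al., 2020). The checkpoint name is illustrative.
from transformers import AutoModel, AutoTokenizer

checkpoint = "allenai/scibert_scivocab_uncased"  # assumed hub ID; substitute any model from Tables 6-7

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

# Encode a sentence and inspect the contextual embeddings produced by the PLM.
inputs = tokenizer("Scientific text benefits from in-domain pre-training.",
                   return_tensors="pt")
outputs = model(**inputs)
contextual_embeddings = outputs.last_hidden_state  # shape: (batch, tokens, hidden size)
print(contextual_embeddings.shape)

The same pattern applies to the language-specific models in Table 7, since they share the standard model and tokenizer interfaces.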
Table 6: PLMs for specialized domains.
Model | Domain | Training Sources
SciBERT (Beltagy et al., 2019) | Science | Scientific articles in computer science and biomedicine
BioBERT (Lee et al., 2019) | Biomedical | Biomedical publications (abstracts and full-text articles)
ClinicalBERT (Huang et al., 2020), Alsentzer et al. (2019) | Clinical | Clinical notes
LegalBERT (Chalkidis et al., 2020) | Legal | Legal documents (e.g. contracts)
CodeBERT (Feng et al., 2020b), Codex (Chen et al., 2021b) | Source code | GitHub repositories
BERTweet (Nguyen et al., 2020a), AlBERTo (for Italian; Polignano et al., 2019) | Twitter | Tweets
BabyBERTa (Huebner et al., 2021) | Child-directed speech | Child-directed speech transcriptions
Table 7: PLMs pre-trained on different languages.
Language | Model
Arabic | Arabic-BERT (Safaya et al., 2020)
Basque | BERTeus (Agerri et al., 2020)
Chinese | MacBERT (Cui et al., 2020)
Dutch | BERTje (de Vries et al., 2019), RobBERT (Delobelle et al., 2020)
Farsi | ParsBERT (Farahani et al., 2021)
Finnish | FinBERT (Virtanen et al., 2019)
French | CamemBERT (Martin et al., 2020), FlauBERT (Le et al., 2020)
German | GBERT and GELECTRA (Chan et al., 2020)
Hebrew | HeBERT (Chriqui and Yahav, 2021), AlephBERT (Seker et al., 2021)
Italian | GilBERTo (Ravasio and Perna, 2020), UmBERTo (Parisi et al., 2020)
Japanese | Japanese BERT (Inui Laboratory, 2021)
Portuguese | BERTimbau (Souza et al., 2020a)
Russian | RuBERT (Kuratov and Arkhipov, 2019)
Spanish | BETO (Cañete et al., 2020)
Turkish | BERTurk (Schweter, 2020)
Task | Work | PLM
(1) Contextual Embeddings
Word Sense disambiguation/induction | Hadiwinoto et al. (2019), Amrami and Goldberg (2019) | BERT
Coreference resolution | (Lee et al., 2018) | ELMo
Constituency parsing | (Zhang et al., 2020c) | BERT
 | Yang and Deng (2020) | XLNet, BERT
 | Zhou and Zhao (2019) | ELMo, BERT
Constituency parsing, Dependency parsing | (Mrini et al., 2020) | XLNet, BERT
Dependency parsing | Wang and Tu (2020) | BERT
 | Schuster et al. (2019) | ELMo
Semantic role labeling | (He et al., 2018) | ELMo
AMR parsing | Cai and Lam (2020), Xu et al. (2020) | BERT
UCCA parsing | (Jiang et al., 2019) | BERT, mBERT
Commonsense reasoning | ATOMIC (Sap et al., 2019) | ELMo
Machine Translation | Zhu et al. (2020) | BERT
Text summarization | (Zhang et al., 2019a) | BERT
(2) Fine-tuning the PLM
Text classification | (Yang et al., 2019) | XLNet
 | (Sun et al., 2020) | BERT
 | (Peters et al., 2018) | ELMo
Semantic textual similarity | (Yang et al., 2019) | XLNet
NER | ELMo (Peters et al., 2018) | ELMo
NER, QA, Textual Entailment (TE) | Devlin et al. (2019) | BERT
TE | (Liu et al., 2019) | RoBERTa
Entity Linking | Broscheit (2019), Ling et al. (2020), (Wu et al., 2020a) | BERT
Relation extraction | (Baldini Soares et al., 2019; Wu and He, 2019; Shi and Lin, 2019) | BERT
Intent Detection and Slot Filling | (Chen et al., 2019) | BERT
 | XLNet (Yang et al., 2019) | XLNet
Text generation | (Kale and Rastogi, 2020) | T5
Coreference resolution | (Joshi et al., 2019) | BERT
 | (Yu et al., 2020) | RoBERTa
Text simplification | (Martin et al., 2021) | BART/mBART
 | (Raffel et al., 2020) | T5
Dialogue | (Hosseini-Asl et al., 2020) | GPT-2
Semantic role labeling | (Shi and Lin, 2019) | BERT
Text summarization | (Liu and Lapata, 2019) | BERT
 | (Lewis et al., 2020) | BART
Commonsense reasoning | COMET (Bosselut et al., 2019) | GPT
 | ATOMIC2020 (Hwang et al., 2020) | GPT2
Machine Translation | (Liu et al., 2020c) | mBART
 | (Lample and Conneau, 2019) | XLM
(3) Fine-tuning Customized Models
NER, POS tagging, dependency parsing, aspect extraction | ACE (Wang et al., 2021c) | ELMo, (m)BERT, XLM-R
Semantic parsing | (Che et al., 2019) | BERT
Temporal relation extraction | (Ross et al., 2020) | BERT
Text simplification | (Omelianchuk et al., 2021) | RoBERTa
Text simplification, summarization | (Malmi et al., 2019) | BERT
Coreference resolution | CorefQA (Wu et al., 2020b) | SpanBERT
Machine Translation | (Weng et al., 2020) | BERT, GPT
 | (Ma et al., 2020) | XLM-R
CCG parsing | Tree-structured supertagger (Prange et al., 2021) | RoBERTa-base
(4) Efficient Fine-tuning Approaches
 | BitFit (Zaken et al., 2021) |
 | Adapter-Transformer (Pfeiffer et al., 2020a) |
POS tagging, dependency parsing | Trankit (Nguyen et al., 2021) | XLM-R
Table 8: A summary of prior work organized by the strategies in the first paradigm “pretrain then fine-tune”.
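To make the strategies in Table 8 concrete, the following minimal sketch (not from the original paper) contrasts which parameters receive gradients under strategies (1), (2) and (4), using PyTorch and Hugging Face Transformers; the "bert-base-uncased" checkpoint and the two-label classification head are illustrative assumptions, and BitFit (Zaken et al., 2021) is approximated by training only bias terms.

# Minimal sketch contrasting strategies (1), (2) and (4) from Table 8 in terms of
# which parameters are updated. Assumes PyTorch and Hugging Face Transformers;
# the checkpoint name and num_labels=2 are illustrative choices.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# (1) Contextual embeddings: freeze the PLM and train only the task-specific head.
for param in model.base_model.parameters():
    param.requires_grad = False

# (2) Fine-tuning the PLM: update every parameter (the default behaviour).
for param in model.parameters():
    param.requires_grad = True

# (4) Efficient fine-tuning, BitFit-style (Zaken et al., 2021): update bias terms only.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

Strategy (3), fine-tuning customized models, instead adds task-specific architecture on top of or around the PLM before training, which is omitted here for brevity.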