Rethinking Semi-supervised Learning with Language Models

Zhengxiang Shi1∗  Francesco Tonolini2  Nikolaos Aletras2,3  Emine Yilmaz1,2  Gabriella Kazai2  Yunlong Jiao2
1 University College London, London, United Kingdom
2 Amazon, London, United Kingdom
3 University of Sheffield, Sheffield, United Kingdom
[Link].19@[Link]
{tonolini,eminey,aletras,gkazai,jyunlong}@[Link]

Abstract

Semi-supervised learning (SSL) is a popular setting aiming to effectively utilize unlabelled data to improve model performance in downstream natural language processing (NLP) tasks. Currently, there are two popular approaches to make use of unlabelled data: Self-training (ST) and Task-adaptive pre-training (TAPT). ST uses a teacher model to assign pseudo-labels to the unlabelled data, while TAPT continues pre-training on the unlabelled data before fine-tuning. To the best of our knowledge, the effectiveness of TAPT in SSL tasks has not been systematically studied, and no previous work has directly compared TAPT and ST in terms of their ability to utilize the pool of unlabelled data. In this paper, we provide an extensive empirical study comparing five state-of-the-art ST approaches and TAPT across various NLP tasks and data sizes, including in- and out-of-domain settings. Surprisingly, we find that TAPT is a strong and more robust SSL learner, even when using just a few hundred unlabelled samples or in the presence of domain shifts, compared to more sophisticated ST approaches, and tends to bring greater improvements in SSL than in fully-supervised settings. Our further analysis demonstrates the risks of using ST approaches when the size of labelled or unlabelled data is small or when domain shifts exist. We offer a fresh perspective for future SSL research, suggesting the use of unsupervised pre-training objectives over dependency on pseudo labels.[1]

1 Introduction

Pre-training (PT) language models (LMs) (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019) over large amounts of text data (e.g. with masked language modelling) and then fine-tuning on task-specific labelled data offer large performance gains across NLP tasks. Semi-supervised learning (SSL) (Grandvalet and Bengio, 2004; Chapelle et al., 2009; Kipf and Welling, 2017) is a powerful and effective approach to utilizing unlabelled data. A typical SSL setting assumes access to a (relatively small) labelled training set and an (often large) unlabelled set. The goal of SSL is to make effective use of the unlabelled data to improve model (i.e. LM) performance.

In NLP, Self-training (ST) approaches have been proposed to produce pseudo labels for unlabelled examples to train the model (e.g. Yarowsky, 1995; McClosky et al., 2006). With the advent of neural networks, ST approaches typically focus on using student-teacher models to assign pseudo-labels to the unlabelled data (e.g. Artetxe et al., 2018; Cai and Lapata, 2019; Dong and de Melo, 2019; Xie et al., 2020a; Gera et al., 2022). Apart from these sophisticated ST approaches, Gururangan et al. (2020) proposed task-adaptive pre-training (TAPT), a straightforward yet effective method for utilising unlabelled examples: it continues pre-training the LM on the task-specific data without using labels, before proceeding with fully-supervised fine-tuning. TAPT and ST are both motivated by the need to effectively leverage unlabelled examples, raising the questions of how TAPT performs in SSL tasks, and how these two approaches perform against each other.

In this work, we investigate the performance of TAPT against five state-of-the-art ST approaches across five NLP tasks (§4). We empirically show that TAPT outperforms all state-of-the-art ST approaches on several tasks, suggesting that it should serve as a strong baseline for SSL methods. Previous research (Gururangan et al., 2020) has shown that TAPT can improve performance in fully-supervised settings. Our study goes further by showing that TAPT can be even more effective in SSL settings (§4).

∗ This work was done during an internship at Amazon, Alexa Shopping.
[1] Code is available at [Link] pretraining-or-self-training.
We next study the impact of using different amounts of labelled and unlabelled data for SSL (§5). Our experiments show that ST approaches are prone to suffering from insufficient labelled or unlabelled data, while TAPT is more robust across different combinations of labelled and unlabelled data sizes. Contrary to the common assumption that TAPT requires a large amount of data to perform well (e.g. Li et al., 2021b; Hou et al., 2022), our results show that TAPT improves performance with just a hundred unlabelled samples. We conduct further analysis on the impact of domain shifts in labelled or unlabelled data. While ST approaches generally suffer from domain shifts, TAPT is more robust and even benefits from domain shifts (§6).

In summary, the main contributions of this paper are as follows:

• An extensive empirical study to directly compare five state-of-the-art ST approaches and TAPT across various NLP tasks in SSL, with varying amounts of labelled and unlabelled data as well as the effect of domain shifts;

• Practical insights into the limitations of ST approaches, alongside an exploration of the often-unrecognized yet impressive capacity of TAPT as a simple, stable and powerful SSL learner;

• A fresh perspective for future SSL research, demonstrating that leveraging unsupervised signals from unlabelled texts is a promising and effective alternative to dependence on pseudo labels.

2 Preliminaries

2.1 Task Adaptive Pre-training (TAPT)

LMs are adapted to downstream NLP tasks by fine-tuning (FT) on task-specific data. TAPT introduces a simple additional step before fine-tuning: continuing pre-training with a masked language modelling (MLM) objective (Devlin et al., 2019; Liu et al., 2019) on the task-specific data, without requiring labels. The main advantage of TAPT is that it provides a simple way for the LM to explore the task space while easily making use of all available labelled and unlabelled data. A minimal sketch of this procedure is given below.
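To make this concrete, here is a minimal sketch of TAPT-style continued pre-training using the Hugging Face transformers and datasets libraries. This is not the authors' released implementation; the corpus `task_texts`, the checkpoint path and the hyperparameters are illustrative placeholders (the paper's actual settings are in its appendix tables).

```python
# A hedged sketch of TAPT: continue masked-language-model pre-training of
# RoBERTa on task text (no labels used), then save a checkpoint that is
# later fine-tuned with a classification head.
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

task_texts = ["an unlabelled movie review ...", "another one ..."]  # placeholder corpus

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

dataset = Dataset.from_dict({"text": task_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, as in standard MLM pre-training.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tapt-checkpoint", num_train_epochs=3),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
trainer.save_model("tapt-checkpoint")  # starting point for supervised fine-tuning
```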
2.2 Self-training (ST)

The core idea behind ST approaches is to utilise a teacher model trained on labelled examples to make predictions for unlabelled examples, and train a new student model with these predictions. Formally, let L = {(x_1, y_1), ..., (x_n, y_n)} denote n labelled examples and U = {x̃_1, ..., x̃_m} denote m unlabelled examples, where usually m ≫ n. The ST framework is trained with three main steps, as follows.

Step 1. A teacher model F, parameterized by a neural network Θ, is trained by minimizing the cross-entropy loss ℓ on the labelled examples L:

    L_teacher(L) = Σ_{(x_i, y_i) ∈ L} ℓ(y_i, F(x_i, Θ))    (1)

Step 2. The teacher model F is used to make predictions (referred to as "pseudo-labels") on the unlabelled examples U:

    ỹ_i = F(x̃_i, Θ)    (2)

where ỹ_i can be either the continuous logits or the discrete label induced by an ARGMAX operation.

Step 3. A student model G, parameterized by a fresh neural network Φ, is trained to fit the labelled and pseudo-labelled examples:

    L_student(L, U) = Σ_{(x_i, y_i) ∈ L} ℓ(y_i, G(x_i, Φ)) + Σ_{(x̃_i, ỹ_i) ∈ U} ℓ(ỹ_i, G(x̃_i, Φ))    (3)

This process is repeated a given number of times by treating the student as a new teacher to re-predict pseudo-labels as in Eq. (2) and then training a new student with Eq. (3). In practice, combining ST with techniques such as consistency regularization (Miyato et al., 2018; Clark et al., 2018; Berthelot et al., 2019b), strong data augmentation (Sohn et al., 2020; Xie et al., 2020b,a) and confidence thresholds (Sohn et al., 2020; Zhang et al., 2021; Berthelot et al., 2022) usually leads to substantial improvements in model performance. A self-contained sketch of the basic loop follows.
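The three steps above condense into a short, runnable sketch. This shows only the vanilla ST loop on synthetic data with a toy linear classifier; the model, data and hyperparameters are illustrative, and the five ST methods compared later add consistency regularization, augmentation and confidence thresholding on top of this skeleton.

```python
# Self-contained sketch of the three ST steps (Eqs. 1-3) on synthetic data.
import torch
import torch.nn.functional as F

def fit(model, x, y, steps=200, lr=0.1):
    """Minimise the cross-entropy loss of `model` on (x, y), as in Eqs. (1)/(3)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(model(x), y).backward()
        opt.step()
    return model

n_feat, n_cls = 20, 2
x_lab, y_lab = torch.randn(16, n_feat), torch.randint(n_cls, (16,))  # labelled set L
x_unl = torch.randn(512, n_feat)                                     # unlabelled set U, m >> n

teacher = fit(torch.nn.Linear(n_feat, n_cls), x_lab, y_lab)          # Step 1: train teacher F
for _ in range(3):                                                   # repeat the ST rounds
    with torch.no_grad():
        y_pseudo = teacher(x_unl).argmax(dim=-1)                     # Step 2: hard pseudo-labels (Eq. 2)
    student = torch.nn.Linear(n_feat, n_cls)                         # fresh parameters Phi
    student = fit(student, torch.cat([x_lab, x_unl]),
                  torch.cat([y_lab, y_pseudo]))                      # Step 3: fit L and pseudo-labelled U (Eq. 3)
    teacher = student                                                # the student becomes the next teacher
```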
3 Experimental Setup

Datasets. We experiment with five datasets used in previous related work on SSL (Gururangan et al., 2019; Chen et al., 2020b; Xie et al., 2020a; Li et al., 2021a; Gera et al., 2022): IMDB (Maas et al., 2011), SST-2 (Wang et al., 2018), AG News (Zhang et al., 2015), Amazon Review (McAuley and Leskovec, 2013), and Yahoo! Answer (Chang et al., 2008). Table 1 shows data statistics. We also provide descriptions and examples of the datasets in Appendix §A.1, and the process for quantifying the similarity between datasets in Appendix §A.2. Adhering to previous work (e.g. Chen et al., 2020b; Wang et al., 2022), we sample the same amount of labelled data per class from the train set, given the labelled size, to form the labelled set. We re-sample the labelled data using the same five seeds for all approaches and report the average performance with an error bar; a sketch of this sampling protocol follows Table 1.

Dataset                                      Task Type                  Train Size  Dev. Size  Test Size  |Y|  L
IMDB (Maas et al., 2011)                     Movie Review Sentiment     23,000      2,000      25,000     2    149
SST-2 (Wang et al., 2018)                    Movie Review Sentiment     60,000      7,349      872        2    37
AG News (Zhang et al., 2015)                 News Topic Classification  100,000     10,000     7,600      4    134
Amazon Review (McAuley and Leskovec, 2013)   Product Review Sentiment   250,000     25,000     650,000    5    79
Yahoo! Answer (Chang et al., 2008)           Topic Classification       500,000     50,000     60,000     10   32

Table 1: Statistics of datasets. |Y|: number of classes for classification tasks. L: average number of words in input sentence(s). Note that we only sample examples from the original training set in our experiments.
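As an illustration of the sampling protocol described above, here is a minimal sketch; the function and variable names are ours, not the authors'.

```python
# Hedged sketch of the labelled-set construction: sample an equal number of
# examples per class from the train set, re-drawn with the same seeds for
# every approach so results are comparable across methods.
import random
from collections import defaultdict

def sample_labelled(train_set, n_per_class, seed):
    """train_set: list of (text, label) pairs; returns a class-balanced labelled set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in train_set:
        by_class[label].append((text, label))
    return [example for examples in by_class.values()
            for example in rng.sample(examples, n_per_class)]

# e.g. five labelled sets of 20 examples (10 per class) for a binary task:
# labelled_sets = [sample_labelled(train_set, 10, seed) for seed in range(5)]
```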

TAPT. Our approach to task-adaptive pre-training (TAPT) using RoBERTa-base (Liu et al., 2019) is to further pre-train on the training text corpus, including both labelled and unlabelled data (see Table 12 in Appendix for hyperparameter details). The model is then fine-tuned on the labelled data, where the [CLS] token representation is passed to an extra feed-forward layer for classification (see Table 13 in Appendix for hyperparameter details); a sketch of this step is shown below. The process of TAPT + fine-tuning is simply denoted by TAPT henceforth.
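The fine-tuning stage can be sketched as follows, assuming the standard Hugging Face sequence-classification head (which pools the first-token representation, RoBERTa's equivalent of [CLS], through an extra feed-forward layer). The checkpoint path is the placeholder from the earlier sketch, not a published artifact.

```python
# Hedged sketch of the fine-tuning step: load the TAPT'd encoder and attach a
# freshly initialised classification head over the [CLS]-position representation.
from transformers import AutoModelForSequenceClassification

clf = AutoModelForSequenceClassification.from_pretrained(
    "tapt-checkpoint",   # the continued-pre-training checkpoint from the sketch above
    num_labels=2,        # e.g. binary sentiment for IMDB / SST-2
)
# `clf` is then trained on the labelled set with standard supervised fine-tuning.
```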
ST. We implement five state-of-the-art ST approaches: VAT (Miyato et al., 2018), FixMatch (Sohn et al., 2020), Dash (Xu et al., 2021b), FlexMatch (Zhang et al., 2021), and AdaMatch (Berthelot et al., 2022) (see descriptions of these approaches in Appendix §B). We use RoBERTa-base as the backbone, and the [CLS] token representation with an extra feed-forward layer is used for classification (see Table 14 in Appendix for hyperparameter details). Adhering to previous work (Xie et al., 2020a; Wang et al., 2022), back-translation (Ott et al., 2019) is used for data augmentation.

Baselines. For reference, we also evaluate two baseline models that are only fine-tuned (from an off-the-shelf RoBERTa-base checkpoint) on: (1) the same labelled set as TAPT and ST (Supervised); and (2) the whole training set (Fully-Supervised).

4 ST vs TAPT

Overview. Table 2 shows the performance of TAPT against five state-of-the-art ST approaches and the baselines (Supervised and Fully-Supervised) across five datasets, each with two different sizes of labelled data for training, following Wang et al. (2022). Overall, we observe that: (1) TAPT achieves highly competitive results compared with state-of-the-art ST approaches; and (2) TAPT gains more improvement over the Supervised baselines when using fewer labelled samples.

For our first finding, the experimental results show that TAPT outperforms all five state-of-the-art ST approaches with lower variances on Amazon Review and Yahoo! Answer, as shown in Table 2. For example, on Yahoo! Answer, TAPT obtains an F1 score of 68.8% compared to the best ST approach's 68.0% (using 500 labelled samples), and 71.5% compared to ST's 69.6% (using 2000 labelled samples). For an example of the second finding, TAPT gains a 3.6% F1 improvement over Supervised (using 20 labelled samples) compared to 2.2% (using 100 labelled samples) on IMDB. Below we delve deeper into these two findings and discuss them in more detail.
Method             IMDB                  SST-2                 AG News               Amazon Review         Yahoo! Answer
                   20         100        40         100        40         200        250        1000       500        2000

ST approaches
VAT                90.2±0.9   92.0±0.4   75.0±12.0  86.2±3.4   87.5±1.0   89.5±0.7   52.2±1.3   57.5±0.2   66.9±0.5   68.6±0.2
FixMatch           93.4±0.1   93.4±0.1   37.3±8.5   66.4±21.3  75.6±8.7   88.8±0.6   55.9±1.1   59.0±0.5   67.5±1.0   69.6±0.4
Dash               93.2±0.3   93.4±0.2   38.2±10.1  73.3±18.6  74.3±6.6   88.5±0.6   56.6±1.8   59.3±0.2   67.6±1.0   69.5±0.3
FlexMatch          93.3±0.1   93.4±0.1   40.6±7.7   83.0±8.3   80.6±4.4   88.2±0.5   54.9±3.9   58.8±0.4   66.6±0.7   68.7±0.4
AdaMatch           94.4±0.4   94.7±0.2   42.6±13.3  83.1±4.4   82.7±5.9   88.6±0.4   55.5±2.8   59.0±0.7   68.0±0.7   69.5±0.3

Supervised         83.3±7.4   88.7±0.2   74.7±6.1   84.0±2.7   84.6±1.6   88.0±0.8   53.1±0.7   57.2±0.1   65.4±0.3   68.5±0.3
 + TAPT            86.9±2.8   90.9±0.6   82.6±4.0   85.4±2.4   84.0±1.3   88.7±0.7   58.4±0.7   60.6±0.1   68.8±0.7   71.5±0.3

Fully-Supervised   93.9±0.1              93.0±0.6              94.8±0.1              65.0±0.2              75.3±0.2
 + TAPT            94.0±0.2              93.5±0.3              95.0±0.1              65.6±0.1              75.4±0.1

Table 2: Performance of TAPT, ST approaches and the baselines across five datasets, using two different sizes of training labelled data (column sub-headers). We report average Macro-F1 on the test set across five seeds, with standard deviations after ±. The Fully-Supervised rows report one value per dataset. (In the original, blue and orange highlighting marks the best and second-best performance in each column.)
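For reference, the reporting protocol in Table 2 (mean and standard deviation of test Macro-F1 over five seeds) corresponds to a computation along these lines; `run_experiment` is a hypothetical stand-in that fabricates predictions so the sketch runs end-to-end, not the paper's training code.

```python
# Hedged sketch of the evaluation protocol: average test Macro-F1 (and its
# standard deviation) over five seeds.
import random
import statistics
from sklearn.metrics import f1_score

def run_experiment(seed):
    # Hypothetical stand-in for "train with this seed, predict on the test set".
    rng = random.Random(seed)
    y_true = [rng.randint(0, 1) for _ in range(100)]
    y_pred = [y if rng.random() < 0.9 else 1 - y for y in y_true]
    return y_true, y_pred

scores = [100 * f1_score(*run_experiment(s), average="macro") for s in range(5)]
print(f"{statistics.mean(scores):.1f} ± {statistics.stdev(scores):.1f}")
```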

[Figure 1: five line plots of Macro-F1 (%) against labelled size, one panel per dataset: (a) IMDB, (b) SST-2, (c) AG News, (d) Amazon Review, (e) Yahoo Answers, with curves for AdaMatch, FlexMatch, Supervised and TAPT.]

Figure 1: The effect of labelled size on TAPT and ST. Average test Macro-F1 scores over 5 seeds are reported. From left to right, TAPT and ST utilize 23k, 60k, 100k, 250k, and 500k unlabelled samples respectively.

#1. TAPT is a strong semi-supervised learner and can outperform state-of-the-art ST approaches. Figure 1 shows how the performance of ST, TAPT, and Supervised varies with respect to five different labelled sizes on each dataset, where the two latest ST approaches (AdaMatch and FlexMatch) are selected as representatives for ST. Experimental results further verify that TAPT has a consistent advantage over AdaMatch and FlexMatch across different labelled sizes on Amazon Review and Yahoo! Answer. It is also worth noting that, while TAPT brings a stable improvement over Supervised across all datasets with varying labelled sizes, ST can sometimes bring more substantial improvement, for example when only a few hundred labelled samples are available from IMDB. However, we do not observe similar phenomena for ST on other datasets. Our experimental results demonstrate that TAPT is a simple, effective and strong learner for SSL tasks, and it should serve as a baseline for SSL tasks in NLP.

#2. TAPT tends to bring more improvements in SSL than in the Fully-Supervised setting. We further study the behaviour of TAPT itself under SSL, where we select Supervised as the baseline rather than the ST approaches. Figure 1 shows that the differences in performance (in absolute values) between TAPT (red lines) and Supervised (green lines) generally increase as the labelled size decreases. To gain a better understanding of the impact of labelled data sizes, we plot the improvement from TAPT over Supervised (in percentages) against the ratio between labelled size and unlabelled size (unlabelled size is fixed for each dataset) in Figure 2. We see that TAPT improves over Supervised further as the ratio of labelled to unlabelled size decreases, highlighting the trend of greater improvement in the low-resource SSL setting. This finding is complementary to prior works (e.g. Howard and Ruder, 2018; Gururangan et al., 2020) that focus on TAPT's improvement from the Fully-Supervised perspective, represented by the rightmost red vertical line in Figure 2. The rising trend of the improvement is not monotonic as the labelled size is reduced; rather, it could provide insight into how TAPT improves over Supervised in SSL and inspire the design of new approaches.

[Figure 2: Macro-F1 improvement (%) from TAPT over Supervised, plotted against the ratio of labelled size to unlabelled size for IMDB, SST-2, AG News, Amazon Review and Yahoo Answers.]

Figure 2: The impact of labelled size on the F1 improvement from TAPT over Supervised, where unlabelled size is fixed for each dataset. The red vertical line highlights the Fully-Supervised setting on which prior work (Gururangan et al., 2020) focuses.
[Figure 3: four heatmaps (IMDB, SST-2, Amazon Review, Yahoo Answers) of the Macro-F1 difference between TAPT and ST over grids of labelled size (rows) and unlabelled size (columns).]

Figure 3: Performance difference between TAPT and ST with varying labelled and unlabelled sizes on IMDB, SST-2, Amazon Review and Yahoo! Answer. Positive values indicate that TAPT performs better, while negative values indicate that ST performs better. Average Macro-F1 scores on test sets over five seeds are reported.

5 Exploring the limits of ST and TAPT

In §4, our experimental results showed inconsistent results across datasets. For example, ST performs better on IMDB while TAPT achieves better results on Amazon Review and Yahoo! Answer. We hypothesize that this might be attributed to the exposure to different sizes of labelled or unlabelled data. To verify this hypothesis and shed light on the differences in performance between datasets, we compare TAPT and ST (using AdaMatch and FlexMatch as representatives) by sampling different labelled and unlabelled sizes in IMDB, SST-2, Amazon Review and Yahoo! Answer.

Figure 3 visualizes the differences in performance between TAPT and ST, where each cell represents the Macro-F1 performance difference of TAPT over ST (averaged across five seeds). In each case, the highest performance among FlexMatch and AdaMatch is selected to represent the performance of ST. Overall, we observe that: (1) TAPT improves the fine-tuning performance even with a few hundred unlabelled examples; and (2) TAPT performs more stably across the different labelled and unlabelled data sizes than ST approaches. Below we provide a comprehensive analysis of the impact of labelled and unlabelled sizes.

#1. TAPT works even with a few hundred unlabelled samples. It is generally assumed that TAPT requires a large amount of unlabelled data to perform well (e.g. Li et al., 2021b; Hou et al., 2022). However, we surprisingly observe that TAPT can bring substantial improvement over the Supervised baseline even with a relatively small number of unlabelled samples, as shown in Figure 5. To explore the effectiveness of TAPT over Supervised in the low-resource setting of unlabelled data, we select the performance of TAPT and Supervised from the first column (the lowest unlabelled size) for each dataset in Figure 3 and plot their average performance over different labelled sizes. Figure 4 shows that TAPT improves over the Supervised baseline with just one hundred or one thousand samples. For instance, TAPT achieves a 5.5% increase in F1 score compared to the Supervised baseline when using only 1k unlabelled samples on Yahoo! Answer. Additionally, this performance is achieved without the need for large numbers of tokens in each sample, as training samples from SST-2 contain on average only 9 tokens and training samples from Yahoo! Answer contain about 32 tokens (see examples in Table 6 of Appendix).

[Figure 4: bar chart "TAPT in Low-Resource Setting" comparing Supervised and TAPT average Macro-F1 (%) on IMDB, SST-2, Amazon Review and Yahoo Answers.]

Figure 4: The performance of TAPT against the Supervised baseline in the low-resource setting of unlabelled data. From left to right, TAPT utilizes 100, 100, 1,000, and 1,000 unlabelled samples respectively.

#2. Scarce labelled data and adequate unlabelled data. TAPT appears to be a more favourable choice than ST approaches in this setting. The bottom of each sub-figure in Figure 3 shows a clear labelled size boundary, below which FlexMatch and AdaMatch are outperformed by TAPT with a large margin, regardless of the datasets and unlabelled sizes used. This suggests that ST might not be able to efficiently handle large amounts of unlabelled data if the labelled data do not provide adequate information. This might be attributed to confirmation bias (Tarvainen and Valpola, 2017; Arazo et al., 2020), which results from the accumulation of errors in the iterative ST process caused by incorrect pseudo-labels.
The specific value of the adequate labelled size boundary for ST approaches depends on the nature of the dataset. For example, even though both IMDB and SST-2 are binary classification tasks for movie review sentiment analysis, the labelled size boundary for SST-2 is higher (40 > 4), indicating that this boundary tends to increase as the task becomes more challenging. While it may be easy to obtain dozens of labelled examples in this case, when the task becomes more intricate or contains noisy weak labels, it is important to be aware of this potential issue with ST approaches. TAPT could serve as an alternative in situations where collecting adequate labelled data for training is costly. We provide the specific performance values of ST and TAPT, and further verify that this finding applies to other ST approaches, in Appendix §D.

[Figure 5: two panels (Amazon Review, Yahoo Answers) of Macro-F1 (%) against unlabelled size, with curves for ST and TAPT at labelled sizes 100, 500, 2k, 10k and 50k, plus horizontal lines for the corresponding Supervised and Fully-Supervised baselines.]

Figure 5: The performance of ST and TAPT using different unlabelled sizes. Average test results across five seeds are reported, where the best result from FlexMatch and AdaMatch is selected to represent ST.

#Unl.        10         50        100        500
FlexMatch    57.3±17.9  35.2±3.4  45.1±22.5  33.4±0.1
AdaMatch     53.3±22.1  36.8±6.1  33.5±0.2   33.6±0.3

Table 3: Test results on IMDB with 4 fixed labelled samples. Average Macro-F1 scores over five seeds are reported.
#3. Adequate labelled data and scarce unlabelled data. In this setting, TAPT is more robust, while ST has a greater chance of performing worse than the Supervised baseline. In Figure 5, we plot the performance of ST approaches and TAPT against five different sizes of unlabelled data, grouped by labelled size (using similar colours). We note that ST approaches perform worse than their corresponding Supervised baselines (represented by horizontal lines) until a certain amount of unlabelled data has been reached. For example, when the labelled size is 500, ST requires about 20k unlabelled samples to achieve the corresponding Supervised baseline performance on Yahoo! Answer. On the other hand, TAPT generally outperforms the Supervised baselines, demonstrating its robustness across various unlabelled sizes.

To further quantify model performance in the case of scarce unlabelled and adequate labelled data, we choose the three lowest unlabelled sizes (the first three columns), excluding the lowest labelled size (the last row), in Figure 3 for each dataset. Our analysis shows that ST has a 67%, 56% and 54% probability of falling below the Supervised baselines on SST-2, Amazon Review, and Yahoo! Answer respectively. Even on IMDB, where ST generally performs well, it still has a 33% probability of falling behind Supervised. In contrast, TAPT never performs worse than Supervised in those cases. We provide computation details and comparative statistics in Appendix §C.

The specific value of the adequate unlabelled size boundary for ST approaches depends on the nature of the dataset as well as the labelled size. Figure 5 illustrates that as the size of the labelled data increases, ST approaches require more unlabelled data to surpass the Supervised baselines. For example, on Amazon Review, ST trained with 100 labelled samples requires about 5k unlabelled samples to perform better than Supervised, while ST trained with 10k labelled samples requires about 100k unlabelled samples. Adjusting the unlabelled size accordingly might be conducive to exploiting the full potential of ST approaches.
[Figure 6: four bar-chart panels (Target Domain: IMDB | Labelled Size: 100; Target Domain: IMDB | Labelled Size: 200; Target Domain: SST-2 | Labelled Size: 100; Target Domain: SST-2 | Labelled Size: 200) comparing FlexMatch, AdaMatch, TAPT and Supervised Macro-F1 (%) with and without domain shift.]

Figure 6: Results of UDA experiments. Legends indicate the domains of the labelled training data. Orange/green represents the performance with/without domain shift. Average Macro-F1 scores on test sets over five seeds are reported.

Train (Lab.)  Train (Unl.)    #Lab.  FlexMatch           AdaMatch            TAPT                Supervised
IMDB          IMDB            100    93.4±0.1            94.7±0.2            90.9±0.6            88.7±0.2   (*)
IMDB          SST-2           100    89.1±1.2 (↓4.6%)    87.6±2.2 (↓7.5%)    89.9±0.6 (↓1.1%)    88.7±0.2
IMDB          Amazon Review   100    92.1±0.7 (↓1.4%)    92.4±0.2 (↓2.4%)    91.4±0.3 (↑0.6%)    88.7±0.2
IMDB          IMDB            200    93.5±0.1            93.6±0.1            91.8±0.3            90.3±0.4   (*)
IMDB          SST-2           200    89.5±2.4 (↓4.3%)    88.9±1.0 (↓5.0%)    90.3±0.4 (↓1.6%)    90.3±0.4
IMDB          Amazon Review   200    92.5±0.4 (↓1.1%)    92.7±0.5 (↓1.0%)    92.1±0.2 (↑0.3%)    90.3±0.4
SST-2         SST-2           100    83.0±8.3            83.1±4.4            85.4±2.4            84.0±2.7   (*)
SST-2         IMDB            100    46.7±2.1 (↓43.7%)   49.2±7.3 (↓40.8%)   88.5±0.9 (↑3.6%)    84.0±2.7
SST-2         Amazon Review   100    46.4±4.9 (↓44.1%)   48.2±11.0 (↓42.0%)  88.9±0.9 (↑4.1%)    84.0±2.7
SST-2         SST-2           200    87.2±3.9            89.5±0.9            88.6±0.9            86.8±0.3   (*)
SST-2         IMDB            200    62.7±7.4 (↓28.1%)   61.0±2.8 (↓31.8%)   89.1±1.1 (↑0.6%)    86.8±0.3
SST-2         Amazon Review   200    61.8±7.7 (↓29.1%)   56.0±10.3 (↓37.4%)  89.4±1.0 (↑0.9%)    86.8±0.3

Table 4: Results of STL experiments. We report the average Macro-F1 score on the test set across five seeds, with standard deviations after ±. Stars (*) mark rows without domain shifts. Arrows give the relative change in performance against the starred row within each column. (In the original, blue marks the best result in each row.)

#4. Scarce labelled and unlabelled data. When the labelled data is insufficient, increasing the unlabelled size is not helpful or even detrimental to ST approaches. This finding is well-illustrated in the last row of results on SST-2 shown in Figure 3. In other words, reducing the size of the unlabelled data could be beneficial for ST approaches when the labelled size is inadequate. We further zoom in on this phenomenon in Table 3 by selecting 4 fixed labelled and 500 unlabelled samples, and gradually removing unlabelled samples on IMDB. This is a stark contrast to the case where more unlabelled data is beneficial for ST approaches when adequate labelled data is available. Meanwhile, TAPT generally benefits from training on more in-domain unlabelled data, following the scaling laws of LMs (Kaplan et al., 2020; Hoffmann et al., 2022).

#5. Adequate labelled and unlabelled data. Both ST and TAPT have demonstrated the ability to exploit unlabelled data in this setting. Figure 3 shows that ST dominates on IMDB when more than 10 labelled and 100 unlabelled samples are available. On the other hand, TAPT generally performs better than ST on Amazon Review and Yahoo! Answer, indicating that the answer to which approach is better depends on the nature of the dataset and task. As the labelled and unlabelled data sizes increase, the difference between ST and TAPT shrinks (colours fade and lines converge in Figures 3 and 5). As the labelled data size reaches the unlabelled data size, ST reduces to Fully-Supervised, which is generally outperformed by TAPT (Gururangan et al., 2020).

Task                             Lab.     Unl.
Semi-supervised Learning         Target   Target
Unsupervised Domain Adaptation   Source   Target
Self-taught Learning             Target   Source

Table 5: A summary of the domain adaptation settings, where the distributions of the source and target domains are different.

6 Domain Adaptation

We next investigate how ST and TAPT compare in the presence of domain shifts between labelled and unlabelled data in two additional settings (refer to Table 5). First, we experiment with the Unsupervised Domain Adaptation (UDA) setting, where domain shifts exist between the labelled data from a source domain and the unlabelled data from the target domain (Ben-David et al., 2010; Saito et al., 2018; Ramponi and Plank, 2020).
Then, we experiment with Self-taught Learning (STL) (Raina et al., 2007) in a domain adaptation setting, where the unlabelled data come from the source domain and the labelled data from the target domain. In both settings, we use the (labelled) validation and test sets from the target domain. Validation and test sets are excluded from any pool of labelled or unlabelled training data.

#1. Unsupervised Domain Adaptation (UDA). In this setting, we use the two movie sentiment datasets, IMDB and SST-2, as the source and target domain (and vice versa), with two different sizes of labelled data (i.e. 100 and 200).

Figure 6 depicts the performance of ST and TAPT in UDA. In case of domain shifts, we observe that FlexMatch and AdaMatch fail to deliver satisfactory results and their performance drops to the level of random guessing, with an F1 score of 33% across all labelled sizes and datasets. This highlights the vulnerability of ST approaches in UDA. In contrast, TAPT demonstrates robust performance even with domain shifts, on par with its own SSL performance without domain shifts. Additionally, TAPT even benefits from training on the source domain. For instance, training on IMDB (source domain) further improves the performance of TAPT on SST-2 (target domain) from 86.4% to 89.6% with 100 labelled samples, and from 88.6% to 89.7% with 200 labelled samples.

#2. Self-taught Learning (STL). We select IMDB, SST-2, and Amazon Review for this setting. Although they are all sentiment review datasets, IMDB and Amazon Review are more closely related (see the similarity analysis in Table 7 of Appendix) and arguably contain richer language than SST-2 (see examples in Table 6 of Appendix).

Table 4 presents the performance of ST and TAPT in the STL setting. We find that domain shifts in the unlabelled data consistently hurt the performance of ST, depending on the similarity between the source and target domains. The performance of ST drops sharply if the source and target domains are vastly different. For example, when SST-2 is used as the labelled data (target domain) and IMDB or Amazon Review is used as unlabelled data (source domain), the performance of ST falls from over 80% to around 60% or lower. On the other hand, when using SST-2 and IMDB as the source and target domains, the performance of ST drops by a much smaller margin (a few percentage points). This shows the importance of training ST approaches with more informative labelled data, which is also consistent with our findings in §5.

TAPT in the STL setting is in fact a variation of domain-adaptive pre-training (Beltagy et al., 2019; Gururangan et al., 2020) applied to SSL tasks. Table 4 shows that the performance of TAPT remains stable when there exist domain shifts in the unlabelled data. Using more informative unlabelled data can further improve the performance of TAPT. For example, using IMDB or Amazon Review as unlabelled data when SST-2 is the target task, we see an improvement of about 4% with 100 labelled samples. However, it is worth noting that ST methods can still be competitive compared to TAPT if the source and target domains are relatively similar. For instance, when using Amazon Review and IMDB as the source and target domains, ST still achieves better results than TAPT.

7 Related Work

Leveraging unlabelled data by continuing pre-training. Previous work has shown that further pre-training LMs on the unlabelled data of a task (e.g. Alsentzer et al., 2019; Mehri et al., 2020; Margatina et al., 2022) or on in-domain data (e.g. Logeswaran et al., 2019; Gururangan et al., 2020; Xue et al., 2021) is beneficial to downstream tasks. However, it was unknown whether this holds in SSL settings. Previous studies in computer vision (Zoph et al., 2020) and speech recognition (Xu et al., 2021a) have compared PT and ST. However, our study has a different focus; specifically, we compare TAPT and ST on NLP tasks. Concurrently to our work, Shi and Lipani (2023) put forward prompt-based continued pre-training, which primarily aims to enhance the performance of prompt-based fine-tuning techniques (Schick and Schütze, 2021; Gao et al., 2021). That approach outperforms these state-of-the-art ST approaches (Sohn et al., 2020; Xu et al., 2021b; Zhang et al., 2021; Berthelot et al., 2022) as well as conventional CLS-based fine-tuning with TAPT.

Semi-supervised learning. Recent work in SSL has demonstrated great progress in effectively exploiting unlabelled data. A wide range of approaches has been proposed, including Pseudo-Labeling (Lee et al., 2013), Temporal Ensembling (Laine and Aila, 2017), Mean Teacher (Tarvainen and Valpola, 2017), Virtual Adversarial Training (Miyato et al., 2018), and FixMatch (Sohn et al., 2020).
A major issue for ST approaches is confirmation bias, where the student model accumulates errors from the teacher model when learning from inaccurate pseudo-labels (e.g. Wang et al., 2021; Goel et al., 2022; Chen et al., 2022).

While many efforts towards ST have been made in NLP (e.g. Ruder and Plank, 2018; Gururangan et al., 2019; Li et al., 2019; Chen et al., 2020b; Meng et al., 2020; Chen et al., 2020a; He et al., 2020; Gera et al., 2022), the performance of ST approaches across various labelled and unlabelled sizes has yet to be thoroughly explored. Although Mukherjee and Awadallah (2020) and Li et al. (2021b) noted that training ST approaches from TAPT checkpoints can improve performance, the performance of TAPT in SSL tasks has neither been well-researched by previous works nor compared with state-of-the-art ST approaches.

8 Conclusion

In this work, we shed light on how TAPT performs against state-of-the-art ST approaches in various SSL settings. Our experiments reveal that TAPT achieves strong and robust performance, even with just a few hundred unlabelled examples. We further demonstrate that ST approaches are vulnerable to small amounts of either labelled or unlabelled data. We also find that TAPT is more robust than ST approaches in joint domain adaptation and SSL settings. Overall, our empirical study demonstrates that TAPT is a strong SSL learner, competitive with more sophisticated ST approaches. In future work, we plan to further explore the potential of TAPT with unsupervised learning signals.

Limitations

For easier comparison with previous work, we only focus on text classification tasks, while ST can also be applied to a variety of NLP tasks, such as language generation, conversational systems and commonsense reasoning (Kedzie and McKeown, 2019; He et al., 2020; Shi et al., 2022a,b; Hendriksen et al., 2022). We also assume that the datasets are roughly balanced. However, real-world datasets are usually class-imbalanced (Li et al., 2011), which might impact the performance of TAPT and ST. While this is out of the scope of this paper, we believe that it is an interesting avenue for future work. Additionally, different labelled and unlabelled sizes may impact the performance of ST approaches in the domain shift setting. However, this does not alter our conclusion that the effectiveness of ST approaches significantly fluctuates across different scenarios.

References

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Eric Arazo, Diego Ortego, Paul Albert, Noel E O'Connor, and Kevin McGuinness. 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE.

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, Melbourne, Australia. Association for Computational Linguistics.

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3615–3620, Hong Kong, China. Association for Computational Linguistics.

Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine Learning, 79(1):151–175.

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 41–48, New York, NY, USA. Association for Computing Machinery.

David Berthelot, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. 2019a. ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785.

David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019b. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5050–5060.
David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, and Alexey Kurakin. 2022. AdaMatch: A unified approach to semi-supervised learning and domain adaptation. In International Conference on Learning Representations.

Rui Cai and Mirella Lapata. 2019. Semi-supervised semantic role labeling with cross-view training. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1018–1027, Hong Kong, China. Association for Computational Linguistics.

Ming-Wei Chang, Lev Ratinov, Dan Roth, and Vivek Srikumar. 2008. Importance of semantic representation: Dataless classification. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, pages 830–835.

Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. 2009. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542.

Baixu Chen, Junguang Jiang, Ximei Wang, Pengfei Wan, Jianmin Wang, and Mingsheng Long. 2022. Debiased self-training for semi-supervised learning. In Advances in Neural Information Processing Systems, NIPS'22.

Jiaao Chen, Zhenghui Wang, Ran Tian, Zichao Yang, and Diyi Yang. 2020a. Local additivity based data augmentation for semi-supervised NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1241–1251, Online. Association for Computational Linguistics.

Jiaao Chen, Zichao Yang, and Diyi Yang. 2020b. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157, Online. Association for Computational Linguistics.

Kevin Clark, Minh-Thang Luong, Christopher D. Manning, and Quoc Le. 2018. Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1914–1925, Brussels, Belgium. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics.

Xin Dong and Gerard de Melo. 2019. A robust self-learning framework for cross-lingual text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6306–6310, Hong Kong, China. Association for Computational Linguistics.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.

Ariel Gera, Alon Halfon, Eyal Shnarch, Yotam Perlitz, Liat Ein-Dor, and Noam Slonim. 2022. Zero-shot text classification with self-training. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.

Arushi Goel, Yunlong Jiao, and Jordan Massiah. 2022. PARS: Pseudo-label aware robust sample selection for learning with noisy labels. arXiv preprint arXiv:2201.10836.

Yves Grandvalet and Yoshua Bengio. 2004. Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems, 17.

Suchin Gururangan, Tam Dang, Dallas Card, and Noah A. Smith. 2019. Variational pretraining for semi-supervised text classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5880–5894, Florence, Italy. Association for Computational Linguistics.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8342–8360, Online. Association for Computational Linguistics.

Junxian He, Jiatao Gu, Jiajun Shen, and Marc'Aurelio Ranzato. 2020. Revisiting self-training for neural sequence generation. In International Conference on Learning Representations.

Mariya Hendriksen, Maurits Bleeker, Svitlana Vakulenko, Nanne van Noord, Ernst Kuiper, and Maarten de Rijke. 2022. Extending CLIP for category-to-image retrieval in e-commerce. In Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part I, pages 289–303, Berlin, Heidelberg. Springer-Verlag.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Zejiang Hou, Julian Salazar, and George Polovets. 2022. Meta-learning the difference: Preparing large language models for efficient adaptation. Transactions of the Association for Computational Linguistics, 10:1249–1265.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Chris Kedzie and Kathleen McKeown. 2019. A good sample is hard to find: Noise injection sampling and self-training for neural language generation models. In Proceedings of the 12th International Conference on Natural Language Generation, pages 584–593, Tokyo, Japan. Association for Computational Linguistics.

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. [Link].

Dong-Hyun Lee et al. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, page 896.

Changchun Li, Ximing Li, and Jihong Ouyang. 2021a. Semi-supervised text classification with balanced deep representation distributions. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5044–5053, Online. Association for Computational Linguistics.

Shiyang Li, Semih Yavuz, Wenhu Chen, and Xifeng Yan. 2021b. Task-adaptive pre-training and self-training are complementary for natural language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1006–1015, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Shoushan Li, Zhongqing Wang, Guodong Zhou, and Sophia Yat Mei Lee. 2011. Semi-supervised learning for imbalanced sentiment classification. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three, IJCAI'11, pages 1826–1831. AAAI Press.

Zhenghua Li, Xue Peng, Min Zhang, Rui Wang, and Luo Si. 2019. Semi-supervised domain adaptation for dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2386–2395, Florence, Italy. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, and Honglak Lee. 2019. Zero-shot entity linking by reading entity descriptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3449–3460, Florence, Italy. Association for Computational Linguistics.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, USA. Association for Computational Linguistics.

Katerina Margatina, Loic Barrault, and Nikolaos Aletras. 2022. On the importance of effectively adapting pretrained language models for active learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 825–836, Dublin, Ireland. Association for Computational Linguistics.

Julian McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pages 165–172, New York, NY, USA. Association for Computing Machinery.

David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159, New York City, USA. Association for Computational Linguistics.

Shikib Mehri, Mihail Eric, and Dilek Z. Hakkani-Tür. 2020. DialoGLUE: A natural language understanding benchmark for task-oriented dialogue. ArXiv, abs/2009.13570.

Yu Meng, Yunyi Zhang, Jiaxin Huang, Chenyan Xiong, Heng Ji, Chao Zhang, and Jiawei Han. 2020. Text classification using label names only: A language model self-training approach. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9006–9017, Online. Association for Computational Linguistics.

Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2018. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:1979–1993.

Subhabrata Mukherjee and Ahmed Awadallah. 2020. Uncertainty-aware self-training for few-shot text classification. In Advances in Neural Information Processing Systems, volume 33, pages 21199–21212. Curran Associates, Inc.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

Rajat Raina, Alexis Battle, Honglak Lee, Benjamin Packer, and Andrew Y. Ng. 2007. Self-taught learning: Transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 759–766, New York, NY, USA. Association for Computing Machinery.

Alan Ramponi and Barbara Plank. 2020. Neural unsupervised domain adaptation in NLP: A survey. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6838–6855, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Sebastian Ruder and Barbara Plank. 2018. Strong baselines for neural semi-supervised learning under domain shift. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1044–1054, Melbourne, Australia. Association for Computational Linguistics.

Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732.

Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Zhengxiang Shi, Yue Feng, and Aldo Lipani. 2022a. Learning to execute actions or ask clarification questions. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2060–2070, Seattle, United States. Association for Computational Linguistics.

Zhengxiang Shi and Aldo Lipani. 2023. Don't stop pretraining? Make prompt-based fine-tuning powerful learner. arXiv preprint arXiv:2305.01711.

Zhengxiang Shi, Qiang Zhang, and Aldo Lipani. 2022b. StepGame: A new benchmark for robust multi-hop spatial reasoning in texts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11321–11329.

Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 1195–1204, Red Hook, NY, USA. Curran Associates Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Ximei Wang, Jinghan Gao, Mingsheng Long, and Jianmin Wang. 2021. Self-tuning for data-efficient deep learning. In International Conference on Machine Learning (ICML).

Yidong Wang, Hao Chen, Yue Fan, Wang SUN, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, Heli Qi, Zhen Wu, Yu-Feng Li, Satoshi Nakamura, Wei Ye, Marios Savvides, Bhiksha Raj, Takahiro Shinozaki, Bernt Schiele, Jindong Wang, Xing Xie, and Yue Zhang. 2022. USB: A unified semi-supervised learning benchmark for classification. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. 2020a. Unsupervised data augmentation for consistency training. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. 2020b. Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698.

Qiantong Xu, Alexei Baevski, Tatiana Likhomanenko, Paden Tomasello, Alexis Conneau, Ronan Collobert, Gabriel Synnaeve, and Michael Auli. 2021a. Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3030–3034. IEEE.

Yi Xu, Lei Shang, Jinxing Ye, Qi Qian, Yu-Feng Li, Baigui Sun, Hao Li, and Rong Jin. 2021b. Dash: Semi-supervised learning with dynamic thresholding. In International Conference on Machine Learning, pages 11525–11536. PMLR.

Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 483–498, Online. Association for Computational Linguistics.

David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA. Association for Computational Linguistics.

Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. 2021. FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling. In Proceedings of the 35th International Conference on Neural Information Processing Systems, volume 34.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 649–657, Cambridge, MA, USA. MIT Press.

Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D. Cubuk, and Quoc V. Le. 2020. Rethinking pre-training and self-training. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.
Appendix Overview

The appendix is structured as follows:

Appendix §A provides a brief description and example for each dataset (subsection §A.1). Additionally, a similarity analysis among datasets and an illustration of overlaps between IMDB and AMAZON REVIEW are included (subsection §A.2).

Appendix §B presents a brief description of state-of-the-art ST approaches.

Appendix §C includes a supplementary table (Table 8) that examines the effect of low unlabelled data sizes.

Appendix §D presents additional experiments to verify our findings using other ST approaches.

Appendix §E includes additional experiments to train ST approaches using TAPT checkpoints.

Appendix §F provides implementation details and hyperparameters for the TAPT, ST, and FT methods used in our experiments.

A Datasets

In this section, we briefly introduce the datasets used in our work and provide additional analysis of the similarity among them. Specifically, we provide four examples to demonstrate the overlap between IMDB and AMAZON REVIEW, as a supplement to our domain adaptation analysis (§6).

A.1 Description

In this section, we briefly introduce the IMDB, SST-2, AG NEWS, AMAZON REVIEW, and YAHOO! ANSWER datasets. Table 6 lists examples for each dataset.

IMDB. The IMDB dataset (Maas et al., 2011) contains a collection of 50 000 reviews from the Internet Movie Database, with no more than 30 reviews per movie. The dataset contains an equal number of positive and negative reviews, so a model that always predicts the same class achieves a 33.3% Macro-F1 score. There are 25 000 reviews for training and 25 000 for testing. We follow Wang et al. (2022) to split the dataset by selecting 12 500 samples and 1 000 samples per class from the train set to form a train and validation set, respectively.

SST-2. The SST-2 dataset (Wang et al., 2018) consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. Similar to IMDB, this is a binary classification task. There are 67 349 examples for training and 872 for testing. We select 60 000 and 7 349 samples from the train set to form a train and validation set, respectively, where the validation set contains 3 675 and 3 674 samples for the two classes, respectively.

AG NEWS. The AG NEWS topic classification dataset is constructed by Zhang et al. (2015), where 4 classes are used. Each class contains 30 000 training samples and 1 900 test samples. We follow Wang et al. (2022) to split the dataset by selecting 25 000 samples and 2 500 samples per class from the train set to form a train and validation set, respectively.

AMAZON REVIEW. The AMAZON REVIEW dataset (McAuley and Leskovec, 2013) is a sentiment classification dataset with five classes. There are 600 000 train samples and 130 000 test samples per class. We follow Wang et al. (2022) to split the dataset by selecting 50 000 samples and 5 000 samples per class from the train set to form a train and validation set, respectively.

YAHOO! ANSWER. The YAHOO! ANSWER dataset (Chang et al., 2008) is a topic classification dataset with ten classes. There are 140 000 train samples and 6 000 test samples per class. We follow Wang et al. (2022) to split the dataset by selecting 50 000 samples and 5 000 samples per class from the train set to form a train and validation set, respectively.
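The per-class sampling protocol above can be illustrated with a short sketch. This is illustrative only, not the released splitting code: the function name and the use of Python's random module are our own choices.

    import random
    from collections import defaultdict

    def split_per_class(examples, n_train_per_class, n_val_per_class, seed=1):
        """Sample a train and validation set with a fixed number of examples
        per class, following the splitting protocol of Wang et al. (2022)."""
        by_class = defaultdict(list)
        for text, label in examples:
            by_class[label].append((text, label))

        rng = random.Random(seed)
        train, val = [], []
        for label, items in by_class.items():
            rng.shuffle(items)
            train.extend(items[:n_train_per_class])
            val.extend(items[n_train_per_class:n_train_per_class + n_val_per_class])
        return train, val

    # Example: an AG NEWS-style split (25 000 train / 2 500 validation per class),
    # assuming `ag_news_examples` is a list of (text, label) pairs.
    # train_set, val_set = split_per_class(ag_news_examples, 25_000, 2_500, seed=1)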
A.2 Dataset Similarity

We provide an analysis of the vocabulary overlap of the datasets, as shown in Figure 7. Additionally, in Table 7, we provide some examples to illustrate the overlap between IMDB and AMAZON REVIEW.

As shown in Table 6, although both the SST-2 and IMDB datasets are sentiment analysis tasks over movie reviews, the SST-2 dataset contains shorter and vaguer sentences than the IMDB dataset. This difference could be a potential reason for the poor performance of ST approaches in the UDA setting (§6). In contrast, the AMAZON REVIEW dataset, a product review sentiment analysis dataset, is more similar to the IMDB dataset than the SST-2 dataset is, as shown in Table 7. This suggests a potential reason for the performance of ST and TAPT in the STL setting (§6).
                    IMDB    SST-2   AG News   Amazon Review   Yahoo! Answer
    IMDB            100.0    33.4      36.3            33.9            29.8
    SST-2            33.4   100.0      21.6            15.4            13.8
    AG News          36.3    21.6     100.0            28.0            29.9
    Amazon Review    33.9    15.4      28.0           100.0            40.2
    Yahoo! Answer    29.8    13.8      29.9            40.2           100.0

Figure 7: Vocabulary overlap (%) across datasets.
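The paper does not spell out the exact overlap measure behind Figure 7, so the following sketch shows one plausible computation: Jaccard overlap of the top-k most frequent word types under whitespace tokenization. The tokenization, the top_k value, and the function name are our assumptions, not the released implementation.

    from collections import Counter

    def vocab_overlap(texts_a, texts_b, top_k=10_000):
        """Overlap (%) between the top_k most frequent word types of two corpora."""
        def top_vocab(texts):
            counts = Counter(tok for t in texts for tok in t.lower().split())
            return {tok for tok, _ in counts.most_common(top_k)}

        vocab_a, vocab_b = top_vocab(texts_a), top_vocab(texts_b)
        # Intersection over union is symmetric and yields 100.0 on the diagonal,
        # consistent with the matrix shown in Figure 7.
        return 100.0 * len(vocab_a & vocab_b) / len(vocab_a | vocab_b)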

B ST Frameworks

VAT. VAT (Miyato et al., 2018) proposes a regularization technique that forces pairs of data points that are very close in the input space to be close to each other in the output space. VAT adds a small perturbation to the input data and forces the model to produce similar predictions.

FIXMATCH. FIXMATCH (Sohn et al., 2020) generates artificial labels using both consistency regularization and pseudo-labelling, where the artificial labels are produced based on weakly-augmented unlabelled data. These artificial labels are then used as targets to train the model on strongly-augmented unlabelled data. FIXMATCH only retains an artificial label if the model assigns a high probability to one of the possible classes.

DASH. DASH (Xu et al., 2021b) extends FIXMATCH by introducing a mechanism that dynamically adjusts a loss threshold to select a subset of training examples from the unlabelled data for performing SSL.

FLEXMATCH. FLEXMATCH (Zhang et al., 2021) also extends FIXMATCH, introducing the concept of curriculum learning (Bengio et al., 2009) to flexibly adjust the thresholds for different classes at each time step and to select the unlabelled data, and their pseudo labels, that are more likely to be informative.

ADAMATCH. ADAMATCH (Berthelot et al., 2022) aims to solve domain adaptation problems in SSL and to build a high-accuracy model that is trained and tested on different data distributions. ADAMATCH builds on FIXMATCH and introduces a relative confidence threshold and a modified distribution alignment from Berthelot et al. (2019a).
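The confidence-thresholded pseudo-labelling step that FIXMATCH and its extensions build on can be summarised in a few lines. The following PyTorch sketch is for exposition only: the function name is ours, and the fixed threshold of 0.95 is FIXMATCH's usual default rather than a setting taken from our experiments.

    import torch
    import torch.nn.functional as F

    def fixmatch_unlabelled_loss(logits_weak, logits_strong, threshold=0.95):
        """FixMatch-style loss on unlabelled data: pseudo-label from the
        weakly-augmented view, train against the strongly-augmented view,
        keeping only predictions above a confidence threshold."""
        with torch.no_grad():
            probs = torch.softmax(logits_weak, dim=-1)
            confidence, pseudo_labels = probs.max(dim=-1)
            # FlexMatch replaces this fixed threshold with per-class thresholds
            # adjusted by a curriculum; AdaMatch makes it relative to the
            # model's confidence on labelled data.
            mask = (confidence >= threshold).float()
        per_example = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
        return (per_example * mask).mean()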
C Probability of performing worse than SUPERVISED

In §5, we discuss that we select the model performance with the three lowest unlabelled sizes (the first three columns in Figure 3) for each dataset and exclude the model performance with the lowest labelled size (the last row in Figure 3). This results in 9 cells for IMDB, 3 cells for SST-2, 9 cells for AMAZON REVIEW, and 12 cells for YAHOO! ANSWER, where TAPT has one run per cell and ST (FLEXMATCH and ADAMATCH) has two runs per cell. We consider a run to be a failure if its performance is worse than that of its corresponding SUPERVISED baseline.

Table 8 lists the probability of ST and TAPT falling below the SUPERVISED baseline for the selected combinations of labelled and unlabelled sizes.
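The failure probabilities in Table 8 reduce to simple counting; an illustrative sketch follows, where the data structures are our own rather than those of the released code.

    def failure_rate(runs, baseline):
        """Fraction of runs whose Macro-F1 falls below the SUPERVISED baseline
        for the same (unlabelled size, labelled size) combination."""
        failures = sum(score < baseline[(unl, lab)] for (unl, lab), score in runs)
        return failures / len(runs)

    # e.g. runs = [((100, 10), 45.4), ((100, 20), 64.6), ...]
    #      baseline = {(100, 10): 71.8, (100, 20): 84.1, ...}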
D Further validations with other ST approaches

In this section, we conduct additional experiments on ST approaches, including VAT, DASH, and FIXMATCH, to demonstrate that our findings apply to other ST approaches as well.

In Table 9, we select several combinations of labelled and unlabelled sizes on the IMDB, SST-2, AMAZON REVIEW, and YAHOO! ANSWER datasets. Our experimental results show that these ST approaches also do not perform well when the labelled size is low, and that they have a high probability of performing worse than the SUPERVISED baselines when the unlabelled size is low. This suggests that poor performance when the labelled or unlabelled size is inadequate may be a common problem of state-of-the-art ST approaches.

E Train ST approaches with TAPT checkpoints

Previous works (Mukherjee and Awadallah, 2020; Li et al., 2021b) have suggested that training ST approaches from a TAPT checkpoint may be beneficial. Here we provide additional experiments that train ST approaches from TAPT checkpoints to further corroborate our findings.

Table 10 shows that TAPT outperforms both ADAMATCH+TAPT and FLEXMATCH+TAPT with two different labelled sizes on the YAHOO! ANSWER dataset.

Table 11 shows that training ST approaches from TAPT checkpoints can improve the performance of ST but cannot solve the issues that arise when labelled or unlabelled data is inadequate. Specifically, the performance of ST+TAPT is still poor when labelled data is insufficient, as discussed in §5. Meanwhile, in Table 11, the performance of ST+TAPT can fall below the SUPERVISED baselines when unlabelled data is inadequate, while TAPT consistently outperforms the SUPERVISED baselines. When the labelled size is 10, ST trained with fewer unlabelled samples tends to perform better, indicating that reducing the number of unlabelled examples can be helpful, as discussed in §5.
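Concretely, the ST+TAPT variants initialize the self-training backbone from a TAPT checkpoint instead of the original pre-trained weights. A minimal HuggingFace-style sketch follows; the checkpoint path is a placeholder, not a path from the released code.

    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Placeholder path to a checkpoint produced by the TAPT phase.
    tapt_checkpoint = "checkpoints/roberta-base-tapt-yahoo"

    tokenizer = AutoTokenizer.from_pretrained(tapt_checkpoint)
    # A fresh classification head is initialized on top of the TAPT encoder;
    # this model is then trained with an ST method (e.g. FlexMatch or AdaMatch)
    # instead of plain fine-tuning.
    model = AutoModelForSequenceClassification.from_pretrained(
        tapt_checkpoint, num_labels=10  # ten classes for YAHOO! ANSWER
    )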
F Implementation Details

We consistently use five random seeds, ranging from 1 to 5, for all algorithms. The sampled labelled data is the same for all algorithms for a given seed. The development and test sets remain unchanged across all labelled and unlabelled data sizes.

Our model implementation uses open-source libraries including HuggingFace Transformers2, Fairseq3, and USB4. Our TAPT experiments are performed on 8x32GB V100 GPUs, with a batch size of 16 per device and 2 gradient accumulation steps.

Table 12 lists the hyperparameters used for the TAPT phase. Table 13 lists the hyperparameters used for the fine-tuning phase. Table 14 lists the hyperparameters used for the ST approaches.

2 [Link]
3 [Link]
4 [Link]
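As a minimal illustration of the seeding protocol described above: the labelled subset depends only on the seed, so every method sees identical labelled examples for a given seed. The helper and `train_examples` below are our own placeholders.

    import random
    from transformers import set_seed

    def sample_labelled(train_examples, n_labelled, seed):
        # The labelled subset is a function of the seed alone, so every
        # algorithm (ST, TAPT, SUPERVISED) starts from the same examples.
        return random.Random(seed).sample(train_examples, n_labelled)

    for seed in range(1, 6):       # five random seeds, 1 to 5
        set_seed(seed)             # seeds Python, NumPy and PyTorch together
        labelled = sample_labelled(train_examples, n_labelled=40, seed=seed)
        # ... train each method from this identical starting point ...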
Table 6: Examples for datasets.

Dataset Example
IMDB I watched this movie after seeing other comments on IMDb, even convincing my wife that it was a "unique horror
movie." I wanted to like this movie, but was unable [Link] "love story" was good, but the horror aspect was quite bad. If
the story was just about a young man who fell in love with a girl suffering from parasomnia, then it would have been a
better [Link] care centre stretched credulity well past the limits, in fact it was quite ridiculous. The doctor happily
ignors privacy laws and professionalism. A nurse goes into a room for a routine feeding of a dangerous patient (without
security escort), and drops the tray and runs out of the room screaming for no apparent reason. The forensic patient (and
the film’s villain) is tied up in a standing position fully clothed - apparently for years? None of it makes much [Link]
movie even had some actors that I’ve liked in other things, such as the detectives, but still I can’t recommend this movie.
SST-2 a rewarding work of art for only the most patient and challenge-hungry moviegoers.
AG NEWS Teen flies in plane's landing gear. A homeless teenager who hid in the landing gear of a passenger plane survived a 700-kilometre flight across south-western China but his companion fell and probably died, state media reported on Friday.
A MAZON R EVIEW THIS is MUSIC at its BEST. Rob Dougan has done it. He’s crafted musical perfection, or close to it anyway. I have
finally found the music I’ve been waiting for my whole life in this album - Rob D you are a genius. I think a lot of us
wanted to know more about this guy as soon as we heard the track playing to the "Woman in the Red Dress" scene. Now
I know why the Wachowski brothers have enlisted his musical talents to flesh out their movies. I know I should be trying
to write a more helpful, objective review but I can do nothing but wax poetic for Rob Dougan and his debut album. He
has mixed classical melodies with awesome electric beats and it all comes together in an audio orgy. Just buy the album
already and let’s get Rob some more mainstream recognition.
YAHOO ! A NSWER Does anybody know a great deal about angels? I’m looking for names, if they’re good or bad, what they look like, etc.
The more detail the better. All religions accepted

Table 7: Similarity analysis between IMDB and A MAZON R EVIEW with four examples that highlight the overlap.

IMDB A MAZON R EVIEW


I loved this movie since I was 7 and I saw it on the opening day. This is a very touching, spiritual movie! When I first saw this
It was so touching and beautiful. I strongly recommend seeing film, [...]. I was deeply moved by this motion picture, and the
for all. It’s a movie to watch with your family by far. My DVD brings the story to your own home. The bonus materials
MPAA rating: PG-13 for thematic elements, prolonged scenes could be better, but the main part of the DVD is the actual movie.
of disastor, nudity/sexuality and some language. Great, great, great film... [...]
Pacino is over-the-top but to good effect as he’s clearly having Makes a great gift! We bought this book for my dad for Father’s
loads of fun. Beatty is great [...] The lighting, velvet overtones Day this year, and thought he would have fun reading it since
and smog/smoke combine to create a great [Link] are he has four granddaughters. He loved it and has even selected
some really funny cameos [...] Highly recommended. 4.5/5 stories to read to the girls during over-nights with Grandpa and
stars. [...] Grandma. I highly recommend it as a great gift.
The late [...] scripted this tale of terror and it was absolutely Movia ... please .... This movie is a masterpiece of terror & sus-
one of the scariest movies I ever saw as a kid. (I had to walk pence & Beautifully filmed & [Link] to reality are
MILES just to see a movie, and it was usually dark when I not allowed when reviewing films of this caliber. Your reaction
emerged from the theater; seeing a horror movie was always (though it MAY be sarcastic) is EXACT proof of it’s genius!
unnerving [...] Watch it again...and this time....bask in all it’s glory!
Fabulous actors, beautiful scenery, stark reality [...] I tried to Well worth the import price. My first impression of this album
buy the video for several years, finally bought it used from a was a good one, but as time went on it came to grow on me more
video store that went out of business. But Yippee! The DVD is and more. This is certainly one of the better Costes albums. The
now for sale, I purchased it on [Link]. Not cheap, but well mixing is nothing revolutionary, but it is well done and all tracks
worth it to me. [...] flow into each other very well. [...].

Table 8: Results on the effect of low unlabelled sizes on ST and TAPT. Failure means performing worse than SUPERVISED.

    Task             #Unl.            #Lab.                   Prob. of ST Failure   Prob. of TAPT Failure
    IMDB             100, 500, 2k     10, 20, 200, 1k         6/18 (33%)            0/9 (0%)
    SST-2            100, 500, 2k     40, 200, 1k, 5k         4/6 (67%)             0/3 (0%)
    AMAZON REVIEW    1k, 5k, 20k      100, 500, 2k, 10k       10/18 (56%)           0/9 (0%)
    YAHOO! ANSWER    1k, 5k, 20k      20, 100, 500, 2k, 10k   13/24 (54%)           0/12 (0%)
Table 9: We further verify our conclusions with VAT, DASH, and FIXMATCH. We report the average Macro-F1 score on the test set across five seeds, with standard deviations reported after ±. Blue represents the best results for each row.

    Dataset          #Unl.  #Lab.  VAT         FIXMATCH    DASH        FLEXMATCH   ADAMATCH    TAPT       SUPERVISED
    IMDB             100    4      33.5±0.2    33.4±0.1    33.4±0.1    35.7±4.2    34.1±0.7    61.8±6.7   59.4±4.8
                     100    10     61.6±20.1   45.4±21.6   34.7±2.2    49.0±19.9   52.4±21.0   75.5±6.9   71.8±8.5
                     100    20     87.1±2.2    64.6±16.5   67.8±16.6   85.5±2.9    79.1±7.6    85.5±1.0   84.1±1.9
                     500    4      33.4±0.0    33.4±0.1    33.4±0.1    33.4±0.1    33.6±0.3    63.4±7.2   58.2±7.1
                     2k     4      33.3±0.0    33.3±0.0    33.3±0.0    33.3±0.0    33.3±0.0    63.1±6.2   60.9±5.6
                     10k    4      33.3±0.0    33.5±0.3    33.3±0.0    34.0±1.2    33.6±0.4    64.1±8.9   62.4±7.9
                     23k    4      33.3±0.0    33.3±0.0    57.4±29.4   45.3±23.9   33.3±0.0    68.8±5.6   65.6±10.4
    SST-2            100    40     63.3±10.6   46.9±9.7    47.9±7.0    57.2±4.5    51.0±14.0   78.7±2.5   76.4±3.7
                     500    40     55.7±16.8   53.8±8.9    51.2±10.0   67.7±10.7   59.1±11.4   83.3±4.8   72.9±7.9
                     500    200    83.0±1.6    84.5±2.8    82.6±3.5    83.8±3.0    87.4±1.9    88.8±0.9   88.3±0.9
                     2k     40     55.9±24.2   36.4±3.0    35.3±2.0    56.6±6.7    49.3±13.8   79.3±5.9   71.7±8.2
                     10k    40     73.5±20.5   38.9±11.4   35.6±2.6    56.9±12.5   36.2±2.9    85.9±1.0   78.5±7.5
                     60k    40     79.6±13.4   32.6±1.7    33.4±0.6    40.6±7.7    42.6±13.3   82.6±4.0   75.3±7.2
    AMAZON REVIEW    1k     20     13.5±5.2    14.9±5.6    20.3±3.0    25.8±3.2    20.7±1.1    32.0±1.8   32.5±2.2
                     1k     100    46.1±2.2    36.3±3.1    35.3±6.2    43.4±1.7    40.3±2.2    48.5±0.9   48.2±2.2
                     1k     500    52.6±0.2    50.8±1.5    49.5±1.0    54.1±1.0    52.8±1.1    55.9±0.3   55.3±0.5
                     5k     20     15.5±7.8    13.5±3.3    22.2±5.2    23.2±7.3    16.9±6.9    32.8±3.4   32.3±2.5
                     20k    20     19.3±7.5    15.2±3.9    20.5±6.4    19.1±10.0   19.3±6.3    32.0±3.2   31.6±3.6
                     100k   20     14.1±7.3    11.9±2.9    20.7±5.2    15.3±2.6    12.5±3.7    30.7±3.6   30.8±3.9
                     250k   20     10.3±5.0    10.9±3.6    22.0±5.7    22.7±4.9    14.4±5.6    30.2±2.4   32.1±3.1
    YAHOO! ANSWER    1k     10     1.9±0.1     2.0±0.1     4.6±2.9     15.7±2.6    18.8±7.9    29.6±5.8   23.5±4.5
                     1k     20     6.7±2.8     10.1±4.2    9.6±3.2     32.7±9.1    28.8±5.8    38.9±4.1   34.1±3.6
                     1k     100    55.2±1.7    46.9±4.4    45.3±3.7    54.2±1.4    53.9±1.3    59.7±0.8   57.4±1.6
                     1k     500    59.2±0.4    61.6±0.6    60.7±1.3    61.9±1.1    61.5±0.9    65.8±0.3   65.5±0.2
                     5k     10     1.8±0.0     3.2±2.6     3.7±2.7     16.4±10.8   17.8±11.7   31.4±5.1   25.7±3.9
                     20k    10     2.4±0.9     2.0±0.3     4.9±3.1     7.3±4.7     25.2±12.2   32.4±5.6   27.2±4.4
                     100k   10     2.3±0.6     3.8±2.5     3.4±2.9     2.9±1.1     17.7±11.4   30.8±3.8   28.0±5.0
                     500k   10     2.0±0.4     1.8±0.0     2.6±1.2     2.5±0.9     14.3±6.0    27.3±4.6   24.7±4.8

Table 10: Results of ADAMATCH+TAPT and FLEXMATCH+TAPT on YAHOO! ANSWER with two different labelled sizes.

                          YAHOO! ANSWER
                          500        2000
    ADAMATCH              68.0±0.7   69.5±0.3
    + TAPT                68.2±1.0   69.8±0.3
    FLEXMATCH             66.6±0.7   68.7±0.4
    + TAPT                66.7±1.2   69.0±0.5
    SUPERVISED            65.4±0.3   68.5±0.3
    + TAPT                68.8±0.7   71.5±0.3
    FULLY-SUPERVISED      75.3±0.2
    + TAPT                75.4±0.1

Table 11: We further verify our conclusion on FLEXMATCH+TAPT. We report the average Macro-F1 score on the test set across five seeds, with standard deviations reported after ±. Blue represents the best results for each row.

    Dataset          #Unl.  #Lab.  FLEXMATCH+TAPT   FLEXMATCH   TAPT       SUPERVISED
    YAHOO! ANSWER    1k     10     17.0±4.9         15.7±2.6    29.6±5.8   23.5±4.5
                     1k     20     39.4±2.0         32.7±9.1    38.9±4.1   34.1±3.6
                     1k     100    55.2±1.8         54.2±1.4    59.7±0.8   57.4±1.6
                     1k     500    62.0±0.7         61.9±1.1    65.8±0.3   65.5±0.2
                     20k    10     4.0±1.4          7.3±4.7     32.4±5.6   27.2±4.4
                     100k   10     5.1±6.1          2.9±1.1     30.8±3.8   28.0±5.0
                     500k   10     2.5±1.1          2.5±0.9     27.3±4.6   24.7±4.8
Hyperparameter Assignment
number of steps 100 epochs
batch size 256
maximum learning rate 1e-06, 1e-4
learning rate optimizer AdamW
Adam epsilon 1e-6
Adam beta weights 0.9, 0.98
learning rate scheduler Warmup linear
Weight decay 0.01
Warmup proportion 0.06
learning rate decay linear

Table 12: Hyperparameters for task-adaptive pre-training. The learning rate and unlabelled size are tightly connected and need to be adjusted together. We generally recommend increasing the learning rate as you increase the unlabelled size. Different from its predecessor, BERT (Devlin et al., 2019), where the next sentence prediction objective is used, RoBERTa (Liu et al., 2019) is only trained with the MLM objective (i.e., cross-entropy loss on predicting randomly masked tokens), dynamically changing the masking pattern applied to the training examples and typically using a masking probability of 0.15.
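As an illustration of the dynamic masking described above, a minimal HuggingFace setup might look as follows. This is a sketch, not the released training script; only the masking probability of 0.15 and the RoBERTa backbone are taken from the paper.

    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling)

    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModelForMaskedLM.from_pretrained("roberta-base")

    # RoBERTa-style dynamic masking: a fresh random 15% of tokens is masked
    # every time a batch is collated, rather than once during preprocessing.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )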

Hyperparameter Assignment
number of steps 10 or 50 epochs
batch size 16 or 32
maximum learning rate 2e-05
learning rate optimizer AdamW
maximum sequence length 256
learning rate scheduler Warmup linear
Warmup proportion 0.06
learning rate decay linear

Table 13: Hyperparameters for fine-tuning. More epochs are used when the labelled size is low.
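A sketch of the optimizer and schedule from Table 13, assuming `model` and `num_training_steps` are defined elsewhere; this is illustrative rather than the released training loop.

    import torch
    from transformers import get_linear_schedule_with_warmup

    # Matches Table 13: AdamW, peak learning rate 2e-05, 6% linear warmup,
    # followed by linear decay to zero.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-05)
    num_warmup_steps = int(0.06 * num_training_steps)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps,
    )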

Hyperparameter Assignment
number of steps 25 600 or 51 200 steps
batch size 16
maximum learning rate 2e-05
learning rate optimizer AdamW
maximum sequence length 256
learning rate scheduler Warmup linear
Warmup proportion 0.05
learning rate decay linear

Table 14: Hyperparameters for self-training. Algorithm-specific hyperparameters will be released in configuration files with the code.
