Semi-Supervised Learning Language Model
Zhengxiang Shi1* Francesco Tonolini2 Nikolaos Aletras2,3 Emine Yilmaz1,2 Gabriella Kazai2 Yunlong Jiao2

1University College London, London, United Kingdom
2Amazon, London, United Kingdom
3University of Sheffield, Sheffield, United Kingdom

[Link].19@[Link]
{tonolini,eminey,aletras,gkazai,jyunlong}@[Link]
Table 1: Statistics of datasets. |Y|: # of classes for classification tasks. L: average # of words in input sentence(s).
Note that we only sample examples from the original training set in our experiments.
Table 2: Performance of TAPT, ST approaches and the baselines across five datasets using two different sizes of labelled training data. We report average Macro-F1 on the test set across five seeds, with standard deviations in subscripts. Blue and orange represent the best and second-best performance in a column respectively.
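Throughout, results are reported as Macro-F1 averaged over five seeds with a standard deviation. As a reference for how such numbers are produced, here is a minimal, self-contained sketch; the toy labels and predictions below are illustrative, not the paper's data.

```python
from statistics import mean, stdev

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return mean(f1s)

# One test-set prediction per seed (toy data, two classes).
y_true = [0, 0, 1, 1, 1, 0]
runs = {1: [0, 0, 1, 1, 0, 0], 2: [0, 1, 1, 1, 1, 0]}  # seed -> predictions
scores = [macro_f1(y_true, pred, classes=[0, 1]) for pred in runs.values()]
print(f"{mean(scores):.3f} ± {stdev(scores):.3f}")
```

Macro-F1 weights every class equally, which is why it is preferred over accuracy for the imbalanced label distributions that arise when sampling small labelled sets.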
Figure 1: The effect of labelled size on TAPT and ST. Average test Macro-F1 score over 5 seeds is reported. From left to right, TAPT and ST utilize 23k, 60k, 100k, 250k, and 500k unlabelled samples respectively.
Figure 3: Performance difference between TAPT and ST with varying labelled and unlabelled sizes on IMDB, SST-2, Amazon Review and Yahoo! Answer. Positive values indicate that TAPT performs better, while negative values indicate that ST performs better. Average Macro-F1 score on test sets over five seeds is reported.
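The cell-wise comparison behind Figure 3 can be sketched in a few lines; the scores below are toy stand-ins loosely based on the reported ranges, not the paper's per-cell results.

```python
# Each cell of the heatmap: (labelled_size, unlabelled_size) -> per-method
# Macro-F1 averaged over seeds (hypothetical values for illustration).
cells = {
    (500, 1000): {"tapt": 55.9, "flexmatch": 54.1, "adamatch": 52.8},
    (20, 1000):  {"tapt": 32.0, "flexmatch": 25.8, "adamatch": 20.7},
}

def tapt_minus_best_st(cell):
    """Positive -> TAPT wins the cell; negative -> the best ST approach wins."""
    best_st = max(cell["flexmatch"], cell["adamatch"])
    return cell["tapt"] - best_st

diffs = {k: round(tapt_minus_best_st(v), 1) for k, v in cells.items()}
print(diffs)  # {(500, 1000): 1.8, (20, 1000): 6.2}
```

Taking the maximum over FlexMatch and AdaMatch before differencing matches the paper's convention of letting the stronger ST variant represent ST in each cell.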
5 Exploring the limits of ST and TAPT

In §4, our experimental results showed inconsistent results across datasets. For example, ST performs better on IMDB, while TAPT achieves better results on Amazon Review and Yahoo! Answer. We hypothesize that this might be attributed to the exposure to different sizes of labelled or unlabelled data. To verify this hypothesis and shed light on the differences in performance between datasets, we compare TAPT and ST (using AdaMatch and FlexMatch as representatives) by sampling different labelled and unlabelled sizes in IMDB, SST-2, Amazon Review and Yahoo! Answer.

Figure 4: The performance of TAPT against the Supervised baseline in the low-resource setting of unlabelled data. From left to right, TAPT utilizes 100, 100, 1,000, and 1,000 unlabelled samples respectively.

Figure 3 visualizes the differences in performance between TAPT and ST, where each cell represents the Macro-F1 performance difference of TAPT over ST (averaged across five seeds). In each case, the highest performance among FlexMatch and AdaMatch is selected to represent the performance of ST. Overall, we observe that: (1) TAPT improves the fine-tuning performance even with a few hundred unlabelled examples; and (2) TAPT performs more stably across the different labelled and unlabelled data sizes than ST approaches. Below we provide a comprehensive analysis of the impact of labelled and unlabelled sizes.

#1. TAPT works even with a few hundred unlabelled samples. It is generally assumed that TAPT requires a large amount of unlabelled data to perform well (e.g. Li et al., 2021b; Hou et al., 2022). However, we surprisingly observe that TAPT can bring a substantial improvement over the Supervised baseline even with a relatively small number of unlabelled samples, as shown in Figure 5. To explore the effectiveness of TAPT over Supervised in the low-resource setting of unlabelled data, we select the performance of TAPT and Supervised from the first column (the lowest unlabelled size) for each dataset in Figure 3 and plot their average performance over different labelled sizes. Figure 4 shows that TAPT improves over the Supervised baseline with just one hundred or one thousand samples. For instance, TAPT achieves a 5.5% increase in F1 score compared to the Supervised baseline when using only 1k unlabelled samples on Yahoo! Answer. Additionally, this performance is achieved without the need for a large number of tokens in each sample, as training samples from SST-2 contain, on average, only 9 tokens and training samples from Yahoo! Answer contain about 32 tokens (see examples in Table 6 of the Appendix).

#2. Scarce labelled data and adequate unlabelled data. TAPT appears to be a more favourable choice than ST approaches in this setting. The bottom of each sub-figure in Figure 3 shows a clear labelled-size boundary, below which FlexMatch and AdaMatch are outperformed by TAPT by a large margin, regardless of the dataset and the unlabelled size used. This suggests that ST might not be able to efficiently handle large amounts of unlabelled data if the labelled data do not provide adequate information. This might be attributed to confirmation bias (Tarvainen and Valpola, 2017; Arazo et al., 2020), which results from the accumulation of errors in the iterative ST process caused by incorrect pseudo-labels.

The specific value of the adequate labelled-size boundary for ST approaches depends on the nature of the dataset. For example, even though both IMDB and SST-2 are binary classification tasks for movie review sentiment analysis, the labelled-size boundary for SST-2 is higher (40 > 4), indicating that this boundary tends to increase as the task becomes more challenging. While it may be easy to obtain dozens of labelled examples in this case, when the task becomes more intricate or contains noisy weak labels, it is important to be aware of this potential issue with ST approaches. TAPT could serve as an alternative in situations where collecting adequate labelled data for training is costly. We provide specific values of the performance of ST and TAPT, and further verify that this finding applies to other ST approaches, in Appendix §D.

#3. Adequate labelled data and scarce unlabelled data. In this setting, TAPT is more robust, while ST has a greater chance of performing worse than the Supervised baseline. In Figure 5, we plot the performance of ST approaches and TAPT against five different sizes of unlabelled data, grouped by size (using similar colours). We note that ST approaches perform worse than their corresponding Supervised baselines (represented by horizontal lines) until a certain amount of unlabelled data has been reached. For example, when the labelled size is 500, ST requires about 20k unlabelled samples to achieve the corresponding Supervised baseline performance on Yahoo! Answer. On the other hand, TAPT generally outperforms the Supervised baselines, demonstrating its robustness across various unlabelled sizes.

Table 3: Test results on IMDB with 4 fixed labelled samples.

#Unl.     | 10        | 50       | 100       | 500
FlexMatch | 57.3±17.9 | 35.2±3.4 | 45.1±22.5 | 33.4±0.1
AdaMatch  | 53.3±22.1 | 36.8±6.1 | 33.5±0.2  | 33.6±0.3

Figure 5: The performance of ST and TAPT using different unlabelled sizes. Average test results across five seeds are reported, where the best result from FlexMatch and AdaMatch is selected to represent ST.

To further quantify the model performance in the case of scarce unlabelled and adequate labelled data, we choose the three lowest unlabelled sizes (the first three columns), excluding the lowest labelled size (the last row), in Figure 3 for each dataset. Our analysis shows that ST has a 67%, 56% and 54% probability of falling below the Supervised baselines on SST-2, Amazon Review, and Yahoo! Answer respectively. Even on IMDB, where ST generally performs well, it still has a 33% probability of falling behind Supervised. In contrast, TAPT never performs worse than Supervised in those cases. We provide computation details and comparative statistics in Appendix §C.

The specific value of the adequate unlabelled-size boundary for ST approaches depends on the nature of the dataset as well as on the labelled size. Figure 5 illustrates that as the size of the labelled data increases, ST approaches require more unlabelled data to surpass the Supervised baselines. For example, on Amazon Review, ST trained with 100 labelled samples requires about 5k unlabelled samples to perform better than Supervised, while ST trained with 10k labelled samples requires about 100k unlabelled samples. Adjusting the unlabelled size accordingly might be conducive to exploiting the full potential of ST approaches.

#4. Scarce labelled and unlabelled data. When the labelled data is insufficient, increasing the unlabelled size is not helpful or even detrimental to ST
Figure 6: Results of UDA experiments. Legends indicate the domains of the labelled training data. Orange/green represents the performance with/without domain shift. Average Macro-F1 score on test sets over five seeds is reported.
Train (Lab.) | Train (Unl.)  | #Lab. | FlexMatch         | AdaMatch          | TAPT             | Supervised
IMDB         | IMDB          | 100   | 93.4±0.1          | 94.7±0.2          | 90.9±0.6         | 88.7±0.2  ★
IMDB         | SST-2         | 100   | 89.1±1.2 (↓4.6%)  | 87.6±2.2 (↓7.5%)  | 89.9±0.6 (↓1.1%) | 88.7±0.2
IMDB         | Amazon Review | 100   | 92.1±0.7 (↓1.4%)  | 92.4±0.2 (↓2.4%)  | 91.4±0.3 (↑0.6%) | 88.7±0.2
IMDB         | IMDB          | 200   | 93.5±0.1          | 93.6±0.1          | 91.8±0.3         | 90.3±0.4  ★
IMDB         | SST-2         | 200   | 89.5±2.4 (↓4.3%)  | 88.9±1.0 (↓5.0%)  | 90.3±0.4 (↓1.6%) | 90.3±0.4
IMDB         | Amazon Review | 200   | 92.5±0.4 (↓1.1%)  | 92.7±0.5 (↓1.0%)  | 92.1±0.2 (↑0.3%) | 90.3±0.4
SST-2        | SST-2         | 100   | 83.0±8.3          | 83.1±4.4          | 85.4±2.4         | 84.0±2.7  ★
SST-2        | IMDB          | 100   | 46.7±2.1 (↓43.7%) | 49.2±7.3 (↓40.8%) | 88.5±0.9 (↑3.6%) | 84.0±2.7
SST-2        | Amazon Review | 100   | 46.4±4.9 (↓44.1%) | 48.2±11.0 (↓42.0%)| 88.9±0.9 (↑4.1%) | 84.0±2.7
SST-2        | SST-2         | 200   | 87.2±3.9          | 89.5±0.9          | 88.6±0.9         | 86.8±0.3  ★
SST-2        | IMDB          | 200   | 62.7±7.4 (↓28.1%) | 61.0±2.8 (↓31.8%) | 89.1±1.1 (↑0.6%) | 86.8±0.3
SST-2        | Amazon Review | 200   | 61.8±7.7 (↓29.1%) | 56.0±10.3 (↓17.4%)| 89.4±1.0 (↑0.9%) | 86.8±0.3

Table 4: Results of STL experiments. We report the average Macro-F1 score on the test set across five seeds, with standard deviations after ±. Blue represents the best result for each row. Stars (★) mark rows without domain shift. Arrows indicate the change in performance relative to the starred row's result within each column.
Figure 7: Vocabulary overlap (%) across datasets.
tial reason for the performance of ST and TAPT in the STL setting (§6).

B ST Frameworks

VAT. VAT (Miyato et al., 2018) proposed a regularization technique that forces pairs of data points that are very close in the input space to be close to each other in the output space. VAT adds a small perturbation to the input data and forces the model to produce similar predictions.

FixMatch. FixMatch (Sohn et al., 2020) generates artificial labels using both consistency regularization and pseudo-labelling, where the artificial labels are produced based on weakly-augmented unlabelled data. These artificial labels are then used as targets to train the model on strongly-augmented unlabelled data. FixMatch only retains an artificial label if the model assigns a high probability to one of the possible classes.

DASH. DASH (Xu et al., 2021b) extends FixMatch by introducing a mechanism with a dynamically adjusted loss threshold to select a subset of training examples from the unlabelled data for performing SSL.

FlexMatch. FlexMatch (Zhang et al., 2021) also extends FixMatch, introducing the concept of curriculum learning (Bengio et al., 2009) to flexibly adjust the thresholds for different classes at each time step and to select the unlabelled data and pseudo-labels that are more likely to be informative.

AdaMatch. AdaMatch (Berthelot et al., 2022) aims to solve domain adaptation problems in SSL and to build a high-accuracy model that trains and tests on different data distributions. AdaMatch builds on FixMatch and introduces a relative confidence threshold and a modified distribution alignment from Berthelot et al. (2019a).

C Probability of performing worse than Supervised

In §5, we discussed that we select the model performance with the three lowest unlabelled sizes (the first three columns in Figure 3) for each dataset and exclude the model performance with the lowest labelled size (the last row in Figure 3). This results in 9 cells for IMDB, 3 cells for SST-2, 9 cells for Amazon Review, and 12 cells for Yahoo! Answer, where TAPT has one run per cell and ST (FlexMatch and AdaMatch) has two runs per cell. We consider a run to be a failure if its performance is worse than that of its corresponding Supervised baseline.

Table 8 lists the probability of ST and TAPT falling below the Supervised baseline with the selected combinations of labelled and unlabelled sizes.

D Further validations with other ST approaches

In this section, we conduct additional experiments on ST approaches, including VAT, DASH, and FixMatch, to demonstrate that our findings are applicable to other ST approaches as well.

In Table 9, we select several combinations of labelled and unlabelled sizes on the IMDB, SST-2, Amazon Review, and Yahoo! Answer datasets. Our experimental results show that the other ST approaches do not perform well when the labelled size is low, and that they have a high probability of performing worse than the Supervised baselines when the unlabelled size is low. This suggests that poor performance when the labelled or unlabelled size is inadequate may be a common problem of state-of-the-art ST approaches.

F Implementation Details

We consistently use five random seeds, ranging from 1 to 5, for all algorithms. The sampled labelled data is the same for all algorithms for a given seed. The development and test sets remain unchanged across all labelled and unlabelled data sizes.

Our model implementation uses open-source libraries including HuggingFace Transformers, Fairseq, and USB. Our TAPT experiments are performed on 8x32GB V100 GPUs, with a batch size of 16 per device and 2 gradient accumulation steps.

Table 12 lists the hyperparameters used for the TAPT phase. Table 13 lists the hyperparameters used for the fine-tuning phase. Table 14 lists the hyperparameters used for the ST approaches.
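To make the pseudo-labelling mechanics of the ST frameworks in Appendix B concrete, here is a minimal sketch of FixMatch-style confidence filtering and a FlexMatch-style per-class threshold adjustment; the threshold value, toy probabilities, and the `learning_effect` scaling are illustrative simplifications, not the exact formulations or settings used in the experiments.

```python
# FixMatch keeps a pseudo-label only when the model's confidence on the
# weakly-augmented view clears a threshold; FlexMatch lowers that threshold
# for classes the model is currently learning slowly.
TAU = 0.95  # global confidence threshold (illustrative)

def fixmatch_filter(probs):
    """probs: class-probability list for one weakly-augmented example."""
    conf = max(probs)
    label = probs.index(conf)
    return (label, conf) if conf >= TAU else None

def flexmatch_filter(probs, learning_effect):
    """learning_effect[c] in [0, 1]: relative learning status of class c
    (fewer confident predictions so far -> lower value -> lower threshold)."""
    conf = max(probs)
    label = probs.index(conf)
    class_tau = TAU * learning_effect[label]
    return (label, conf) if conf >= class_tau else None

probs = [0.90, 0.07, 0.03]
print(fixmatch_filter(probs))                    # None: 0.90 < 0.95
print(flexmatch_filter(probs, [0.9, 1.0, 1.0]))  # (0, 0.9): 0.90 >= 0.855
```

Examples that pass the filter are then used as training targets on their strongly-augmented views; confirmation bias arises when wrong pseudo-labels pass and get reinforced over iterations.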
Dataset Example
IMDB I watched this movie after seeing other comments on IMDb, even convincing my wife that it was a "unique horror
movie." I wanted to like this movie, but was unable [Link] "love story" was good, but the horror aspect was quite bad. If
the story was just about a young man who fell in love with a girl suffering from parasomnia, then it would have been a
better [Link] care centre stretched credulity well past the limits, in fact it was quite ridiculous. The doctor happily
ignors privacy laws and professionalism. A nurse goes into a room for a routine feeding of a dangerous patient (without
security escort), and drops the tray and runs out of the room screaming for no apparent reason. The forensic patient (and
the film's villain) is tied up in a standing position fully clothed - apparently for years? None of it makes much [Link]
movie even had some actors that I've liked in other things, such as the detectives, but still I can't recommend this movie.
SST-2 a rewarding work of art for only the most patient and challenge-hungry moviegoers.
AG News Teen flies in plane's landing gear. A homeless teenager who hid in the landing gear of a passenger plane survived
a 700-kilometre flight across south-western China but his companion fell and probably died, state media reported on
Friday.
Amazon Review THIS is MUSIC at its BEST. Rob Dougan has done it. He's crafted musical perfection, or close to it anyway. I have
finally found the music I've been waiting for my whole life in this album - Rob D you are a genius. I think a lot of us
wanted to know more about this guy as soon as we heard the track playing to the "Woman in the Red Dress" scene. Now
I know why the Wachowski brothers have enlisted his musical talents to flesh out their movies. I know I should be trying
to write a more helpful, objective review but I can do nothing but wax poetic for Rob Dougan and his debut album. He
has mixed classical melodies with awesome electric beats and it all comes together in an audio orgy. Just buy the album
already and let's get Rob some more mainstream recognition.
Yahoo! Answer Does anybody know a great deal about angels? I'm looking for names, if they're good or bad, what they look like, etc.
The more detail the better. All religions accepted
Table 7: Similarity analysis between IMDB and Amazon Review with four examples that highlight the overlap.
Table 8: Results on the effect of low unlabelled sizes on ST and TAPT. Failure means performing worse than Supervised.
Dataset       | #Unl. | #Lab. | VAT       | FixMatch  | DASH      | FlexMatch | AdaMatch  | TAPT     | Supervised
IMDB          | 100   | 4     | 33.5±0.2  | 33.4±0.1  | 33.4±0.1  | 35.7±4.2  | 34.1±0.7  | 61.8±6.7 | 59.4±4.8
IMDB          | 100   | 10    | 61.6±20.1 | 45.4±21.6 | 34.7±2.2  | 49.0±19.9 | 52.4±21.0 | 75.5±6.9 | 71.8±8.5
IMDB          | 100   | 20    | 87.1±2.2  | 64.6±16.5 | 67.8±16.6 | 85.5±2.9  | 79.1±7.6  | 85.5±1.0 | 84.1±1.9
IMDB          | 500   | 4     | 33.4±0.0  | 33.4±0.1  | 33.4±0.1  | 33.4±0.1  | 33.6±0.3  | 63.4±7.2 | 58.2±7.1
IMDB          | 2k    | 4     | 33.3±0.0  | 33.3±0.0  | 33.3±0.0  | 33.3±0.0  | 33.3±0.0  | 63.1±6.2 | 60.9±5.6
IMDB          | 10k   | 4     | 33.3±0.0  | 33.5±0.3  | 33.3±0.0  | 34.0±1.2  | 33.6±0.4  | 64.1±8.9 | 62.4±7.9
IMDB          | 23k   | 4     | 33.3±0.0  | 33.3±0.0  | 57.4±29.4 | 45.3±23.9 | 33.3±0.0  | 68.8±5.6 | 65.6±10.4
SST-2         | 100   | 40    | 63.3±10.6 | 46.9±9.7  | 47.9±7.0  | 57.2±4.5  | 51.0±14.0 | 78.7±2.5 | 76.4±3.7
SST-2         | 500   | 40    | 55.7±16.8 | 53.8±8.9  | 51.2±10.0 | 67.7±10.7 | 59.1±11.4 | 83.3±4.8 | 72.9±7.9
SST-2         | 500   | 200   | 83.0±1.6  | 84.5±2.8  | 82.6±3.5  | 83.8±3.0  | 87.4±1.9  | 88.8±0.9 | 88.3±0.9
SST-2         | 2k    | 40    | 55.9±24.2 | 36.4±3.0  | 35.3±2.0  | 56.6±6.7  | 49.3±13.8 | 79.3±5.9 | 71.7±8.2
SST-2         | 10k   | 40    | 73.5±20.5 | 38.9±11.4 | 35.6±2.6  | 56.9±12.5 | 36.2±2.9  | 85.9±1.0 | 78.5±7.5
SST-2         | 60k   | 40    | 79.6±13.4 | 32.6±1.7  | 33.4±0.6  | 40.6±7.7  | 42.6±13.3 | 82.6±4.0 | 75.3±7.2
Amazon Review | 1k    | 20    | 13.5±5.2  | 14.9±5.6  | 20.3±3.0  | 25.8±3.2  | 20.7±1.1  | 32.0±1.8 | 32.5±2.2
Amazon Review | 1k    | 100   | 46.1±2.2  | 36.3±3.1  | 35.3±6.2  | 43.4±1.7  | 40.3±2.2  | 48.5±0.9 | 48.2±2.2
Amazon Review | 1k    | 500   | 52.6±0.2  | 50.8±1.5  | 49.5±1.0  | 54.1±1.0  | 52.8±1.1  | 55.9±0.3 | 55.3±0.5
Amazon Review | 5k    | 20    | 15.5±7.8  | 13.5±3.3  | 22.2±5.2  | 23.2±7.3  | 16.9±6.9  | 32.8±3.4 | 32.3±2.5
Amazon Review | 20k   | 20    | 19.3±7.5  | 15.2±3.9  | 20.5±6.4  | 19.1±10.0 | 19.3±6.3  | 32.0±3.2 | 31.6±3.6
Amazon Review | 100k  | 20    | 14.1±7.3  | 11.9±2.9  | 20.7±5.2  | 15.3±2.6  | 12.5±3.7  | 30.7±3.6 | 30.8±3.9
Amazon Review | 250k  | 20    | 10.3±5.0  | 10.9±3.6  | 22.0±5.7  | 22.7±4.9  | 14.4±5.6  | 30.2±2.4 | 32.1±3.1
Yahoo! Answer | 1k    | 10    | 1.9±0.1   | 2.0±0.1   | 4.6±2.9   | 15.7±2.6  | 18.8±7.9  | 29.6±5.8 | 23.5±4.5
Yahoo! Answer | 1k    | 20    | 6.7±2.8   | 10.1±4.2  | 9.6±3.2   | 32.7±9.1  | 28.8±5.8  | 38.9±4.1 | 34.1±3.6
Yahoo! Answer | 1k    | 100   | 55.2±1.7  | 46.9±4.4  | 45.3±3.7  | 54.2±1.4  | 53.9±1.3  | 59.7±0.8 | 57.4±1.6
Yahoo! Answer | 1k    | 500   | 59.2±0.4  | 61.6±0.6  | 60.7±1.3  | 61.9±1.1  | 61.5±0.9  | 65.8±0.3 | 65.5±0.2
Yahoo! Answer | 5k    | 10    | 1.8±0.0   | 3.2±2.6   | 3.7±2.7   | 16.4±10.8 | 17.8±11.7 | 31.4±5.1 | 25.7±3.9
Yahoo! Answer | 20k   | 10    | 2.4±0.9   | 2.0±0.3   | 4.9±3.1   | 7.3±4.7   | 25.2±12.2 | 32.4±5.6 | 27.2±4.4
Yahoo! Answer | 100k  | 10    | 2.3±0.6   | 3.8±2.5   | 3.4±2.9   | 2.9±1.1   | 17.7±11.4 | 30.8±3.8 | 28.0±5.0
Yahoo! Answer | 500k  | 10    | 2.0±0.4   | 1.8±0.0   | 2.6±1.2   | 2.5±0.9   | 14.3±6.0  | 27.3±4.6 | 24.7±4.8
Table 10: Results of AdaMatch+TAPT and FlexMatch+TAPT on Yahoo! Answer with two different labelled sizes.

                 | 500      | 2000
AdaMatch         | 68.0±0.7 | 69.5±0.3
 + TAPT          | 68.2±1.0 | 69.8±0.3
FlexMatch        | 66.6±0.7 | 68.7±0.4
 + TAPT          | 66.7±1.2 | 69.0±0.5
Supervised       | 65.4±0.3 | 68.5±0.3
 + TAPT          | 68.8±0.7 | 71.5±0.3
Fully-Supervised | 75.3±0.2
 + TAPT          | 75.4±0.1
Table 11: We further verify our conclusion on FlexMatch+TAPT. We report the average Macro-F1 score on the test set across five seeds, with standard deviations after ±. Blue represents the best result for each row.

Dataset       | #Unl. | #Lab. | FlexMatch+TAPT | FlexMatch | TAPT     | Supervised
Yahoo! Answer | 1k    | 10    | 17.0±4.9       | 15.7±2.6  | 29.6±5.8 | 23.5±4.5
Yahoo! Answer | 1k    | 20    | 39.4±2.0       | 32.7±9.1  | 38.9±4.1 | 34.1±3.6
Yahoo! Answer | 1k    | 100   | 55.2±1.8       | 54.2±1.4  | 59.7±0.8 | 57.4±1.6
Yahoo! Answer | 1k    | 500   | 62.0±0.7       | 61.9±1.1  | 65.8±0.3 | 65.5±0.2
Yahoo! Answer | 20k   | 10    | 4.0±1.4        | 7.3±4.7   | 32.4±5.6 | 27.2±4.4
Yahoo! Answer | 100k  | 10    | 5.1±6.1        | 2.9±1.1   | 30.8±3.8 | 28.0±5.0
Yahoo! Answer | 500k  | 10    | 2.5±1.1        | 2.5±0.9   | 27.3±4.6 | 24.7±4.8
Hyperparameter          | Assignment
number of steps         | 100 epochs
batch size              | 256
maximum learning rate   | 1e-06, 1e-4
learning rate optimizer | AdamW
Adam epsilon            | 1e-6
Adam beta weights       | 0.9, 0.98
learning rate scheduler | warmup linear
weight decay            | 0.01
warmup proportion       | 0.06
learning rate decay     | linear

Table 12: Hyperparameters for task-adaptive pretraining. The learning rate and the unlabelled size are tightly connected and need to be adjusted together; we generally recommend increasing the learning rate as the unlabelled size increases. Different from its predecessor BERT (Devlin et al., 2019), where a next-sentence-prediction objective is used, RoBERTa (Liu et al., 2019) is trained only with the MLM objective (i.e., cross-entropy loss on predicting randomly masked tokens), dynamically changing the masking pattern applied to the training examples and typically using a masking probability of 0.15.
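The dynamic masking behaviour described above can be illustrated with a short sketch; this is a simplified illustration of RoBERTa-style masking (it omits the 80/10/10 mask/random/keep split that the actual MLM objective uses), and the token list is a toy example.

```python
import random

MASK, P_MASK = "<mask>", 0.15

def dynamic_mask(tokens, rng):
    """Mask ~15% of positions; called anew each epoch, so the pattern changes."""
    out, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < P_MASK:
            targets[i] = tok  # the model is trained to predict these tokens
            out.append(MASK)
        else:
            out.append(tok)
    return out, targets

tokens = "the movie was surprisingly good and well acted".split()
rng = random.Random(0)
for epoch in range(2):  # a fresh masking pattern every epoch
    masked, targets = dynamic_mask(tokens, rng)
    print(epoch, masked, targets)
```

In contrast, static masking (as in the original BERT preprocessing) fixes the pattern once during data preparation, so the model sees the same masked positions every epoch.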
Hyperparameter          | Assignment
number of steps         | 10 or 50 epochs
batch size              | 16 or 32
maximum learning rate   | 2e-05
learning rate optimizer | AdamW
maximum sequence length | 256
learning rate scheduler | warmup linear
warmup proportion       | 0.06
learning rate decay     | linear

Table 13: Hyperparameters for fine-tuning. More epochs are used when the labelled size is low.
Hyperparameter          | Assignment
number of steps         | 25,600 or 51,200 steps
batch size              | 16
maximum learning rate   | 2e-05
learning rate optimizer | AdamW
maximum sequence length | 256
learning rate scheduler | warmup linear
warmup proportion       | 0.05
learning rate decay     | linear

Table 14: Hyperparameters for self-training. Algorithm-specific hyperparameters will be released in configuration files with the code.
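The "warmup linear" schedule shared by Tables 12-14 (linear warmup to the maximum learning rate, then linear decay to zero) can be written out explicitly; the default values below are illustrative, not a claim about the exact scheduler implementation used.

```python
def lr_at(step, total_steps, peak_lr=2e-05, warmup_proportion=0.06):
    """Linear warmup to peak_lr over the first fraction of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

total = 25_600  # one of the step budgets listed for ST approaches
print(lr_at(0, total), lr_at(1_536, total), lr_at(25_600, total))
# 0.0 at the first step, peak 2e-05 at the end of warmup, back to 0.0 at the last step
```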