
of task-specific data. Often, collecting, cleaning, and labeling additional data can be costly and time-consuming. So the key question is: how should we increase the user's training data to be enough for fine-tuning?

Data augmentation is a known method that can help effectively expand the training dataset. For natural language processing (NLP) tasks, one can use approaches such as synonym replacement, character replacement (e.g., by intentionally introducing spelling errors), random swapping, and back translation, just to name a few (Wei and Zou, 2019; Belinkov and Bisk, 2017; Coulombe, 2018; Zhang et al., 2018). However, these approaches fail to effectively expand the training data for fine-tuning LLMs in the case of new and specialized tasks, as we will show later in Section 4.3.
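To make these classical baselines concrete, the following is a minimal Python sketch of two of them (random swap and character-level noise), plus a toy synonym replacement in the spirit of EDA (Wei and Zou, 2019). The function names and the tiny synonym table are our own illustration, not code from the paper.

```python
import random

# A toy synonym table for illustration; real implementations (e.g., EDA)
# draw synonyms from a lexical resource such as WordNet.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replacement(words, n=1):
    """Replace up to n words that have an entry in the synonym table."""
    out = words[:]
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_swap(words, n=1):
    """Swap two random positions, n times."""
    out = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def char_noise(word, p=0.1):
    """Introduce spelling errors by dropping characters with probability p."""
    return "".join(c for c in word if random.random() > p)

sentence = "the quick brown fox is happy".split()
print(" ".join(synonym_replacement(sentence)))
print(" ".join(random_swap(sentence)))
print(" ".join(char_noise(w) for w in sentence))
```

Such transformations paraphrase existing examples but cannot add new task knowledge, which is why they fall short on specialized tasks.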
To address this, several recent papers have explored using an LLM to expand the fine-tuning dataset (Dai et al., 2023; Kumar et al., 2020; Zhou et al., 2023; Chen et al., 2023; Cao et al., 2023; Wei et al., 2023; Zhu et al., 2023). This approach has proven to be more effective than traditional data augmentation methods. However, these approaches often apply LLM-based data augmentation to all of the available training dataset, without considering the LLM's prediction accuracy on individual training data points. We have observed that for various reasoning tasks such as arithmetic and reading comprehension, the LLM correctly solves simpler examples in the fine-tuning dataset, but may struggle with harder examples. It is sub-optimal to keep augmenting data points on which the LLM is already achieving high accuracy.

To address these challenges, we introduce LLM2LLM, a new data augmentation framework that uses a teacher LLM to expand the training dataset with a targeted and iterative approach. In more detail, we make the following contributions:

• We propose LLM2LLM, a targeted and iterative LLM-based data augmentation technique that efficiently and effectively augments small task-specific datasets. LLM2LLM achieves this by (1) fine-tuning a student LLM on the initial dataset, (2) evaluating on the training data and extracting data points which the model got incorrect after training, and (3) using a Self-Instruct (Wang et al., 2023) style data augmentation to augment these data points, which are then added back into the training data (Section 3.1).

• We benchmark LLM2LLM on randomly sampled subsets of GSM8K (Cobbe et al., 2021), CaseHOLD (Zheng et al., 2021), SNIPS (Coucke et al., 2018), TREC (Li and Roth, 2002) and SST-2 (Socher et al., 2013) in order to evaluate the effectiveness of our approach in the low-data regime (Section 4.2). Here, we get up to a 24.2% improvement on GSM8K, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2 (Table 1).

• We conduct a series of ablation studies comparing LLM2LLM to several existing baselines as well as to variants of LLM2LLM to evaluate the effectiveness of our design decisions (Section 4.5). We observe that both the iterative and targeted nature of LLM2LLM are critical to improving model performance.

2 Background and Related Work

2.1 Instruction Following LLMs

The earliest works (Wei et al., 2021; Longpre et al., 2023; Chung et al., 2022; Aribandi et al., 2021; Sanh et al., 2021; Muennighoff et al., 2023; Wang et al., 2022b; Mishra et al., 2022; Wang et al., 2022a; Xu et al., 2022) in instruction fine-tuning involved gathering and processing different existing NLP datasets in order to improve the performance of LLMs on a wide range of tasks. Self-Instruct (Wang et al., 2023) removed the reliance on existing datasets by introducing a framework for bootstrapping instruction datasets with the outputs of the model itself. Follow-up work (Ouyang et al., 2022; Taori et al., 2023; Geng et al., 2023; Chiang et al., 2023; Xu et al., 2023; Mukherjee et al., 2023; Mitra et al., 2023; Kang et al., 2023; Nori et al., 2023) took advantage of stronger models (Achiam et al., 2023; Touvron et al., 2023a,b) in order to fine-tune stronger general-purpose instruction-following models.

2.2 Self-Improving LLMs

Various early works (Zelikman et al., 2023; Haluptzok et al., 2023; Zelikman et al., 2022; Madaan et al., 2023; Gulcehre et al., 2023; Singh et al., 2023) explore using self-improvement for fine-tuning LLMs. These works generally filtered the outputs of the model before fine-tuning it on its own outputs. LLM2LLM differs from these methods, as we do not directly fine-tune on the outputs of our own model, and we employ a teacher model to provide feedback in the form of synthetic data.
Algorithm 1 LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement. Given a seed dataset D^0, we finetune the model M^i_student, evaluate, and extract training set data points that the model gets wrong. These are used to generate new training data points using the teacher model M_teacher for the next step.

1: procedure LLM2LLM(M^0_student, M_teacher, D^0)
2:   i ← 0
3:   while i < n do
4:     M^i_student ← Finetune(M^0_student, D^i)
5:     E^i ← Evaluate(M^i_student, D^0)    ▷ Evaluate on seed data
6:     W^i ← Filter(E^i, D^0)              ▷ Keep wrong answers
7:     A^i ← Generate(M_teacher, W^i)      ▷ Augment using teacher
8:     D^{i+1} ← D^i + A^i                 ▷ Append to data
9:     i ← i + 1
10:  end while
11:  Evaluate M_student
12: end procedure
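For readers who prefer runnable code, the following is a minimal Python sketch of this loop. The finetune, evaluate, and generate callables are hypothetical stand-ins for the fine-tuning, evaluation, and teacher-prompting steps; this is our own illustration, not an implementation released with the paper.

```python
def llm2llm(finetune, evaluate, generate, seed_data, num_steps):
    """Minimal sketch of Algorithm 1.

    finetune(data) -> model:   fine-tunes the base student (e.g., Llama-2-7B)
                               from scratch on `data` (see Section 4.5.3).
    evaluate(model, ex) -> bool:  True if the student answers `ex` correctly.
    generate(ex) -> example:   asks the teacher for a conceptually similar
                               but semantically different example (Section B.4).
    """
    data = list(seed_data)                     # D^0
    model = finetune(data)                     # line 4, i = 0
    for _ in range(num_steps):
        # Lines 5-6: evaluate on the seed data only; keep the wrong answers.
        wrong = [ex for ex in seed_data if not evaluate(model, ex)]
        if not wrong:
            break                              # nothing left to target
        # Lines 7-8: one new teacher example per wrong answer, appended to D.
        data += [generate(ex) for ex in wrong]
        model = finetune(data)                 # line 4 of the next iteration
    return model
```

Note that each finetune call here restarts from the base checkpoint, matching the from-scratch strategy the paper later finds to work best.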
seed dataset D, where D potentially has unseen characteristics compared to the pre-training dataset (e.g., a medical dataset with specific terminology, or a private database with specific characteristics). In this case, the model's zero-shot or fine-tuned performance is likely to be unsatisfactory. While strategies to address this challenge have been explored, e.g., through enhanced few-shot learning methods as discussed in Section 2, here we strictly focus on enriching the provided target dataset D with an LLM. This method is orthogonal to the aforementioned techniques, offering a complementary solution that can be applied alongside them.

To enrich D, AugGPT (Dai et al., 2023) has introduced a promising approach that generates additional augmented data by applying a prompted LLM to all available data points in the target training dataset. However, this method falls short by indiscriminately augmenting data without considering the student model's varying performance across different data points. For instance, the model may easily solve the majority of the dataset, but it may struggle with a small subset of more challenging examples. In this case, rather than indiscriminately expanding the dataset by replicating simpler cases, a better augmentation strategy would be to generate more data points that align conceptually with these challenging examples. This is because the former approach could lead to longer training time without noticeable performance improvement.

Here, we propose a more general formulation of an LLM-based data augmentation pipeline that addresses the aforementioned limitation. To do so, we consider the following iterative process:

D^{n+1} = f(M_teacher, M_student, D^n, ..., D^0).    (1)

In Equation (1), M_teacher is the teacher model, M_student is the student model (potentially being fine-tuned over many iterations), n refers to the nth step of data augmentation, D^{n+1} is the new training dataset at the next iteration, and f is the data-generation algorithm. At each step, the teacher model has access to how the student model performs at the nth step (e.g., correct/incorrect labels, or possibly prediction distributions for white-box models), and based on that it can edit training data points for the next iteration.

Note that LLM2LLM is different from knowledge distillation (Hinton et al., 2015). Knowledge distillation is generally applicable to cases where the teacher model has high accuracy on the target data. In contrast, in this case, it is possible that the teacher model also performs sub-optimally on the target data (e.g., in the private database case, where the teacher lacks domain-specific knowledge). However, if the teacher model has enough reasoning capability to produce conceptually similar but semantically different examples when it is given both the prompt and answer, then our framework can improve performance.

In LLM2LLM, we consider a specific instantiation of Equation (1), as discussed next.

3.1 LLM2LLM

The end-to-end algorithm of LLM2LLM is presented in Algorithm 1. Inspired by Self-Instruct (Wang et al., 2023), we use the teacher model M_teacher to generate synthetic data from the data points that the model got incorrect during training in order to target these deficiencies in the student model. In more detail, we first train the baseline student model M_student on the provided target data D^0, and we evaluate its performance (lines 4-5 of Algorithm 1). We then filter the results and keep the incorrect training examples that the student model struggled to answer correctly (W^i in line 6). Then the teacher model is prompted to create additional training data points that are conceptually aligned but semantically different (line 7; see Section B.4 for specifics on the prompt). The teacher model does not necessarily need to be bigger, although that could potentially improve performance. The primary requirement for the teacher model is to have reasoning capability to be able to follow the
[Figure 2: two panels plotting test accuracy (%) against total data size (seed + LLM2LLM data). Left panel: GSM8K with 74 (1%), 149 (2%), 373 (5%), and 747 (10%) seed examples. Right panel: CaseHOLD with 225 (0.5%), 450 (1%), 900 (2%), and 2250 (5%) seed examples.]
Figure 2: LLM2LLM on GSM8K (left) and CaseHOLD (right) with various seed data sizes. Each line shows the test accuracy of the finetuned Llama-2-7B model at each step of LLM2LLM for a given seed dataset size. The first (left-most) data point on each line represents finetuning only on the seed data. Each point afterward
corresponds to the performance after one more iteration of LLM2LLM. The total data size (x-axis) represents the
total amount of seed plus LLM2LLM data that was used to train the model at that step. By applying LLM2LLM
with low amounts of seed data and iteratively improving the training dataset, we can attain significant performance
improvements. In particular, we can see that running LLM2LLM can match or even exceed the performance of
simply annotating more real data in some cases (detailed breakdown provided in Table 1).

W^i during dataset generation to only data from the original seed data allows us to bound the total number of training data points to

|D^j| = n + Σ_{i=0}^{j} n·p_i ≤ n·(1 + j·p_max),

where n is the number of seed examples and p_i is the fraction of them that the student answers incorrectly at step i. This bound grows only linearly with the number of steps. The empirical evaluations shown in Section 4.5.2 (Table 4) corroborate this.
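As a quick illustration of this bound (the numbers here are ours, not from the paper): with n = 100 seed examples and a worst-case wrong-answer fraction of p_max = 0.3, j = 10 steps of LLM2LLM can add at most 10 · 100 · 0.3 = 300 synthetic examples, for a total of 100 · (1 + 10 · 0.3) = 400 training points, i.e., linear rather than exponential growth in the number of steps.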
4 Results

4.1 Experimental Setup

To evaluate the performance of LLM2LLM, we applied our framework to fine-tune Llama-2-7B on various target datasets, including GSM8K (Cobbe et al., 2021), CaseHOLD (Zheng et al., 2021), SNIPS (Coucke et al., 2018), TREC (Li and Roth, 2002) and SST-2 (Socher et al., 2013). We subsampled these datasets with different sampling rates from 0.02% to 50% to evaluate performance across different low-data regimes. Our teacher model for these results is GPT-3.5 (1106 release) unless otherwise specified. We consider several other teacher models, including GPT-4-Turbo, Llama-2-70B (Touvron et al., 2023b), and Airoboros-L2-70B (Durbin, 2023), in Section 4.4. We include a more detailed experimental setup in Section A. Additionally, we conducted experiments (Section B.6) to ensure that the augmented data from the teacher models differs from the test dataset used to evaluate final model accuracy. This addresses the issue of potential test data leakage that could have occurred if the teacher model had been trained on similar data.

4.2 Main Results

Here, we discuss LLM2LLM's performance with varying amounts of training data by presenting results for fine-tuning Llama-2-7B on GSM8K using GPT-3.5 as the teacher model. We then discuss how these trends extend to different datasets (Table 1).

The final model accuracy after applying 10 iterations of LLM2LLM is given in Table 1. For a low-data regime with 74 available examples (i.e., 1% of the GSM8K training dataset), vanilla fine-tuning achieves only 0.99% test accuracy. However, LLM2LLM boosts the accuracy to 19.56% by iteratively generating 391 additional examples based on data points where the model makes mistakes. With slightly more available data of 149 seed examples (i.e., 2% of the training dataset), we can achieve 25.70% accuracy. As shown by the baseline accuracy with 20% data in Table 2, we would need over 10× more training data points to match this accuracy if we relied only on vanilla fine-tuning. We also highlight that LLM2LLM can lead to noticeable gains in data-sufficient regimes (e.g., 100% data), albeit with a smaller improvement over the baseline compared to lower-data regimes.

We observe a similar trend for CaseHOLD, SNIPS, TREC, and SST-2, where LLM2LLM helps improve performance in the low-data regime.
a single iteration, for both the GSM8K and CaseHOLD datasets. As shown in Table 3, using a single augmentation step with a larger amount of augmented data significantly underperforms the alternative of executing 10 iterative steps of LLM2LLM with a smaller number of augmentations per iteration. In particular, on GSM8K, augmenting one data point per example over 10 steps yields a 7.4% higher accuracy than augmenting five data points per example in a single step. Similarly, on CaseHOLD, iterative augmentation of one data point per example over 10 steps results in a 4.6% improvement over a one-shot augmentation with four data points per example. This justifies LLM2LLM's iterative augmentation approach that generates one data point per incorrectly answered example.

Dataset    Steps           Total Aug.   Acc. (%)
GSM8K      1 (one-shot)    490          16.30
GSM8K      10 (iterative)  471          23.73
CaseHOLD   1 (one-shot)    276          59.94
CaseHOLD   10 (iterative)  198          64.50

Table 3: Ablation on the iterative nature of LLM2LLM with 100 seed data points. Steps refers to the total number of augmentation steps in LLM2LLM. For the case of 1 iteration, we prompt the teacher model to generate more samples all at once, whereas in the 10 steps case the teacher model only generates 1 new data point per wrong example. The results clearly show that the latter iterative approach results in better performance.

4.5.2 Data Augmentation with Seed Data vs Augmented Data

In each iteration, LLM2LLM evaluates the student model's performance only on the original seed dataset and generates augmented data from incorrect seed examples. However, a possible alternative is performing evaluation and data augmentation using both seed and previously augmented data. The latter often leads to sub-optimal performance as well as excessive amounts of total augmented data points, as we demonstrate in Table 4. On GSM8K, generating augmented data from the previous iteration's augmented data yields 18.3% accuracy, while using the seed data for further augmentation improves the accuracy to 23.75%. We observe a similar trend for CaseHOLD. As discussed in Section 3.1, a potential reason for the performance drop when using augmented data for further augmentation is a deviation from the original data distribution.

Dataset    Only Aug. Seed Data   Total Aug.   Acc. (%)
GSM8K      ✗                     4302         18.32
GSM8K      ✓                     471          23.75
CaseHOLD   ✗                     351          63.75
CaseHOLD   ✓                     198          64.50

Table 4: Ablation study on whether to augment previously generated LLM2LLM data. Only Aug. Seed Data refers to augmenting only the seed data vs. also re-augmenting the augmented data. Total Aug. refers to the total number of augmentations generated over 10 steps of LLM2LLM.
4.5.3 From-scratch Fine-tuning vs Continuous Fine-tuning

Another key decision for LLM2LLM is whether to continue fine-tuning from the last iteration's checkpoint (i.e., continuous fine-tuning) or to restart fine-tuning from the pre-trained model at each iteration (i.e., from-scratch fine-tuning). Considering the non-convex nature of the optimization target and complex loss landscapes, this decision is not necessarily obvious. Nevertheless, as shown in Table 5, we observe that from-scratch fine-tuning consistently and significantly outperforms continuous fine-tuning, with up to 9% accuracy improvement. The inferior performance of continuous fine-tuning can be attributed to potential overfitting to the small seed data over multiple iterations of fine-tuning, especially in lower-data regimes where the seed data is small. This can be alleviated by restarting fine-tuning from scratch in each iteration, with sufficient augmented data appended to the seed data to form the training dataset.

Dataset    From-scratch Fine-tuning   Total Aug.   Acc. (%)
GSM8K      ✗                          230          14.71
GSM8K      ✓                          471          23.75
CaseHOLD   ✗                          154          60.50
CaseHOLD   ✓                          198          64.50

Table 5: Ablation study on whether to fine-tune from scratch or to do continuous fine-tuning. From-scratch Fine-tuning refers to whether we fine-tune the base model from scratch vs. fine-tune the previous step's model. Total Aug. refers to the total number of augmented examples generated over 10 steps of LLM2LLM.
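In code, the two strategies differ by a single line of the training loop. The sketch below is our own illustration (with hypothetical finetune and augment callables, in the same style as the sketch in Section 3.1), not released code:

```python
def run_from_scratch(finetune, augment, base, seed, steps):
    """LLM2LLM's choice: restart from the pre-trained base each iteration,
    so repeated passes over the small seed set cannot accumulate."""
    data, model = list(seed), base
    for _ in range(steps):
        model = finetune(base, data)    # discard the previous checkpoint
        data = data + augment(model, data)
    return model

def run_continuous(finetune, augment, base, seed, steps):
    """The weaker baseline: resume from the previous iteration's weights."""
    data, model = list(seed), base
    for _ in range(steps):
        model = finetune(model, data)   # keep fine-tuning the same weights
        data = data + augment(model, data)
    return model
```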
5 Conclusion

We have introduced LLM2LLM, an adaptive and iterative LLM-based data augmentation framework that uses LLMs to scale up smaller fine-tuning datasets in lieu of manually generating more data. This framework substantially reduces the amount
References

Rohan Anil, Andrew M. Dai, Orhan Firat, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. 2021. Ext5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations.

Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. 2023. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.

Yihan Cao, Yanbin Kang, Chi Wang, and Lichao Sun. 2023. Instruction mining: When data mining meets large language model finetuning.

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. 2023. Alpagasus: Training a better alpaca with fewer data.

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. 2024. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.

Claude Coulombe. 2018. Text data augmentation made simple by leveraging nlp cloud apis. arXiv preprint arXiv:1812.04718.

Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li. 2023. Auggpt: Leveraging chatgpt for text data augmentation.

Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. 2023a. Rephrase and respond: Let large language models ask better questions for themselves.

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. 2023b. Gpt-4 turbo v.s. gpt-4 comparison. https://github.com/da03/implicit_chain_of_thought/tree/main/gpt4_baselines.

Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023. Is gpt-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11173–11195.

Jon Durbin. 2023. jondurbin/airoboros-l2-70b-3.1.2. Hugging Face.

Luyang Fang, Gyeong-Geon Lee, and Xiaoming Zhai. 2023. Using gpt-4 to augment unbalanced data for automatic scoring.

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.

Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. 2023a. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, et al. 2023. Crosslingual generalization through multitask finetuning. In Annual Meeting of the Association for Computational Linguistics.

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707.

Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. 2023. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. 2023. Rephrase, augment, reason: Visual grounding of questions for vision-language models.

Arthur L Samuel. 2000. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 44(1.2):206–226.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. 2021. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, et al. 2023. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.

Gerald Tesauro et al. 1995. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Solomon Ubani, Suleyman Olcay Polat, and Rodney Nielsen. 2023. Zeroshotdataaug: Generating and augmenting training data with chatgpt. arXiv preprint arXiv:2304.14334.

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? gpt-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. 2022a. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109.

Yufei Wang, Jiayi Zheng, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, and Daxin Jiang. 2022b. Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp. arXiv preprint arXiv:2206.10265.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China. Association for Computational Linguistics.
A Experimental Setup

A.1 Datasets

We evaluate LLM2LLM on five different datasets that are either multiple-choice or classification tasks and were widely adopted in prior works including (Ubani et al., 2023). Our datasets are as follows:

1. GSM8K: a grade school math word problem dataset that consists of 7.5K train problems and 1K test problems (Cobbe et al., 2021). Figure B.1 shows an example.

2. CaseHOLD: a multiple-choice law dataset that requires one to choose the relevant holding (i.e., the court's determination) of a cited case which backs up the preceding argument (Zheng et al., 2021). Figure B.2 shows an example.

3. SNIPS: a 7-way classification dataset to determine the correct user intent for a voice assistant (Coucke et al., 2018). Figure B.3 shows an example.

4. TREC: a 6-way classification dataset where one must classify the type of text into a category, e.g., abbreviation, location, or numeric value (Li and Roth, 2002). Figure B.4 shows an example.

5. SST-2: a binary classification dataset to decide whether a sentence has positive or negative sentiment (Socher et al., 2013). Figure B.5 shows an example.

For each dataset, we sample between 0.02% and 50% of the total training data and use this as the seed data for each experiment. This allows us to measure how effectively LLM2LLM scales up small task-specific datasets. For consistency, we use identical samples of seed data across different experiments (e.g., 1% on GSM8K) to avoid introducing new randomness with different samples.

In particular, for SNIPS, TREC, and SST-2, we always uniformly sample the same number of examples per class, similar to (Dai et al., 2023; Ubani et al., 2023); a sketch of this class-balanced sampling is shown below. In Table 1, we sample 10, 15, and 20 examples per class to measure the efficacy of LLM2LLM in the extreme low-data regime. These three tasks are relatively simpler than GSM8K and CaseHOLD, and therefore using an extremely small amount of training data is sufficient to achieve exemplary performance.
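The per-class sampling is straightforward; the snippet below is our own illustration of the idea (the paper does not provide code), using a fixed RNG seed so that the same subsample is reused across experiments.

```python
import random
from collections import defaultdict

def sample_per_class(examples, labels, k, seed=0):
    """Uniformly sample k examples from each class, reproducibly."""
    by_class = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_class[y].append(ex)
    rng = random.Random(seed)  # fixed seed -> identical seed data across runs
    subset = []
    for y in sorted(by_class):
        subset.extend(rng.sample(by_class[y], k))
    return subset

# e.g., 10 examples per class for TREC's 6 classes -> 60 seed examples
```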
A.2 Models

For all of our experiments, we use Llama-2-7B (Touvron et al., 2023b) as the student model. We perform minimal prompt tuning for each task, only formatting the data as necessary and not employing many few-shot examples, as can be seen in Figure B.1 and Figure B.2. Using excessive prompting would undermine the benefits of fine-tuning and would muddle the evaluation of the effectiveness of LLM2LLM. Our fine-tuning settings are described in Section A.4.

For our main experiments, we use GPT-3.5 (1106 release) as the teacher model for data generation. In Section 4.4 and Table A.1, we show that our framework can be extended to different teacher models such as the more powerful GPT-4-Turbo (1106 release) model as well as open-source LLMs such as Llama-2-70B-chat (Touvron et al., 2023b) and Airoboros-l2-70b-3.1.2 (Durbin, 2023).

A.3 Baselines and Evaluation

To measure the efficacy of LLM2LLM, we fine-tune the student model using samples of different sizes from each dataset and evaluate on the validation sets of each of these datasets. We then run 10 steps of LLM2LLM, and use the validation sets to select the best-performing model. In Section 4.2, we compare these results against basic fine-tuning on just the seed data.

For GSM8K and TREC, since there is no development set, we choose the best checkpoint's test set results to be representative of the overall improvement. Similarly, for SST-2, since the test set labels are not public, we use the development set results. For all other datasets, we record the test set performance of the checkpoint that performs best on the development set.

For TREC, SST-2, and SNIPS, since these are simple classification tasks, we perform string matching between the generated output and the ground truth after some cleanup. For CaseHOLD, which is a multiple-choice task, we extract the letter of the answer that the model generates. For GSM8K, we use a regular expression extraction based on the answer format that GSM8K provides. Specifically, we extract the number after the #### token.
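As an illustration of this extraction step (the paper does not give the exact regular expression; this is one plausible version):

```python
import re

def extract_gsm8k_answer(generation: str):
    """Pull the final numeric answer that follows the '####' marker."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", generation)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))  # drop thousands separators

assert extract_gsm8k_answer("They made 42/3 = 14 each.\n#### 14") == 14.0
```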
In Section 4.3, we only sample 100 examples from GSM8K and CaseHOLD. For SNIPS, TREC, and SST-2, we sample 10 examples per class.
Dataset   # Seed     Teacher        Total # Aug   Accuracy (%)
GSM8K     74 (1%)    Llama-2-70B    333           11.83
GSM8K     74 (1%)    Airoboros      345           15.01
GSM8K     74 (1%)    GPT-3.5        391           19.56
GSM8K     74 (1%)    GPT-4-Turbo    388           19.79
GSM8K     149 (2%)   Llama-2-70B    661           17.59
GSM8K     149 (2%)   Airoboros      671           19.33
GSM8K     149 (2%)   GPT-3.5        802           25.70
GSM8K     149 (2%)   GPT-4-Turbo    805           25.78
GSM8K     343 (5%)   Llama-2-70B    1308          19.33
GSM8K     343 (5%)   Airoboros      1286          21.76
GSM8K     343 (5%)   GPT-3.5        1641          27.07
GSM8K     343 (5%)   GPT-4-Turbo    1739          28.43

Table A.1: Experiments on how the quality of the teacher model affects the performance of LLM2LLM. For each of these experiments, we only change the teacher model to measure the effect of the teacher model on the final outcome.

instructions for generating new data points from stronger models such as GPT-3.5 and GPT-4. This was necessary as these approaches targeted improving the LLM over a wide range of different tasks. However, for LLM2LLM, we are trying to improve the LLM at domain-specific tasks. Thus, for each task, the system prompt that we give to the teacher model differs on a per-task basis. This allows the user to inject and leverage domain-specific knowledge about the task in the dataset generation procedure, creating higher-quality fine-tuning data. In practice, we also use in-context learning with few-shot prompting to bootstrap the teacher model's ability to generate relevant questions.

The detailed system prompt and in-context examples for each dataset are provided below:

1. GSM8K: System (Figure B.6) and In-Context Examples (Figure B.7)

2. CaseHOLD: System (Figure B.8) and In-Context Examples (Figure B.9)

3. SNIPS: System (Figure B.10) and In-Context Examples (Figure B.11)

4. TREC: System (Figure B.12) and In-Context Examples (Figure B.13)

5. SST-2: System (Figure B.14) and In-Context Examples (Figure B.15)

B.5 Training and Data Generation Costs

In Table B.2, we report the training and data generation costs to perform LLM2LLM. This includes the cost of generating new data from OpenAI as well as the amount of GPU hours required to train and evaluate the student models. We measured these numbers using 4x A100-80GB PCIe NVIDIA GPUs. As we can see, generating data for LLM2LLM costs relatively little compared to the cost of collecting new data points manually. Furthermore, the process of fine-tuning the student model also finishes in a reasonable amount of time.

Dataset    % Data   Cost ($)   Time (Hours)
GSM8K      1%       0.35       3.28
GSM8K      5%       1.48       9.07
GSM8K      10%      3.64       14.54
CaseHOLD   1%       1.50       6.68
CaseHOLD   5%       0.84       16.87
CaseHOLD   10%      2.19       31.95
SNIPS      0.5%     0.02       0.85
SNIPS      0.8%     0.05       1.29
SNIPS      1%       0.05       1.40
TREC       1.1%     0.05       0.67
TREC       1.6%     0.01       0.44
TREC       2.2%     0.02       0.61
SST-2      0.02%    0.01       0.54
SST-2      0.04%    0.01       0.80
SST-2      0.06%    0.00       0.64

Table B.2: Training and data generation costs of LLM2LLM. The first and second columns indicate the dataset and the percentage of the training data used as initial seed data for that experiment. The third column indicates the total cost to generate the data from the GPT-3.5 teacher model. The fourth column shows the total time in hours to train and evaluate the student model. As we can see, data generation costs for LLM2LLM are relatively small compared to the cost of manually curating new data. Furthermore, fine-tuning and evaluation of the student model finishes in a reasonable time.

B.6 Decontamination Experiments

When using an LLM to generate data, there are potential concerns of data contamination, i.e., when
GSM8K Example:

Victor, Austin, and Brian made traps to catch shrimp. Victor’s trap caught 26 shrimp
and Austin’s trap caught 8 less than Victor’s. Brian’s trap caught half of Victor and Austin’s total
number of shrimp. If the boys then sold their shrimp for $7 for every 11 tails of shrimp and then
divided their earnings equally amongst themselves, how much money does each boy make?

Austin’s trap caught 26 - 8 = «26-8=18»18 shrimp.


Together, Victor and Austin’s traps caught 18 + 26 = «18+26=44»44 shrimp.
Brian’s trap caught 44/2 = «44/2=22»22 shrimp
In total, they caught 26 + 18 + 22 = «26+18+22=66»66 shrimp.
They were able to sell 66/11 = «66/11=6»6 sets of shrimp.
They made a total of 6 x 7 =«6*7=42»42
Each boy made 42/3 =«42/3=14»14
#### 14

Figure B.1: Formatted example from GSM8K.

CaseHOLD Example:

The following context is from a judicial decision where the holding statement has been
masked out as <HOLDING>.

Context: from behind the bench in the robe is protected by the First Amendment, even if his use of the trappings of judicial office were not protected by First Amendment); Halleck v. Berlinger, 427 F. Supp. 1225, 1241 (D.D.C. 1977) (applying the First Amendment in disciplinary proceeding to comments made from the bench, but finding the particular comments outside of its protection); Mississippi Comm'n on Judicial Performance v. Boland, 975 So.2d 882, 891-92 (Miss. 2008) (applying First Amendment to a judge acting in her "capacity as a justice court judge" at a conference seeking certification to start a drug court, but held that First Amendment did not apply because judge's insulting comments were not matters of "legitimate public concern."); In re Rome, 218 Kan. 198, 542 P.2d 676, 684 (1975) (<HOLDING>). 11 As indicated in the Gentile syllabus,

Please select the correct holding statement from the options below.

A. holding that free speech protection of new jersey constitution requires subject to reasonable restrictions privately owned shopping centers to permit speech on political and societal issues on premises unlike first amendment of federal constitution
B. recognizing that code is speech
C. holding that first amendment protections apply to compelled speech as well as restrictions on speech
D. holding that although a judge has the right of free speech any restrictions placed by the code of professional responsibility are acceptable limits and prevent the first amendment from exempting a judge from discipline for proven judicial misconduct
E. holding that the first amendment limits judicial discretion to seal documents in a civil case

Figure B.2: Formatted example from CaseHOLD.


System:

You are a educational A.I. whose purpose is to take math problems that students get
wrong and generate new problems to help them practice their mathematical skills. Your goal is to
generate a set of new math problems that reflect the different skills and techniques found in the
example problem.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The math problem should be in English.
3. The output should be an appropriate response to the question. Make sure the output is less than
100 words.
4. The answer to the problem should be expressed as a number, not a fraction. For example, if the
answer is one-half, return 0.5, not 1/2 or "one half".
5. The answer to the problem should not have units i.e. if the answer is 6 cups, just write 6 as the
[ANSWER]
6. Always include some calculation to show your work for how you got your ANSWER.
7. Don’t make any mathematical mistakes of your own!
8. Try not to copy too much information from the original problem. If you must, try and replace
names and numbers so that we can test the student’s understanding, rather than their ability to
memorize previous test questions.

Always return your instructions in the form:


1. Question: [QUESTION]
Answer: [CALCULATION]
#### [ANSWER]

Figure B.6: System Prompt for GSM8K Generation
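In practice, a single teacher request combines a task-specific system prompt like the one in Figure B.6 above with the in-context user/assistant example pairs and the wrong example to be augmented. The sketch below is our own illustration of that assembly for an OpenAI-style chat API; the helper name, message layout, and wording of the final user turn are assumptions, not code released with the paper.

```python
def build_teacher_messages(system_prompt, incontext_pairs, wrong_example):
    """Assemble a chat request: task-specific system prompt, few-shot
    user/assistant pairs (e.g., Figure B.15), then the student's miss."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_msg, assistant_msg in incontext_pairs:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    # Ask for one new, conceptually similar question per wrong example.
    messages.append({
        "role": "user",
        "content": (
            "The following is a problem that the student answered "
            f"incorrectly, including the correct answer:\n{wrong_example}\n\n"
            "Generate 1 more similar problem with the same answer format."
        ),
    })
    return messages
```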


System:

You are LawGPT, an AI agent who knows everything there is to know about U.S. law.
You know the result of every court case and you know every law in the lawbook.
The user is trying to choose the correct holding of the case given the context and argument of the
court.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The context, holding, and options should be in english.
3. The questions that you generate should test for whether the user understands the case names and
their holdings and whether the user can re-frame relevant holdings to backup the argument in the
context.
4. The context should always end with a citation such as "See United States v. Newman, 125 F.3d
863 (10th Cir.1997) (unpublished) (<HOLDING>); United States v. Dodge, 846 F.Supp. 181,"
5. The citation absolutely needs to have the mask phrase <HOLDING> which is the place where
the legal holding would normally be.
6. The questions should always be multiple choice.
7. There should always be 5 options: 1 options should be a holding that backs up the argument in
the context, the other 4 should be sufficiently different. Each option has to start with the word
"holding"
8. There can only be 1 answer: A, B, C, D, or E.
9. Don’t make any mistakes matching the holdings yourself.
10. Try not to copy too much information from the original problem. You don’t want the user to
just memorize their answer.
11. Make the context similar to the context in question, make sure that the holding that is being
tested is the same.
12. The wrong answer choices can be any other reasonable holding, but it should be sufficiently
different from the correct answer.
13. Do not make your context too short. Remember, these arguments in the context are being
made by judges and should look like they were written by a judge.

Always return your instructions in the form:


1. Context: [CONTEXT]

Please select the correct holding statement from the options below.

A. [OPTION 1]
B. [OPTION 2]
C. [OPTION 3]
D. [OPTION 4]
E. [OPTION 5]
Answer: [ANSWER]

Figure B.8: System Prompt for CaseHOLD Generation


System:
You are TranscriptGPT, an AI agent who knows the intent of the transcript of different questions.
You are training someone how to identify people’s intents from what they have said.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The question and options should be in english.
3. The questions that you generate should have only 1 of the following intents:
- AddToPlaylist
- BookRestaurant
- GetWeather
- PlayMusic
- RateBook
- SearchCreativeWork
- SearchScreeningEvent
4. The questions should always have 1 specific intent.
5. The intent of the question must come from the list above.
6. Don’t make any mistakes with your answer yourself.
7. Try not to copy too much information from the original problem. You don’t want the user to just
memorize the practice problems.
8. Make the intent the same as the question that the user got wrong.
9. The wrong answer choices can be any other reasonable answer, but it should be sufficiently
different from the correct answer.
10. The transcript should be something that an ASR model could output: it must sound like
something a human could say.

Always return your instructions in the form:


1. Transcript: [CONTEXT]
Intent: [INTENT]

Figure B.10: System Prompt for SNIPS Generation


System:

You are QuestionGPT, an AI agent who knows the class of different question.
You are training someone how to classify different questions based on what the questions are
asking form.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The question should be in english.
3. The questions that you generate should have only 1 of the following intents:
- ABBR (Abbreviation)
- ENTY (Entity)
- DESC (Description/Concept)
- HUM (Human)
- LOC (Location)
- NUM (Number)
4. The questions should always have 1 specific class.
5. The intent of the question must come from the list above.
6. Don’t make any mistakes with your answer yourself.
7. Try not to copy too much information from the original problem. You don’t want the user to just
memorize the practice problems.
8. Make the class the same as the question that the user got wrong.
9. The question should be something that an ASR model could output: it must sound like
something a human could say.

Always return your instructions in the form:


1. Question: [CONTEXT]
Class: [INTENT]

Figure B.12: System Prompt for TREC Generation


User:

The following is a movie review that the user classified incorrectly including the correct
classification:
Classify the following movie review as positive or negative: as they come , already having been
recycled more times than i ’d care to count
Sentiment: negative

Generate 1 more similar movie review with the same class.

Assistant:

Here’s a similar question with the same class:

1. Review: Feels like a reheated plot, utterly predictable and uninspired.


Sentiment: negative

Figure B.15: In-Context Example for SST-2 Generation
