
of task-specific data. Often, collecting, cleaning, and labeling additional data can be costly and time-consuming. So the key question is: how should we increase the user's training data to be enough for fine-tuning?

Data augmentation is a known method that can help effectively expand the training dataset. For natural language processing (NLP) tasks, one can use approaches such as synonym replacement, character replacement (e.g., by intentionally introducing spelling errors), random swapping, and back translation, just to name a few (Wei and Zou, 2019; Belinkov and Bisk, 2017; Coulombe, 2018; Zhang et al., 2018). However, these approaches fail to effectively expand the training data for fine-tuning LLMs in the case of new and specialized tasks, as we will show later in Section 4.3.
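To make these classical baselines concrete, the following is a minimal Python sketch of two of them (random swap and character-level noise), plus a toy synonym replacement in the spirit of EDA (Wei and Zou, 2019). The function names and the tiny synonym table are our own illustration, not code from the paper.

```python
import random

# A toy synonym table for illustration; real implementations (e.g., EDA)
# draw synonyms from a lexical resource such as WordNet.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "joyful"]}

def synonym_replacement(words, n=1):
    """Replace up to n words that have an entry in the synonym table."""
    out = words[:]
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_swap(words, n=1):
    """Swap two random positions, n times."""
    out = words[:]
    for _ in range(n):
        i, j = random.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out

def char_noise(word, p=0.1):
    """Introduce spelling errors by dropping characters with probability p."""
    return "".join(c for c in word if random.random() > p)

sentence = "the quick brown fox is happy".split()
print(" ".join(synonym_replacement(sentence)))
print(" ".join(random_swap(sentence)))
print(" ".join(char_noise(w) for w in sentence))
```

Such transformations paraphrase existing examples but cannot add new task knowledge, which is why they fall short on specialized tasks.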
To address this, several recent papers have explored using an LLM to expand the fine-tuning dataset (Dai et al., 2023; Kumar et al., 2020; Zhou et al., 2023; Chen et al., 2023; Cao et al., 2023; Wei et al., 2023; Zhu et al., 2023). This approach has proven to be more effective than traditional data augmentation methods. However, these approaches often apply LLM-based data augmentation to all of the available training dataset, without considering the LLM's prediction accuracy on individual training data points. We have observed that for various reasoning tasks such as arithmetic and reading comprehension, the LLM correctly solves simpler examples in the fine-tuning dataset, but may struggle with harder examples. It is sub-optimal to keep augmenting data points on which the LLM is already achieving high accuracy.

To address these challenges, we introduce LLM2LLM, a new data augmentation framework that uses a teacher LLM to expand the training dataset with a targeted and iterative approach. In more detail, we make the following contributions:

• We propose LLM2LLM, a targeted and iterative LLM-based data augmentation technique that efficiently and effectively augments small task-specific datasets. LLM2LLM achieves this by (1) fine-tuning a student LLM on the initial dataset, (2) evaluating on the training data and extracting data points which the model got incorrect after training, and (3) using a Self-Instruct (Wang et al., 2023) style data augmentation to augment these data points, which are then added back into the training data (Section 3.1).

• We benchmark LLM2LLM on randomly sampled subsets of GSM8K (Cobbe et al., 2021), CaseHOLD (Zheng et al., 2021), SNIPS (Coucke et al., 2018), TREC (Li and Roth, 2002) and SST-2 (Socher et al., 2013) in order to evaluate the effectiveness of our approach in the low-data regime (Section 4.2). Here, we get up to a 24.2% improvement on GSM8K, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2 (Table 1).

• We conduct a series of ablation studies comparing LLM2LLM to several existing baselines as well as to variants of LLM2LLM to evaluate the effectiveness of our design decisions (Section 4.5). We observe that both the iterative and targeted nature of LLM2LLM are critical to improving model performance.

2 Background and Related Work

2.1 Instruction Following LLMs

The earliest works (Wei et al., 2021; Longpre et al., 2023; Chung et al., 2022; Aribandi et al., 2021; Sanh et al., 2021; Muennighoff et al., 2023; Wang et al., 2022b; Mishra et al., 2022; Wang et al., 2022a; Xu et al., 2022) in instruction fine-tuning involved gathering and processing different existing NLP datasets in order to improve the performance of LLMs on a wide range of tasks. Self-Instruct (Wang et al., 2023) removed the reliance on existing datasets by introducing a framework for bootstrapping instruction datasets with the outputs of the model itself. Follow-up work (Ouyang et al., 2022; Taori et al., 2023; Geng et al., 2023; Chiang et al., 2023; Xu et al., 2023; Mukherjee et al., 2023; Mitra et al., 2023; Kang et al., 2023; Nori et al., 2023) took advantage of stronger models (Achiam et al., 2023; Touvron et al., 2023a,b) in order to fine-tune stronger general-purpose instruction-following models.

2.2 Self-Improving LLMs

Various early works (Zelikman et al., 2023; Haluptzok et al., 2023; Zelikman et al., 2022; Madaan et al., 2023; Gulcehre et al., 2023; Singh et al., 2023) explore using self-improvement for fine-tuning LLMs. These works generally filtered the outputs of the model before fine-tuning it on its own outputs. LLM2LLM differs from these methods, as we do not directly fine-tune on the outputs of our own model, and we employ a teacher model to provide feedback in the form of synthetic data.
Algorithm 1 LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement. Given a seed dataset D^0, we finetune the model M^i_student, evaluate, and extract training set data points that the model gets wrong. These are used to generate new training data points using the teacher model M_teacher for the next step.

1: procedure LLM2LLM(M^0_student, M_teacher, D^0)
2:   i ← 0
3:   while i < n do
4:     M^i_student ← Finetune(M^0_student, D^i)
5:     E^i ← Evaluate(M^i_student, D^0)    ▷ Evaluate on seed data
6:     W^i ← Filter(E^i, D^0)              ▷ Keep wrong answers
7:     A^i ← Generate(M_teacher, W^i)      ▷ Augment using teacher
8:     D^{i+1} ← D^i + A^i                 ▷ Append to data
9:     i ← i + 1
10:  end while
11:  Evaluate M_student
12: end procedure
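For readers who prefer runnable code, the following is a minimal Python sketch of this loop. The finetune, evaluate, and generate callables are hypothetical stand-ins for the fine-tuning, evaluation, and teacher-prompting steps; this is our own illustration, not an implementation released with the paper.

```python
def llm2llm(finetune, evaluate, generate, seed_data, num_steps):
    """Minimal sketch of Algorithm 1.

    finetune(data) -> model:   fine-tunes the base student (e.g., Llama-2-7B)
                               from scratch on `data` (see Section 4.5.3).
    evaluate(model, ex) -> bool:  True if the student answers `ex` correctly.
    generate(ex) -> example:   asks the teacher for a conceptually similar
                               but semantically different example (Section B.4).
    """
    data = list(seed_data)                     # D^0
    model = finetune(data)                     # line 4, i = 0
    for _ in range(num_steps):
        # Lines 5-6: evaluate on the seed data only; keep the wrong answers.
        wrong = [ex for ex in seed_data if not evaluate(model, ex)]
        if not wrong:
            break                              # nothing left to target
        # Lines 7-8: one new teacher example per wrong answer, appended to D.
        data += [generate(ex) for ex in wrong]
        model = finetune(data)                 # line 4 of the next iteration
    return model
```

Note that each finetune call here restarts from the base checkpoint, matching the from-scratch strategy the paper later finds to work best.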
seed dataset D, where D potentially has unseen characteristics compared to the pre-training dataset (e.g., a medical dataset with specific terminology, or a private database with specific characteristics). In this case, the model's zero-shot or fine-tuned performance is likely to be unsatisfactory. While strategies to address this challenge have been explored, e.g., through enhanced few-shot learning methods as discussed in Section 2, here we strictly focus on enriching the provided target dataset D with an LLM. This method is orthogonal to the aforementioned techniques, offering a complementary solution that can be applied alongside them.

To enrich D, AugGPT (Dai et al., 2023) has introduced a promising approach that generates additional augmented data by applying a prompted LLM to all available data points in the target training dataset. However, this method falls short by indiscriminately augmenting data without considering the student model's varying performance across different data points. For instance, the model may easily solve the majority of the dataset, but it may struggle with a small subset of more challenging examples. In this case, rather than indiscriminately expanding the dataset by replicating simpler cases, a better augmentation strategy would be to generate more data points that align conceptually with these challenging examples. This is because the former approach could lead to longer training time without noticeable performance improvement.

Here, we propose a more general formulation of an LLM-based data augmentation pipeline that addresses the aforementioned limitation. To do so, we consider the following iterative process:

D^{n+1} = f(M_teacher, M_student, D^n, ..., D^0).    (1)

In Equation (1), M_teacher is the teacher model, M_student is the student model (potentially being fine-tuned over many iterations), n refers to the nth step of data augmentation, D^{n+1} is the new training dataset at the next iteration, and f is the data-generation algorithm. At each step, the teacher model has access to how the student model performs at the nth step (e.g., correct/incorrect labels, or possibly prediction distributions for white-box models), and based on that it can edit training data points for the next iteration.

Note that LLM2LLM is different from knowledge distillation (Hinton et al., 2015). Knowledge distillation is generally applicable to cases where the teacher model has high accuracy on the target data. In contrast, in this case, it is possible that the teacher model also performs sub-optimally on the target data (e.g., in the private database case, where the teacher lacks domain-specific knowledge). However, if the teacher model has enough reasoning capability to produce conceptually similar but semantically different examples when it is given both the prompt and answer, then our framework can improve performance.

In LLM2LLM, we consider a specific instantiation of Equation (1), as discussed next.

3.1 LLM2LLM

The end-to-end algorithm of LLM2LLM is presented in Algorithm 1. Inspired by Self-Instruct (Wang et al., 2023), we use the teacher model M_teacher to generate synthetic data from the data points that the model got incorrect during training in order to target these deficiencies in the student model. In more detail, we first train the baseline student model M_student on the provided target data D^0, and we evaluate its performance (lines 4-5 of Algorithm 1). We then filter the results and keep the incorrect training examples that the student model struggled to answer correctly (W^i in line 6). Then the teacher model is prompted to create additional training data points that are conceptually aligned but semantically different (line 7; see Section B.4 for specifics on the prompt). The teacher model does not necessarily need to be bigger, although that could potentially improve performance. The primary requirement for the teacher model is to have reasoning capability to be able to follow the
[Figure 2: two panels plotting test accuracy (%) against total data size (seed + LLM2LLM data). Left panel: GSM8K with 74 (1%), 149 (2%), 373 (5%), and 747 (10%) seed examples. Right panel: CaseHOLD with 225 (0.5%), 450 (1%), 900 (2%), and 2250 (5%) seed examples.]
Figure 2: LLM2LLM on GSM8K (left) and CaseHOLD (right) with various seed data sizes. Each line shows the test accuracy of the finetuned Llama-2-7B model at each step of LLM2LLM for a given seed dataset size. The first (left-most) data point on each line represents finetuning only on the seed data. Each point afterward
corresponds to the performance after one more iteration of LLM2LLM. The total data size (x-axis) represents the
total amount of seed plus LLM2LLM data that was used to train the model at that step. By applying LLM2LLM
with low amounts of seed data and iteratively improving the training dataset, we can attain significant performance
improvements. In particular, we can see that running LLM2LLM can match or even exceed the performance of
simply annotating more real data in some cases (detailed breakdown provided in Table 1).

W^i during dataset generation to only data from the original seed data allows us to bound the total number of training data points to

|D^j| = n + Σ_{i=0}^{j} n·p_i ≤ n·(1 + j·p_max),

where n is the number of seed examples and p_i is the fraction of them that the student answers incorrectly at step i. This bound grows only linearly with the number of steps. The empirical evaluations shown in Section 4.5.2 (Table 4) corroborate this.
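As a quick illustration of this bound (the numbers here are ours, not from the paper): with n = 100 seed examples and a worst-case wrong-answer fraction of p_max = 0.3, j = 10 steps of LLM2LLM can add at most 10 · 100 · 0.3 = 300 synthetic examples, for a total of 100 · (1 + 10 · 0.3) = 400 training points, i.e., linear rather than exponential growth in the number of steps.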
4 Results

4.1 Experimental Setup

To evaluate the performance of LLM2LLM, we applied our framework to fine-tune Llama-2-7B on various target datasets, including GSM8K (Cobbe et al., 2021), CaseHOLD (Zheng et al., 2021), SNIPS (Coucke et al., 2018), TREC (Li and Roth, 2002) and SST-2 (Socher et al., 2013). We subsampled these datasets with different sampling rates from 0.02% to 50% to evaluate performance across different low-data regimes. Our teacher model for these results is GPT-3.5 (1106 release) unless otherwise specified. We consider several other teacher models, including GPT-4-Turbo, Llama-2-70B (Touvron et al., 2023b), and Airoboros-L2-70B (Durbin, 2023), in Section 4.4. We include a more detailed experimental setup in Section A. Additionally, we conducted experiments (Section B.6) to ensure that the augmented data from the teacher models differs from the test dataset used to evaluate final model accuracy. This addresses the issue of potential test data leakage that could have occurred if the teacher model had been trained on similar data.

4.2 Main Results

Here, we discuss LLM2LLM's performance with varying amounts of training data by presenting results for fine-tuning Llama-2-7B on GSM8K using GPT-3.5 as the teacher model. We then discuss how these trends extend to different datasets (Table 1).

The final model accuracy after applying 10 iterations of LLM2LLM is given in Table 1. For a low-data regime with 74 available examples (i.e., 1% of the GSM8K training dataset), vanilla fine-tuning achieves only 0.99% test accuracy. However, LLM2LLM boosts the accuracy to 19.56% by iteratively generating 391 additional examples based on data points where the model makes mistakes. With slightly more available data of 149 seed examples (i.e., 2% of the training dataset), we can achieve 25.70% accuracy. As shown by the baseline accuracy with 20% data in Table 2, we would need over 10× more training data points to match this accuracy if we relied only on vanilla fine-tuning. We also highlight that LLM2LLM can lead to noticeable gains in data-sufficient regimes (e.g., 100% data), albeit with a smaller improvement over the baseline compared to lower-data regimes.

We observe a similar trend for CaseHOLD, SNIPS, TREC, and SST-2, where LLM2LLM helps improve performance in the low-data regime.
a single iteration, for both the GSM8K and CaseHOLD datasets. As shown in Table 3, using a single augmentation step with a larger amount of augmented data significantly underperforms the alternative of executing 10 iterative steps of LLM2LLM with a smaller number of augmentations per iteration. In particular, on GSM8K, augmenting one data point per example over 10 steps yields a 7.4% higher accuracy than augmenting five data points per example in a single step. Similarly, on CaseHOLD, iterative augmentation of one data point per example over 10 steps results in a 4.6% improvement over a one-shot augmentation with four data points per example. This justifies LLM2LLM's iterative augmentation approach that generates one data point per incorrectly answered example.

Dataset    Steps           Total Aug.   Acc. (%)
GSM8K      1 (one-shot)    490          16.30
GSM8K      10 (iterative)  471          23.73
CaseHOLD   1 (one-shot)    276          59.94
CaseHOLD   10 (iterative)  198          64.50

Table 3: Ablation on the iterative nature of LLM2LLM with 100 seed data points. Steps refers to the total number of augmentation steps in LLM2LLM. For the case of 1 iteration, we prompt the teacher model to generate more samples all at once, whereas in the 10 steps case the teacher model only generates 1 new data point per wrong example. The results clearly show that the latter iterative approach results in better performance.

4.5.2 Data Augmentation with Seed Data vs Augmented Data

In each iteration, LLM2LLM evaluates the student model's performance only on the original seed dataset and generates augmented data from incorrect seed examples. However, a possible alternative is performing evaluation and data augmentation using both seed and previously augmented data. The latter often leads to sub-optimal performance as well as excessive amounts of total augmented data points, as we demonstrate in Table 4. On GSM8K, generating augmented data from the previous iteration's augmented data yields 18.3% accuracy, while using the seed data for further augmentation improves the accuracy to 23.75%. We observe a similar trend for CaseHOLD. As discussed in Section 3.1, a potential reason for the performance drop when using augmented data for further augmentation is a deviation from the original data distribution.

Dataset    Only Aug. Seed Data   Total Aug.   Acc. (%)
GSM8K      ✗                     4302         18.32
GSM8K      ✓                     471          23.75
CaseHOLD   ✗                     351          63.75
CaseHOLD   ✓                     198          64.50

Table 4: Ablation study on whether to augment previously generated LLM2LLM data. Only Aug. Seed Data refers to augmenting only the seed data vs. also re-augmenting the augmented data. Total Aug. refers to the total number of augmentations generated over 10 steps of LLM2LLM.
4.5.3 From-scratch Fine-tuning vs Continuous Fine-tuning

Another key decision for LLM2LLM is whether to continue fine-tuning from the last iteration's checkpoint (i.e., continuous fine-tuning) or to restart fine-tuning from the pre-trained model at each iteration (i.e., from-scratch fine-tuning). Considering the non-convex nature of the optimization target and complex loss landscapes, this decision is not necessarily obvious. Nevertheless, as shown in Table 5, we observe that from-scratch fine-tuning consistently and significantly outperforms continuous fine-tuning, with up to 9% accuracy improvement. The inferior performance of continuous fine-tuning can be attributed to potential overfitting to the small seed data over multiple iterations of fine-tuning, especially in lower-data regimes where the seed data is small. This can be alleviated by restarting fine-tuning from scratch in each iteration, with sufficient augmented data appended to the seed data to form the training dataset.

Dataset    From-scratch Fine-tuning   Total Aug.   Acc. (%)
GSM8K      ✗                          230          14.71
GSM8K      ✓                          471          23.75
CaseHOLD   ✗                          154          60.50
CaseHOLD   ✓                          198          64.50

Table 5: Ablation study on whether to fine-tune from scratch or to do continuous fine-tuning. From-scratch Fine-tuning refers to whether we fine-tune the base model from scratch vs. fine-tune the previous step's model. Total Aug. refers to the total number of augmented examples generated over 10 steps of LLM2LLM.
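In code, the two strategies differ by a single line of the training loop. The sketch below is our own illustration (with hypothetical finetune and augment callables, in the same style as the sketch in Section 3.1), not released code:

```python
def run_from_scratch(finetune, augment, base, seed, steps):
    """LLM2LLM's choice: restart from the pre-trained base each iteration,
    so repeated passes over the small seed set cannot accumulate."""
    data, model = list(seed), base
    for _ in range(steps):
        model = finetune(base, data)    # discard the previous checkpoint
        data = data + augment(model, data)
    return model

def run_continuous(finetune, augment, base, seed, steps):
    """The weaker baseline: resume from the previous iteration's weights."""
    data, model = list(seed), base
    for _ in range(steps):
        model = finetune(model, data)   # keep fine-tuning the same weights
        data = data + augment(model, data)
    return model
```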
5 Conclusion

We have introduced LLM2LLM, an adaptive and iterative LLM-based data augmentation framework that uses LLMs to scale up smaller fine-tuning datasets in lieu of manually generating more data. This framework substantially reduces the amount
References

Rohan Anil, Andrew M. Dai, Orhan Firat, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.

Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. 2021. Ext5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations.

Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. 2023. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.

Yihan Cao, Yanbin Kang, Chi Wang, and Lichao Sun. 2023. Instruction mining: When data mining meets large language model finetuning.

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. 2023. Alpagasus: Training a better alpaca with fewer data.

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. 2024. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.

Claude Coulombe. 2018. Text data augmentation made simple by leveraging nlp cloud apis. arXiv preprint arXiv:1812.04718.

Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li. 2023. Auggpt: Leveraging chatgpt for text data augmentation.

Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. 2023a. Rephrase and respond: Let large language models ask better questions for themselves.

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. 2023b. Gpt-4 turbo v.s. gpt-4 comparison. https://github.com/da03/implicit_chain_of_thought/tree/main/gpt4_baselines.

Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023. Is gpt-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11173–11195.

Jon Durbin. 2023. jondurbin/airoboros-l2-70b-3.1.2. Hugging Face.

Luyang Fang, Gyeong-Geon Lee, and Xiaoming Zhai. 2023. Using gpt-4 to augment unbalanced data for automatic scoring.

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.

Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. 2023a. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, et al. 2023. Crosslingual generalization through multitask finetuning. In Annual Meeting of the Association for Computational Linguistics.

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707.

Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. 2023. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. 2023. Rephrase, augment, reason: Visual grounding of questions for vision-language models.

Arthur L Samuel. 2000. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 44(1.2):206–226.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. 2021. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, et al. 2023. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.

Gerald Tesauro et al. 1995. Temporal difference learning and td-gammon. Communications of the ACM, 38(3):58–68.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Solomon Ubani, Suleyman Olcay Polat, and Rodney Nielsen. 2023. Zeroshotdataaug: Generating and augmenting training data with chatgpt. arXiv preprint arXiv:2304.14334.

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? gpt-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. 2022a. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109.

Yufei Wang, Jiayi Zheng, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, and Daxin Jiang. 2022b. Knowda: All-in-one knowledge mixture model for data augmentation in few-shot nlp. arXiv preprint arXiv:2206.10265.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China. Association for Computational Linguistics.
A Experimental Setup

A.1 Datasets

We evaluate LLM2LLM on five different datasets that are either multiple-choice or classification tasks and were widely adopted in prior works including (Ubani et al., 2023). Our datasets are as follows:

1. GSM8K: a grade school math word problem dataset that consists of 7.5K train problems and 1K test problems (Cobbe et al., 2021). Figure B.1 shows an example.

2. CaseHOLD: a multiple-choice law dataset that requires one to choose the relevant holding (i.e., the court's determination) of a cited case which backs up the preceding argument (Zheng et al., 2021). Figure B.2 shows an example.

3. SNIPS: a 7-way classification dataset to determine the correct user intent for a voice assistant (Coucke et al., 2018). Figure B.3 shows an example.

4. TREC: a 6-way classification dataset where one must classify the type of text into a category, e.g., abbreviation, location, or numeric value (Li and Roth, 2002). Figure B.4 shows an example.

5. SST-2: a binary classification dataset to decide whether a sentence has positive or negative sentiment (Socher et al., 2013). Figure B.5 shows an example.

For each dataset, we sample between 0.02% and 50% of the total training data and use this as the seed data for each experiment. This allows us to measure how effectively LLM2LLM scales up small task-specific datasets. For consistency, we use identical samples of seed data across different experiments (e.g., 1% on GSM8K) to avoid introducing new randomness with different samples.

In particular, for SNIPS, TREC, and SST-2, we always uniformly sample the same number of examples per class, similar to (Dai et al., 2023; Ubani et al., 2023); a sketch of this class-balanced sampling is shown below. In Table 1, we sample 10, 15, and 20 examples per class to measure the efficacy of LLM2LLM in the extreme low-data regime. These three tasks are relatively simpler than GSM8K and CaseHOLD, and therefore using an extremely small amount of training data is sufficient to achieve exemplary performance.
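The per-class sampling is straightforward; the snippet below is our own illustration of the idea (the paper does not provide code), using a fixed RNG seed so that the same subsample is reused across experiments.

```python
import random
from collections import defaultdict

def sample_per_class(examples, labels, k, seed=0):
    """Uniformly sample k examples from each class, reproducibly."""
    by_class = defaultdict(list)
    for ex, y in zip(examples, labels):
        by_class[y].append(ex)
    rng = random.Random(seed)  # fixed seed -> identical seed data across runs
    subset = []
    for y in sorted(by_class):
        subset.extend(rng.sample(by_class[y], k))
    return subset

# e.g., 10 examples per class for TREC's 6 classes -> 60 seed examples
```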
A.2 Models

For all of our experiments, we use Llama-2-7B (Touvron et al., 2023b) as the student model. We perform minimal prompt tuning for each task, only formatting the data as necessary and not employing many few-shot examples, as can be seen in Figure B.1 and Figure B.2. Using excessive prompting would undermine the benefits of fine-tuning and would muddle the evaluation of the effectiveness of LLM2LLM. Our fine-tuning settings are described in Section A.4.

For our main experiments, we use GPT-3.5 (1106 release) as the teacher model for data generation. In Section 4.4 and Table A.1, we show that our framework can be extended to different teacher models such as the more powerful GPT-4-Turbo (1106 release) model as well as open-source LLMs such as Llama-2-70B-chat (Touvron et al., 2023b) and Airoboros-l2-70b-3.1.2 (Durbin, 2023).

A.3 Baselines and Evaluation

To measure the efficacy of LLM2LLM, we fine-tune the student model using samples of different sizes from each dataset and evaluate on the validation sets of each of these datasets. We then run 10 steps of LLM2LLM, and use the validation sets to select the best-performing model. In Section 4.2, we compare these results against basic fine-tuning on just the seed data.

For GSM8K and TREC, since there is no development set, we choose the best checkpoint's test set results to be representative of the overall improvement. Similarly, for SST-2, since the test set labels are not public, we use the development set results. For all other datasets, we record the test set performance of the checkpoint that performs best on the development set.

For TREC, SST-2, and SNIPS, since these are simple classification tasks, we perform string matching between the generated output and the ground truth after some cleanup. For CaseHOLD, which is a multiple-choice task, we extract the letter of the answer that the model generates. For GSM8K, we use a regular expression extraction based on the answer format that GSM8K provides. Specifically, we extract the number after the #### token.
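As an illustration of this extraction step (the paper does not give the exact regular expression; this is one plausible version):

```python
import re

def extract_gsm8k_answer(generation: str):
    """Pull the final numeric answer that follows the '####' marker."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", generation)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))  # drop thousands separators

assert extract_gsm8k_answer("They made 42/3 = 14 each.\n#### 14") == 14.0
```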
In Section 4.3, we only sample 100 examples from GSM8K and CaseHOLD. For SNIPS, TREC, and SST-2, we sample 10 examples per class.
Dataset   # Seed     Teacher        Total # Aug   Accuracy (%)
GSM8K     74 (1%)    Llama-2-70B    333           11.83
GSM8K     74 (1%)    Airoboros      345           15.01
GSM8K     74 (1%)    GPT-3.5        391           19.56
GSM8K     74 (1%)    GPT-4-Turbo    388           19.79
GSM8K     149 (2%)   Llama-2-70B    661           17.59
GSM8K     149 (2%)   Airoboros      671           19.33
GSM8K     149 (2%)   GPT-3.5        802           25.70
GSM8K     149 (2%)   GPT-4-Turbo    805           25.78
GSM8K     343 (5%)   Llama-2-70B    1308          19.33
GSM8K     343 (5%)   Airoboros      1286          21.76
GSM8K     343 (5%)   GPT-3.5        1641          27.07
GSM8K     343 (5%)   GPT-4-Turbo    1739          28.43

Table A.1: Experiments on how the quality of the teacher model affects the performance of LLM2LLM. For each of these experiments, we only change the teacher model to measure the effect of the teacher model on the final outcome.

instructions for generating new data points from stronger models such as GPT-3.5 and GPT-4. This was necessary as these approaches targeted improving the LLM over a wide range of different tasks. However, for LLM2LLM, we are trying to improve the LLM at domain-specific tasks. Thus, for each task, the system prompt that we give to the teacher model differs on a per-task basis. This allows the user to inject and leverage domain-specific knowledge about the task in the dataset generation procedure, creating higher-quality fine-tuning data. In practice, we also use in-context learning with few-shot prompting to bootstrap the teacher model's ability to generate relevant questions.

The detailed system prompt and in-context examples for each dataset are provided below:

1. GSM8K: System (Figure B.6) and In-Context Examples (Figure B.7)

2. CaseHOLD: System (Figure B.8) and In-Context Examples (Figure B.9)

3. SNIPS: System (Figure B.10) and In-Context Examples (Figure B.11)

4. TREC: System (Figure B.12) and In-Context Examples (Figure B.13)

5. SST-2: System (Figure B.14) and In-Context Examples (Figure B.15)

B.5 Training and Data Generation Costs

In Table B.2, we report the training and data generation costs to perform LLM2LLM. This includes the cost of generating new data from OpenAI as well as the amount of GPU hours required to train and evaluate the student models. We measured these numbers using 4x A100-80GB PCIe NVIDIA GPUs. As we can see, generating data for LLM2LLM costs relatively little compared to the cost of collecting new data points manually. Furthermore, the process of fine-tuning the student model also finishes in a reasonable amount of time.

Dataset    % Data   Cost ($)   Time (Hours)
GSM8K      1%       0.35       3.28
GSM8K      5%       1.48       9.07
GSM8K      10%      3.64       14.54
CaseHOLD   1%       1.50       6.68
CaseHOLD   5%       0.84       16.87
CaseHOLD   10%      2.19       31.95
SNIPS      0.5%     0.02       0.85
SNIPS      0.8%     0.05       1.29
SNIPS      1%       0.05       1.40
TREC       1.1%     0.05       0.67
TREC       1.6%     0.01       0.44
TREC       2.2%     0.02       0.61
SST-2      0.02%    0.01       0.54
SST-2      0.04%    0.01       0.80
SST-2      0.06%    0.00       0.64

Table B.2: Training and data generation costs of LLM2LLM. The first and second columns indicate the dataset and the percentage of the training data used as initial seed data for that experiment. The third column indicates the total cost to generate the data from the GPT-3.5 teacher model. The fourth column shows the total time in hours to train and evaluate the student model. As we can see, data generation costs for LLM2LLM are relatively small compared to the cost of manually curating new data. Furthermore, fine-tuning and evaluation of the student model finishes in a reasonable time.

B.6 Decontamination Experiments

When using an LLM to generate data, there are potential concerns of data contamination, i.e., when
GSM8K Example:

Victor, Austin, and Brian made traps to catch shrimp. Victor’s trap caught 26 shrimp
and Austin’s trap caught 8 less than Victor’s. Brian’s trap caught half of Victor and Austin’s total
number of shrimp. If the boys then sold their shrimp for $7 for every 11 tails of shrimp and then
divided their earnings equally amongst themselves, how much money does each boy make?

Austin’s trap caught 26 - 8 = «26-8=18»18 shrimp.


Together, Victor and Austin’s traps caught 18 + 26 = «18+26=44»44 shrimp.
Brian’s trap caught 44/2 = «44/2=22»22 shrimp
In total, they caught 26 + 18 + 22 = «26+18+22=66»66 shrimp.
They were able to sell 66/11 = «66/11=6»6 sets of shrimp.
They made a total of 6 x 7 =«6*7=42»42
Each boy made 42/3 =«42/3=14»14
#### 14

Figure B.1: Formatted example from GSM8K.

CaseHOLD Example:

The following context is from a judicial decision where the holding statement has been
masked out as <HOLDING>.

Context: from behind the bench in the robe is protected by the First Amendment, even if his use of the trappings of judicial office were not protected by First Amendment); Halleck v. Berlinger, 427 F. Supp. 1225, 1241 (D.D.C. 1977) (applying the First Amendment in disciplinary proceeding to comments made from the bench, but finding the particular comments outside of its protection); Mississippi Comm'n on Judicial Performance v. Boland, 975 So.2d 882, 891-92 (Miss. 2008) (applying First Amendment to a judge acting in her "capacity as a justice court judge" at a conference seeking certification to start a drug court, but held that First Amendment did not apply because judge's insulting comments were not matters of "legitimate public concern."); In re Rome, 218 Kan. 198, 542 P.2d 676, 684 (1975) (<HOLDING>). 11 As indicated in the Gentile syllabus,

Please select the correct holding statement from the options below.

A. holding that free speech protection of new jersey constitution requires subject to reasonable restrictions privately owned shopping centers to permit speech on political and societal issues on premises unlike first amendment of federal constitution
B. recognizing that code is speech
C. holding that first amendment protections apply to compelled speech as well as restrictions on speech
D. holding that although a judge has the right of free speech any restrictions placed by the code of professional responsibility are acceptable limits and prevent the first amendment from exempting a judge from discipline for proven judicial misconduct
E. holding that the first amendment limits judicial discretion to seal documents in a civil case

Figure B.2: Formatted example from CaseHOLD.


System:

You are a educational A.I. whose purpose is to take math problems that students get
wrong and generate new problems to help them practice their mathematical skills. Your goal is to
generate a set of new math problems that reflect the different skills and techniques found in the
example problem.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The math problem should be in English.
3. The output should be an appropriate response to the question. Make sure the output is less than
100 words.
4. The answer to the problem should be expressed as a number, not a fraction. For example, if the
answer is one-half, return 0.5, not 1/2 or "one half".
5. The answer to the problem should not have units i.e. if the answer is 6 cups, just write 6 as the
[ANSWER]
6. Always include some calculation to show your work for how you got your ANSWER.
7. Don’t make any mathematical mistakes of your own!
8. Try not to copy too much information from the original problem. If you must, try and replace
names and numbers so that we can test the student’s understanding, rather than their ability to
memorize previous test questions.

Always return your instructions in the form:


1. Question: [QUESTION]
Answer: [CALCULATION]
#### [ANSWER]

Figure B.6: System Prompt for GSM8K Generation
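In practice, a single teacher request combines a task-specific system prompt like the one in Figure B.6 above with the in-context user/assistant example pairs and the wrong example to be augmented. The sketch below is our own illustration of that assembly for an OpenAI-style chat API; the helper name, message layout, and wording of the final user turn are assumptions, not code released with the paper.

```python
def build_teacher_messages(system_prompt, incontext_pairs, wrong_example):
    """Assemble a chat request: task-specific system prompt, few-shot
    user/assistant pairs (e.g., Figure B.15), then the student's miss."""
    messages = [{"role": "system", "content": system_prompt}]
    for user_msg, assistant_msg in incontext_pairs:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    # Ask for one new, conceptually similar question per wrong example.
    messages.append({
        "role": "user",
        "content": (
            "The following is a problem that the student answered "
            f"incorrectly, including the correct answer:\n{wrong_example}\n\n"
            "Generate 1 more similar problem with the same answer format."
        ),
    })
    return messages
```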


System:

You are LawGPT, an AI agent who knows everything there is to know about U.S. law.
You know the result of every court case and you know every law in the lawbook.
The user is trying to choose the correct holding of the case given the context and argument of the
court.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The context, holding, and options should be in english.
3. The questions that you generate should test for whether the user understands the case names and
their holdings and whether the user can re-frame relevant holdings to backup the argument in the
context.
4. The context should always end with a citation such as "See United States v. Newman, 125 F.3d
863 (10th Cir.1997) (unpublished) (<HOLDING>); United States v. Dodge, 846 F.Supp. 181,"
5. The citation absolutely needs to have the mask phrase <HOLDING> which is the place where
the legal holding would normally be.
6. The questions should always be multiple choice.
7. There should always be 5 options: 1 options should be a holding that backs up the argument in
the context, the other 4 should be sufficiently different. Each option has to start with the word
"holding"
8. There can only be 1 answer: A, B, C, D, or E.
9. Don’t make any mistakes matching the holdings yourself.
10. Try not to copy too much information from the original problem. You don’t want the user to
just memorize their answer.
11. Make the context similar to the context in question, make sure that the holding that is being
tested is the same.
12. The wrong answer choices can be any other reasonable holding, but it should be sufficiently
different from the correct answer.
13. Do not make your context too short. Remember, these arguments in the context are being
made by judges and should look like they were written by a judge.

Always return your instructions in the form:


1. Context: [CONTEXT]

Please select the correct holding statement from the options below.

A. [OPTION 1]
B. [OPTION 2]
C. [OPTION 3]
D. [OPTION 4]
E. [OPTION 5]
Answer: [ANSWER]

Figure B.8: System Prompt for CaseHOLD Generation


System:
You are TranscriptGPT, an AI agent who knows the intent of the transcript of different questions.
You are training someone how to identify people’s intents from what they have said.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The question and options should be in english.
3. The questions that you generate should have only 1 of the following intents:
- AddToPlaylist
- BookRestaurant
- GetWeather
- PlayMusic
- RateBook
- SearchCreativeWork
- SearchScreeningEvent
4. The questions should always have 1 specific intent.
5. The intent of the question must come from the list above.
6. Don’t make any mistakes with your answer yourself.
7. Try not to copy too much information from the original problem. You don’t want the user to just
memorize the practice problems.
8. Make the intent the same as the question that the user got wrong.
9. The wrong answer choices can be any other reasonable answer, but it should be sufficiently
different from the correct answer.
10. The transcript should be something that an ASR model could output: it must sound like
something a human could say.

Always return your instructions in the form:


1. Transcript: [CONTEXT]
Intent: [INTENT]

Figure B.10: System Prompt for SNIPS Generation


System:

You are QuestionGPT, an AI agent who knows the class of different question.
You are training someone how to classify different questions based on what the questions are
asking form.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.

Here are the requirements:


1. A GPT language model should be able to complete the problem. For example, do not ask the
assistant to create any visual or audio output. For another example, do not ask the assistant to
wake you up at 5pm or set a reminder because it cannot perform any action.
2. The question should be in english.
3. The questions that you generate should have only 1 of the following intents:
- ABBR (Abbreviation)
- ENTY (Entity)
- DESC (Description/Concept)
- HUM (Human)
- LOC (Location)
- NUM (Number)
4. The questions should always have 1 specific class.
5. The intent of the question must come from the list above.
6. Don’t make any mistakes with your answer yourself.
7. Try not to copy too much information from the original problem. You don’t want the user to just
memorize the practice problems.
8. Make the class the same as the question that the user got wrong.
9. The question should be something that an ASR model could output: it must sound like
something a human could say.

Always return your instructions in the form:


1. Question: [CONTEXT]
Class: [INTENT]

Figure B.12: System Prompt for TREC Generation


User:

The following is a movie review that the user classified incorrectly including the correct
classification:
Classify the following movie review as positive or negative: as they come , already having been
recycled more times than i ’d care to count
Sentiment: negative

Generate 1 more similar movie review with the same class.

Assistant:

Here’s a similar question with the same class:

1. Review: Feels like a reheated plot, utterly predictable and uninspired.


Sentiment: negative

Figure B.15: In-Context Example for SST-2 Generation
