[Figure 2: two panels titled "How Should We Increase The User's Training Data To Be Enough For Fine-Tuning?". Left: GSM8K, with curves for 74 (1%), 149 (2%), 373 (5%), and 747 (10%) seed data. Right: CaseHOLD, with curves for 225 (0.5%), 450 (1%), 900 (2%), and 2250 (5%) seed data. y-axis: Accuracy (%); x-axis: Total Data Size (Seed + LLM2LLM Data).]
Figure 2: LLM2LLM on GSM8K (left) and CaseHOLD (right) with various seed data sizes. Each line shows the test accuracy of the fine-tuned Llama-2-7B model at each step of LLM2LLM for a given seed dataset size. The first (left-most) data point on each line represents fine-tuning only on the seed data; each point afterward corresponds to the performance after one more iteration of LLM2LLM. The total data size (x-axis) is the total amount of seed plus LLM2LLM data used to train the model at that step. By applying LLM2LLM with small amounts of seed data and iteratively improving the training dataset, we can attain significant performance improvements. In particular, LLM2LLM can in some cases match or even exceed the performance of simply annotating more real data (detailed breakdown in Table 1).
Restricting $W_i$ during dataset generation to only data from the original seed data allows us to bound the total number of training data points to

$$|D^j| = n + \sum_{i=0}^{j} n\,p_i \;\le\; n(1 + j\,p_{\max}),$$

which has an upper bound that grows linearly with the number of steps. The empirical evaluations shown in Section 4.5.2 (Table 4) corroborate this.
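As a concrete check of this linear growth (the numbers here are ours, purely for illustration): with $n = 100$ seed examples and a per-step wrong-answer fraction of at most $p_{\max} = 0.3$, ten iterations can produce at most

$$|D^{10}| \le n(1 + 10\,p_{\max}) = 100\,(1 + 10 \cdot 0.3) = 400$$

training points, i.e., the dataset at most quadruples rather than growing geometrically.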
4 Results

4.1 Experimental Setup

To evaluate the performance of LLM2LLM, we applied our framework to fine-tune Llama-2-7B on various target datasets, including GSM8K (Cobbe et al., 2021), CaseHOLD (Zheng et al., 2021), SNIPS (Coucke et al., 2018), TREC (Li and Roth, 2002), and SST-2 (Socher et al., 2013). We subsampled these datasets at different sampling rates from 0.02% to 50% to evaluate performance across different low-data regimes. Our teacher model for these results is GPT-3.5 (1106 release) unless otherwise specified. We considered several other teacher models, including GPT-3.5, GPT-4-Turbo, Llama-2-70B (Touvron et al., 2023b), and Airoboros-L2-70B (Durbin, 2023), in Section 4.4. We include a more detailed experimental setup in Section A. We also conducted additional experiments (Section B.6) to ensure that the augmented data from the teacher models differs from the test dataset used to evaluate final model accuracy. This addresses the issue of potential test data leakage that could have occurred if the teacher model had been trained on similar data.
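For illustration, such seed splits can be drawn by fixed-seed random subsampling. The helper below is our sketch (the paper does not specify its sampling code):

```python
import random

def subsample(dataset: list, rate: float, seed: int = 0) -> list:
    """Draw a reproducible random seed subset at the given sampling rate.

    Example: rate=0.01 on the 7,473-example GSM8K training set gives
    the ~74-example (1%) split referenced below.
    """
    k = max(1, int(len(dataset) * rate))
    return random.Random(seed).sample(dataset, k)
```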
4.2 Main Results

Here, we discuss LLM2LLM's performance with varying amounts of training data by presenting results for fine-tuning Llama-2-7B on GSM8K using GPT-3.5 as the teacher model. We then discuss how these trends extend to different datasets (Table 1).

The final model accuracy after applying 10 iterations of LLM2LLM is given in Table 1. For a low-data regime with 74 available examples (i.e., 1% of the GSM8K training dataset), vanilla fine-tuning achieves only 0.99% test accuracy. However, LLM2LLM boosts the accuracy to 19.56% by iteratively generating 391 additional examples based on data points where the model makes mistakes. With slightly more available data of 149 seed examples (i.e., 2% of the training dataset), we can achieve 25.70% accuracy. As shown by the baseline accuracy with 20% of the data in Table 2, we would need over 10× more training data points to match this accuracy if we relied only on vanilla fine-tuning. We also highlight that LLM2LLM can lead to noticeable gains in data-sufficient regimes (e.g., 100% data), albeit with a smaller improvement over the baseline compared to lower-data regimes.

We observe a similar trend for CaseHOLD, SNIPS, TREC, and SST-2, where LLM2LLM helps improve performance in the low-data regime.
Dataset     Steps            Total Aug.   Acc. (%)
GSM8K       1 (one-shot)     490          16.30
GSM8K       10 (iterative)   471          23.73
CaseHOLD    1 (one-shot)     276          59.94
CaseHOLD    10 (iterative)   198          64.50

Table 3: Ablation on the iterative nature of LLM2LLM with 100 seed data points. Steps refers to the total number of augmentation steps in LLM2LLM. For the case of 1 iteration, we prompt the teacher model to generate more samples all at once, whereas in the 10-step case the teacher model generates only 1 new data point per wrong example. The results clearly show that the latter iterative approach results in better performance.

We compare running 10 iterative steps of LLM2LLM against generating a comparable amount of augmented data in a single iteration, for both the GSM8K and CaseHOLD datasets. As shown in Table 3, using a single augmentation step with a larger amount of augmented data significantly underperforms the alternative of executing 10 iterative steps of LLM2LLM with a smaller number of augmentations per iteration. In particular, on GSM8K, augmenting one data point per example over 10 steps yields a 7.4% higher accuracy than augmenting five data points per example in a single step. Similarly, on CaseHOLD, iterative augmentation of one data point per example over 10 steps results in a 4.6% improvement over a one-shot augmentation with four data points per example. This justifies LLM2LLM's iterative augmentation approach, which generates one data point for each incorrectly answered example.
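For reference, the overall loop that these ablations probe can be sketched as follows. This is our paraphrase of the procedure described in the text, not the authors' released implementation; `finetune`, `is_correct`, and `teacher_generate` are assumed helper callables:

```python
from typing import Any, Callable, List

def llm2llm_loop(
    seed_data: List[Any],
    finetune: Callable[[List[Any]], Any],    # assumed: train data -> fine-tuned student
    is_correct: Callable[[Any, Any], bool],  # assumed: (student, example) -> bool
    teacher_generate: Callable[[Any], Any],  # assumed: wrong example -> new example
    num_steps: int = 10,
) -> Any:
    """Sketch of the LLM2LLM loop as described in the text."""
    train_data = list(seed_data)
    # Step 0: fine-tune the base model on the seed data alone.
    student = finetune(train_data)
    for _ in range(num_steps):
        # Evaluate only on the original seed set (the choice ablated in Table 4).
        wrong = [ex for ex in seed_data if not is_correct(student, ex)]
        if not wrong:
            break
        # One targeted new example per mistake (the iterative setting of Table 3).
        train_data += [teacher_generate(ex) for ex in wrong]
        # Re-fine-tune from scratch on seed plus all augmented data (cf. Table 5).
        student = finetune(train_data)
    return student
```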
4.5.2 Data Augmentation with Seed Data vs Augmented Data

Dataset     Only Aug. Seed Data   Total Aug.   Acc. (%)
GSM8K       ✗                     4302         18.32
GSM8K       ✓                     471          23.75
CaseHOLD    ✗                     351          63.75
CaseHOLD    ✓                     198          64.50

Table 4: Ablation study on whether to augment previously generated LLM2LLM data. Only Aug. Seed Data refers to augmenting only the seed data vs. also re-augmenting the augmented data. Total Aug. refers to the total number of augmentations generated over 10 steps of LLM2LLM.

In each iteration, LLM2LLM evaluates the student model's performance only on the original seed dataset and generates augmented data from incorrect seed examples. However, a possible alternative is performing evaluation and data augmentation using both seed and previously augmented data. The latter often leads to sub-optimal performance as well as excessive amounts of total augmented data points, as we demonstrate in Table 4. On GSM8K, generating augmented data from the previous iteration's augmented data yields 18.3% accuracy, while using the seed data for further augmentation improves the accuracy to 23.75%. We observe a similar trend for CaseHOLD. As discussed in Section 3.1, a potential reason for the performance drop when using augmented data for further augmentation is a deviation from the original data distribution.
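A back-of-the-envelope comparison makes the blow-up concrete. Assuming, purely for illustration, that a constant fraction $p$ of the current training set is answered incorrectly and augmented at each step, re-augmenting augmented data compounds geometrically, whereas seed-only augmentation grows at most linearly:

$$|D^j_{\text{re-aug}}| \approx n\,(1+p)^j \qquad \text{vs.} \qquad |D^j_{\text{seed-only}}| \le n\,(1 + j\,p_{\max}).$$

With $n = 100$, $p = 0.3$, and $j = 10$, the former gives roughly $100 \cdot 1.3^{10} \approx 1379$ points versus at most $400$, the same qualitative gap as the 4302 vs. 471 augmented points reported for GSM8K in Table 4.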
4.5.3 From-scratch Fine-tuning vs Continuous Fine-tuning

Dataset     From-scratch Fine-tuning   Total Aug.   Acc. (%)
GSM8K       ✗                          230          14.71
GSM8K       ✓                          471          23.75
CaseHOLD    ✗                          154          60.50
CaseHOLD    ✓                          198          64.50

Table 5: Ablation study on whether to fine-tune from scratch or to do continuous fine-tuning. From-scratch Fine-tuning refers to whether we fine-tune the base model from scratch vs. fine-tune the previous step's model. Total Aug. refers to the total number of augmented examples generated over 10 steps of LLM2LLM.

Another key decision for LLM2LLM is whether to continue fine-tuning from the last iteration's checkpoint (i.e., continuous fine-tuning) or to restart fine-tuning from the pre-trained model at each iteration (i.e., from-scratch fine-tuning). Considering the non-convex nature of the optimization target and complex loss landscapes, this decision is not necessarily obvious. Nevertheless, as shown in Table 5, we observe that from-scratch fine-tuning consistently and significantly outperforms continuous fine-tuning, with up to 9% accuracy improvement. The inferior performance of continuous fine-tuning can be attributed to potential overfitting to the small seed data over multiple iterations of fine-tuning, especially in lower-data regimes where the seed data is small. This can be alleviated by restarting fine-tuning from scratch in each iteration, with sufficient augmented data appended to the seed data to form the training dataset.
5 Conclusion

We have introduced LLM2LLM, an adaptive and iterative LLM-based data augmentation framework that uses LLMs to scale up smaller fine-tuning datasets in lieu of manually generating more data. This framework substantially reduces the amount
Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. PaLM 2 technical report.

Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q Tran, Dara Bahri, Jianmo Ni, et al. 2021. ExT5: Towards extreme multi-task scaling for transfer learning. In International Conference on Learning Representations.

Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. 2023. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision.

Yihan Cao, Yanbin Kang, Chi Wang, and Lichao Sun. 2023. Instruction mining: When data mining meets large language model finetuning.

Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, and Hongxia Jin. 2023. AlpaGasus: Training a better Alpaca with fewer data.

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. 2024. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.

Claude Coulombe. 2018. Text data augmentation made simple by leveraging NLP cloud APIs. arXiv preprint arXiv:1812.04718.

Haixing Dai, Zhengliang Liu, Wenxiong Liao, Xiaoke Huang, Yihan Cao, Zihao Wu, Lin Zhao, Shaochen Xu, Wei Liu, Ninghao Liu, Sheng Li, Dajiang Zhu, Hongmin Cai, Lichao Sun, Quanzheng Li, Dinggang Shen, Tianming Liu, and Xiang Li. 2023. AugGPT: Leveraging ChatGPT for text data augmentation.

Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. 2023a. Rephrase and respond: Let large language models ask better questions for themselves.

Yuntian Deng, Kiran Prasad, Roland Fernandez, Paul Smolensky, Vishrav Chaudhary, and Stuart Shieber. 2023b. GPT-4 Turbo vs. GPT-4 comparison. https://github.com/da03/implicit_chain_of_thought/tree/main/gpt4_baselines.

Bosheng Ding, Chengwei Qin, Linlin Liu, Yew Ken Chia, Boyang Li, Shafiq Joty, and Lidong Bing. 2023. Is GPT-3 a good data annotator? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11173–11195.

Jon Durbin. 2023. jondurbin/airoboros-l2-70b-3.1.2 · Hugging Face.

Luyang Fang, Gyeong-Geon Lee, and Xiaoming Zhai. 2023. Using GPT-4 to augment unbalanced data for automatic scoring.

Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968–988, Online. Association for Computational Linguistics.

Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. 2023a. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance.

Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Annual Meeting of the Association for Computational Linguistics.

Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. 2023. Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707.

Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. 2023. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. 2023. Rephrase, augment, reason: Visual grounding of questions for vision-language models.

Arthur L Samuel. 2000. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 44(1.2):206–226.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. 2021. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Avi Singh, John D Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Peter J Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, et al. 2023. Beyond human data: Scaling self-training for problem-solving with language models. arXiv preprint arXiv:2312.06585.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.

Gerald Tesauro. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Solomon Ubani, Suleyman Olcay Polat, and Rodney Nielsen. 2023. ZeroShotDataAug: Generating and augmenting training data with ChatGPT. arXiv preprint arXiv:2304.14334.

Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2021. Want to reduce labeling cost? GPT-3 can help. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4195–4205.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, et al. 2022a. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109.

Yufei Wang, Jiayi Zheng, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, and Daxin Jiang. 2022b. KnowDA: All-in-one knowledge mixture model for data augmentation in few-shot NLP. arXiv preprint arXiv:2206.10265.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382–6388, Hong Kong, China. Association for Computational Linguistics.
A Experimental Setup

A.2 Models

Table A.1: Experiments on how the quality of the teacher model affects the performance of LLM2LLM. For each of these experiments, we only change the teacher model to measure its effect on the final outcome.
instructions for generating new data points from stronger models such as GPT-3.5 and GPT-4. This was necessary as these approaches targeted improving the LLM over a wide range of different tasks. However, for LLM2LLM, we are trying to improve the LLM at domain-specific tasks. Thus, for each task, the system prompt that we give to the teacher model differs on a per-task basis. This allows the user to inject and leverage domain-specific knowledge about the task in the dataset generation procedure, creating higher-quality fine-tuning data. In practice, we also use in-context learning with few-shot prompting to bootstrap the teacher model's ability to generate relevant questions (a sketch of this prompt assembly follows the list below).

The detailed system prompt and in-context examples for each dataset are provided below:

1. GSM8K: System (Figure B.6) and In-Context Examples (Figure B.7)
2. CaseHOLD: System (Figure B.8) and In-Context Examples (Figure B.9)
3. SNIPS: System (Figure B.10) and In-Context Examples (Figure B.11)
4. TREC: System (Figure B.12) and In-Context Examples (Figure B.13)
5. SST-2: System (Figure B.14) and In-Context Examples (Figure B.15)
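To make this concrete, the sketch below shows one plausible way to assemble such a teacher request from a task-specific system prompt plus few-shot in-context examples. It follows the standard OpenAI chat-message convention; the function and its arguments are our illustration, not the paper's exact prompts:

```python
from typing import Dict, List, Tuple

def build_teacher_request(
    system_prompt: str,
    incontext_examples: List[Tuple[str, str]],
    wrong_example: str,
) -> List[Dict[str, str]]:
    """Assemble a chat-style request asking the teacher for one new data
    point modeled on an example the student answered incorrectly."""
    messages = [{"role": "system", "content": system_prompt}]
    # Few-shot bootstrapping: (wrong example -> previously generated question) pairs.
    for original, generated in incontext_examples:
        messages.append({"role": "user", "content": original})
        messages.append({"role": "assistant", "content": generated})
    # The example the student actually got wrong in this iteration.
    messages.append({"role": "user", "content": wrong_example})
    return messages
```

The resulting message list can then be sent to the teacher model, e.g. a GPT-3.5 chat-completion endpoint.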
B.5 Training and Data Generation Costs

In Table B.2, we report the training and data generation costs of performing LLM2LLM. This includes the cost of generating new data from OpenAI as well as the number of GPU hours required to train and evaluate the student models. We measured these numbers using 4× A100-80GB PCIe NVIDIA GPUs. As we can see, generating data for LLM2LLM costs relatively little compared to the cost of collecting new data points manually. Furthermore, the process of fine-tuning the student model also finishes in a reasonable amount of time.

Dataset    % Data   Cost ($)   Time (Hours)
GSM8K      1%       0.35       3.28
           5%       1.48       9.07
           10%      3.64       14.54
CaseHOLD   1%       1.50       6.68
           5%       0.84       16.87
           10%      2.19       31.95
SNIPS      0.5%     0.02       0.85
           0.8%     0.05       1.29
           1%       0.05       1.40
TREC       1.1%     0.05       0.67
           1.6%     0.01       0.44
           2.2%     0.02       0.61
SST-2      0.02%    0.01       0.54
           0.04%    0.01       0.80
           0.06%    0.00       0.64

Table B.2: Training and data generation costs of LLM2LLM. The first and second columns indicate the dataset and the percentage of the training data used as initial seed data for that experiment. The third column indicates the total cost to generate the data from the GPT-3.5 teacher model. The fourth column shows the total time in hours to train and evaluate the student model. As we can see, data generation costs for LLM2LLM are relatively small compared to the cost of manually curating new data. Furthermore, fine-tuning and evaluation of the student model finish in a reasonable time.

B.6 Decontamination Experiments

When using an LLM to generate data, there are potential concerns of data contamination, i.e., when
GSM8K Example:
Victor, Austin, and Brian made traps to catch shrimp. Victor’s trap caught 26 shrimp
and Austin’s trap caught 8 less than Victor’s. Brian’s trap caught half of Victor and Austin’s total
number of shrimp. If the boys then sold their shrimp for $7 for every 11 tails of shrimp and then
divided their earnings equally amongst themselves, how much money does each boy make?
CaseHOLD Example:
The following context is from a judicial decision where the holding statement has been
masked out as <HOLDING>.
Context: from behind the bench in the robe is protected by the First Amendment, even
if his use of the trappings of judicial office were not protected by First Amendment); Halleck v.
Berlinger, 427 F. Supp. 1225, 1241 (D.D.C. 1977) (applying the First Amendment in disciplinary
proceeding to comments made from the bench, but finding the particular comments outside of its
protection); Mississippi Comm’n on Judicial Performance v. Boland, 975 So.2d 882, 891-92 (Miss.
2008) (applying First Amendment to a judge acting in her “capacity as a justice court judge” at a
conference seeking certification to start a drug court, but held that First Amendment did not apply
because judge’s insulting comments were not matters of “legitimate public concern.”); In re Rome,
218 Kan. 198, 542 P.2d 676, 684 (1975) (<HOLDING>). 11 As indicated in the Gentile syllabus,
Please select the correct holding statement from the options below.
A. holding that free speech protection of new jersey constitution requires subject to rea-
sonable restrictions privately owned shopping centers to permit speech on political and societal
issues on premises unlike first amendment of federal constitution
B. recognizing that code is speech
C. holding that first amendment protections apply to compelled speech as well as restrictions on
speech
D. holding that although a judge has the right of free speech any restrictions placed by the code of
professional responsibility are acceptable limits and prevent the first amendment from exempting a
judge from discipline for proven judicial misconduct
E. holding that the first amendment limits judicial discretion to seal documents in a civil case
You are an educational A.I. whose purpose is to take math problems that students get
wrong and generate new problems to help them practice their mathematical skills. Your goal is to
generate a set of new math problems that reflect the different skills and techniques found in the
example problem.
You are LawGPT, an AI agent who knows everything there is to know about U.S. law.
You know the result of every court case and you know every law in the lawbook.
The user is trying to choose the correct holding of the case given the context and argument of the
court.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.
Please select the correct holding statement from the options below.
A. [OPTION 1]
B. [OPTION 2]
C. [OPTION 3]
D. [OPTION 4]
E. [OPTION 5]
Answer: [ANSWER]
You are QuestionGPT, an AI agent who knows the classes of different questions.
You are training someone to classify different questions based on what the questions are
asking for.
You are trying to give the user assistance by giving them more practice questions for the questions
that they get wrong.
The following is a movie review that the user classified incorrectly, including the correct
classification:
Classify the following movie review as positive or negative: as they come , already having been
recycled more times than i ’d care to count
Sentiment: negative
Assistant: