[pull] main from dair-ai:main #167

Merged 1 commit on Feb 4, 2025
Update README.md
Ritvik19 authored Feb 4, 2025
commit 044f28e01d49305c0f166bf58fd600db1c3ee2ea
README.md: 3 changes (2 additions, 1 deletion)
@@ -443,10 +443,11 @@ Explanations to key concepts in ML
| Paper | Date | Description |
|---|---|---|
| [Self-Taught Reasoner (STaR)](https://ritvik19.medium.com/papers-explained-288-star-cf485a5b117e) | May 2022 | A bootstrapping method that iteratively improves a language model's reasoning abilities by generating rationales for a dataset, filtering for rationales that lead to correct answers, fine-tuning the model on these successful rationales, and repeating this process, optionally augmented by "rationalization", where the model generates rationales given the correct answer as a hint. (A minimal sketch of this loop follows the table.) |
| [Reinforced Self-Training (ReST)](https://ritvik19.medium.com/papers-explained-301-rest-6389371a68ac) | April 2024 | Iteratively improves a language model by generating a dataset of samples from the current policy (Grow step), filtering those samples based on a reward model derived from human preferences (Improve step), and then fine-tuning the model on the filtered data using an offline RL objective, repeating this process with increasing filtering thresholds to continually refine the model's output quality. |
| [ReST^EM](https://ritvik19.medium.com/papers-explained-302-rest-em-9abe7c76936e) | December 2023 | A self-training method based on expectation-maximization for reinforcement learning with language models. It iteratively generates samples from the model, filters them using binary feedback (E-step), and fine-tunes the base pretrained model on these filtered samples (M-step). Unlike the original ReST, ReST^EM doesn't augment with human data and fine-tunes the base model each iteration, improving transfer performance. |
| [Direct Preference Optimization](https://ritvik19.medium.com/papers-explained-148-direct-preference-optimization-d3e031a41be1) | December 2023 | A stable, performant, and computationally lightweight algorithm that fine-tunes LLMs to align with human preferences without reinforcement learning, by directly optimizing for the policy that best satisfies the preferences with a simple classification objective. (A minimal loss sketch follows the table.) |
| [V-STaR](https://ritvik19.medium.com/papers-explained-289-v-star-4d2aeedab861) | February 2024 | Iteratively improves a language model's reasoning abilities by training a verifier with Direct Preference Optimization (DPO) on both correct and incorrect solutions generated by the model, while simultaneously fine-tuning the generator on only the correct solutions, ultimately using the verifier at inference time to select the best solution among multiple candidates. |
| [RAFT](https://ritvik19.medium.com/papers-explained-272-raft-5049520bcc26) | March 2024 | A training method that enhances the performance of LLMs for open-book in-domain question answering by training them to ignore irrelevant documents, cite verbatim relevant passages, and promote logical reasoning. |
| [RLHF Workflow](https://ritvik19.medium.com/papers-explained-149-rlhf-workflow-56b4e00019ed) | May 2024 | Provides a detailed recipe for online iterative RLHF and achieves state-of-the-art performance on various benchmarks using fully open-source datasets. |
| [Magpie](https://ritvik19.medium.com/papers-explained-183-magpie-0603cbdc69c3) | June 2024 | A self-synthesis method that extracts high-quality instruction data at scale by prompting an aligned LLM with left-side templates, generating 4M instructions and their corresponding responses. |
| [Instruction Pre-Training](https://ritvik19.medium.com/papers-explained-184-instruction-pretraining-ee0466f0fd33) | June 2024 | A framework to augment massive raw corpora with instruction-response pairs enabling supervised multitask pretraining of LMs. |
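
The STaR, ReST, ReST^EM, and V-STaR rows above all describe variants of the same generate-filter-fine-tune loop. The sketch below is a minimal, illustrative rendering of that loop, not code from any of the papers: `generate` and `finetune` are hypothetical callables supplied by the caller, and the rationalization retry is the optional STaR variant mentioned in the table.

```python
# Illustrative sketch of the generate -> filter -> fine-tune loop shared by
# STaR, ReST, and ReST^EM. `generate` and `finetune` are hypothetical,
# caller-supplied callables; nothing here comes from the papers' released code.

def self_training_loop(base_model, dataset, generate, finetune,
                       iterations=3, rationalize=True):
    """Run `iterations` rounds of sampling, filtering, and fine-tuning."""
    model = base_model
    for _ in range(iterations):
        kept = []  # rationales that reached the correct answer (the filter / "Improve" step)
        for question, answer in dataset:
            # Grow step / E-step: sample a rationale and answer from the current policy.
            rationale, prediction = generate(model, question)
            if prediction == answer:
                kept.append((question, rationale, answer))
            elif rationalize:
                # STaR's optional "rationalization": retry with the gold answer as a hint.
                rationale, prediction = generate(model, question, hint=answer)
                if prediction == answer:
                    kept.append((question, rationale, answer))
        # M-step: ReST^EM fine-tunes the *base* pretrained model on the filtered
        # samples each round (rather than the latest checkpoint), which the table
        # notes improves transfer.
        model = finetune(base_model, kept)
    return model
```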
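
For the Direct Preference Optimization row, the "simple classification objective" is a logistic loss on the margin between policy and reference log-probabilities of the preferred and dispreferred responses. The snippet below is a minimal PyTorch sketch of that loss, assuming per-sequence log-probabilities have already been computed; the argument names are illustrative, not a library API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Binary classification of which response is preferred:
    # -log sigmoid(beta * (chosen - rejected)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```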