[pull] main from dair-ai:main #167

Merged 1 commit on Feb 4, 2025
Update README.md
Ritvik19 authored Feb 4, 2025
commit 044f28e01d49305c0f166bf58fd600db1c3ee2ea
README.md: 3 changes (2 additions, 1 deletion)
@@ -443,10 +443,11 @@ Explanations to key concepts in ML
| Paper | Date | Description |
|---|---|---|
| [Self-Taught Reasoner (STaR)](https://ritvik19.medium.com/papers-explained-288-star-cf485a5b117e) | May 2022 | A bootstrapping method that iteratively improves a language model's reasoning abilities by generating rationales for a dataset, filtering for rationales that lead to correct answers, fine-tuning the model on these successful rationales, and repeating this process, optionally augmented by "rationalization", where the model generates rationales given the correct answer as a hint. (A minimal sketch of this loop follows the table.) |
| [Reinforced Self-Training (ReST)](https://ritvik19.medium.com/papers-explained-301-rest-6389371a68ac) | April 2024 | Iteratively improves a language model by generating a dataset of samples from the current policy (Grow step), filtering those samples based on a reward model derived from human preferences (Improve step), and then fine-tuning the model on the filtered data using an offline RL objective, repeating this process with increasing filtering thresholds to continually refine the model's output quality. |
| [ReST^EM](https://ritvik19.medium.com/papers-explained-302-rest-em-9abe7c76936e) | December 2023 | A self-training method based on expectation-maximization for reinforcement learning with language models. It iteratively generates samples from the model, filters them using binary feedback (E-step), and fine-tunes the base pretrained model on these filtered samples (M-step). Unlike the original ReST, ReST^EM doesn't augment with human data and fine-tunes the base model each iteration, improving transfer performance. |
| [Direct Preference Optimization](https://ritvik19.medium.com/papers-explained-148-direct-preference-optimization-d3e031a41be1) | December 2023 | A stable, performant, and computationally lightweight algorithm that fine-tunes LLMs to align with human preferences without reinforcement learning, by directly optimizing for the policy that best satisfies the preferences with a simple classification objective. (A minimal loss sketch follows the table.) |
| [V-STaR](https://ritvik19.medium.com/papers-explained-289-v-star-4d2aeedab861) | February 2024 | Iteratively improves a language model's reasoning abilities by training a verifier with Direct Preference Optimization (DPO) on both correct and incorrect solutions generated by the model, while simultaneously fine-tuning the generator on only the correct solutions, ultimately using the verifier at inference time to select the best solution among multiple candidates. |
| [RAFT](https://ritvik19.medium.com/papers-explained-272-raft-5049520bcc26) | March 2024 | A training method that enhances the performance of LLMs for open-book in-domain question answering by training them to ignore irrelevant documents, cite verbatim relevant passages, and promote logical reasoning. |
| [RLHF Workflow](https://ritvik19.medium.com/papers-explained-149-rlhf-workflow-56b4e00019ed) | May 2024 | Provides a detailed recipe for online iterative RLHF and achieves state-of-the-art performance on various benchmarks using fully open-source datasets. |
| [Magpie](https://ritvik19.medium.com/papers-explained-183-magpie-0603cbdc69c3) | June 2024 | A self-synthesis method that extracts high-quality instruction data at scale by prompting an aligned LLM with left-side templates, generating 4M instructions and their corresponding responses. |
| [Instruction Pre-Training](https://ritvik19.medium.com/papers-explained-184-instruction-pretraining-ee0466f0fd33) | June 2024 | A framework to augment massive raw corpora with instruction-response pairs enabling supervised multitask pretraining of LMs. |
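
The STaR, ReST, ReST^EM, and V-STaR rows above all describe variants of the same generate-filter-fine-tune loop. The sketch below is a minimal, illustrative rendering of that loop, not code from any of the papers: `generate` and `finetune` are hypothetical callables supplied by the caller, and the rationalization retry is the optional STaR variant mentioned in the table.

```python
# Illustrative sketch of the generate -> filter -> fine-tune loop shared by
# STaR, ReST, and ReST^EM. `generate` and `finetune` are hypothetical,
# caller-supplied callables; nothing here comes from the papers' released code.

def self_training_loop(base_model, dataset, generate, finetune,
                       iterations=3, rationalize=True):
    """Run `iterations` rounds of sampling, filtering, and fine-tuning."""
    model = base_model
    for _ in range(iterations):
        kept = []  # rationales that reached the correct answer (the filter / "Improve" step)
        for question, answer in dataset:
            # Grow step / E-step: sample a rationale and answer from the current policy.
            rationale, prediction = generate(model, question)
            if prediction == answer:
                kept.append((question, rationale, answer))
            elif rationalize:
                # STaR's optional "rationalization": retry with the gold answer as a hint.
                rationale, prediction = generate(model, question, hint=answer)
                if prediction == answer:
                    kept.append((question, rationale, answer))
        # M-step: ReST^EM fine-tunes the *base* pretrained model on the filtered
        # samples each round (rather than the latest checkpoint), which the table
        # notes improves transfer.
        model = finetune(base_model, kept)
    return model
```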
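
For the Direct Preference Optimization row, the "simple classification objective" is a logistic loss on the margin between policy and reference log-probabilities of the preferred and dispreferred responses. The snippet below is a minimal PyTorch sketch of that loss, assuming per-sequence log-probabilities have already been computed; the argument names are illustrative, not a library API.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Binary classification of which response is preferred:
    # -log sigmoid(beta * (chosen - rejected)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```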