remove educational as constraint. i think we might be able to make this repo have teeth and have some more optimized versions. but we'd keep the simple reference code
karpathy committed Feb 17, 2024
1 parent 2c521cb commit 3e68413
Showing 1 changed file with 1 addition and 1 deletion.
README.md (2 changes: 1 addition & 1 deletion)
@@ -1,6 +1,6 @@
# minbpe

-Minimal, clean, educational code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
+Minimal, clean code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.

This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https://github.com/openai/gpt-2) from OpenAI. [Sennrich et al. 2015](https://arxiv.org/abs/1508.07909) is cited as the original reference for the use of BPE in NLP applications. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.

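For context, the byte-level BPE training loop that the README line above describes can be sketched in a few lines of Python. This is a simplified illustration of the idea, not minbpe's actual Tokenizer code; the helper names `get_pair_counts` and `merge` are chosen here for clarity.

```python
# A minimal sketch of byte-level BPE training (illustrative, not minbpe's API).

def get_pair_counts(ids):
    """Count occurrences of each adjacent pair of token ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # "byte-level": start from raw UTF-8 bytes (ids 0..255)
merges = {}
for i in range(3):  # perform 3 merges
    counts = get_pair_counts(ids)
    pair = max(counts, key=counts.get)  # most frequent adjacent pair
    new_id = 256 + i                    # new token ids start after the 256 byte values
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
print(merges)  # learned merge rules, in order
print(ids)     # the compressed token sequence
```

Starting from raw UTF-8 bytes means the base vocabulary is exactly the 256 possible byte values, so any string can be tokenized without out-of-vocabulary issues.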
