create exercise to rewrite minbpe for learners

yashch22 · Feb 19, 2024 · a98e0a5
1 parent 1cc9d06
commit a98e0a5
Showing 2 changed files with 59 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -126,6 +126,10 @@ $ pytest -v .
 
 to run the tests. (-v is verbose, slightly prettier).
 
+## exercise
+
+For those trying to study BPE, here is a suggested progression of exercises for building your own minbpe step by step. See [exercise.md](exercise.md).
+
 ## todos
 
 - write a more optimized Python version that could run over large files and big vocabs

diff --git a/exercise.md b/exercise.md
@@ -0,0 +1,55 @@
+# exercise
+
+Build your own GPT-4 Tokenizer!
+
+### Step 1
+
+Write the `BasicTokenizer` class, with the following three core functions:
+
+- `def train(self, text, vocab_size, verbose=False)`
+- `def encode(self, text)`
+- `def decode(self, ids)`
+
+Train your tokenizer on whatever text you like and visualize the merged tokens. Do they look reasonable? One default test you may wish to use is the text file `tests/taylorswift.txt`.
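
A minimal sketch of what Step 1 could look like, assuming a byte-level BPE in which ties between equally frequent pairs go to the pair seen first. The helper names `get_stats` and `merge` are illustrative choices, not something the exercise prescribes.

```python
from collections import Counter


def get_stats(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, idx):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `idx`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


class BasicTokenizer:
    def __init__(self):
        self.merges = {}  # (id, id) -> new id, in the order learned
        self.vocab = {i: bytes([i]) for i in range(256)}  # id -> raw bytes

    def train(self, text, vocab_size, verbose=False):
        assert vocab_size >= 256
        ids = list(text.encode("utf-8"))
        for idx in range(256, vocab_size):
            stats = get_stats(ids)
            if not stats:
                break  # fewer than two tokens left, nothing to merge
            pair = max(stats, key=stats.get)  # most frequent adjacent pair
            ids = merge(ids, pair, idx)
            self.merges[pair] = idx
            self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]]
            if verbose:
                print(f"merge {pair} -> {idx} ({self.vocab[idx]!r})")

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        while len(ids) >= 2:
            stats = get_stats(ids)
            # Apply merges in the order they were learned (lowest new id first).
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break  # no remaining pair was ever merged during training
            ids = merge(ids, pair, self.merges[pair])
        return ids

    def decode(self, ids):
        data = b"".join(self.vocab[i] for i in ids)
        # A token sequence may not be valid UTF-8, so replace bad bytes.
        return data.decode("utf-8", errors="replace")
```

Training on `"aaabdaaabac"` with `vocab_size=259` learns three merges, after which `encode` compresses that string to five tokens and `decode` returns the original text.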
+
+### Step 2
+
+Convert your `BasicTokenizer` into a `RegexTokenizer`, which takes a regex pattern and splits the text exactly as GPT-4 would. Process the parts separately as before, then concatenate the results. Retrain your tokenizer and compare the results before and after. You should now see no tokens that span categories (numbers, letters, punctuation, more than one whitespace). Use the GPT-4 pattern:
+
+```
+GPT4_SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
+```
+
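
The chunk-wise idea in Step 2 can be sketched as follows. Python's standard `re` module does not support `\p{L}`, so the GPT-4 pattern above needs the third-party `regex` package. To stay runnable with the standard library alone, this sketch uses a simplified stand-in pattern (`SIMPLE_SPLIT_PATTERN`, an assumption made for illustration, not the GPT-4 pattern). The function names are hypothetical too.

```python
import re
from collections import Counter

# Simplified stand-in, NOT the GPT-4 pattern: runs of letters, runs of up to
# three digits, runs of punctuation, runs of whitespace, runs of underscores.
SIMPLE_SPLIT_PATTERN = r"[^\W\d_]+|\d{1,3}|[^\s\w]+|\s+|_+"


def split_chunks(text, pattern=SIMPLE_SPLIT_PATTERN):
    """Split text into chunks; merges are never allowed to cross chunk borders."""
    return re.findall(pattern, text)


def chunk_pair_counts(chunks):
    """Count adjacent byte pairs inside each chunk, never across two chunks."""
    counts = Counter()
    for chunk in chunks:
        ids = list(chunk.encode("utf-8"))
        counts.update(zip(ids, ids[1:]))
    return counts


chunks = split_chunks("hello world123!!")
counts = chunk_pair_counts(chunks)
```

Because pairs are only counted within a chunk, a pair such as (letter, space) never shows up in the statistics, so training cannot learn a merge that glues two categories together. Encoding follows the same shape: encode each chunk on its own, then concatenate the resulting ids.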
+
+### Step 3
+
+You're now ready to load the merges from the GPT-4 tokenizer and show that your tokenizer produces the identical results for both `encode` and `decode`, matching [tiktoken](https://github.com/openai/tiktoken).
+
+```
+# match this
+import tiktoken
+enc = tiktoken.get_encoding("cl100k_base") # this is the GPT-4 tokenizer
+ids = enc.encode("hello world!!!? (안녕하세요!) lol123 😉")
+text = enc.decode(ids) # get the same text back
+```
+
+Unfortunately, you will run into two issues:
+
+1. It is not trivial to recover the raw merges from the GPT-4 tokenizer. You can easily recover what we call `vocab` here, and what they call and store under `enc._mergeable_ranks`. Feel free to copy paste the `recover_merges` function in `minbpe/gpt4.py`, which takes these ranks and returns the raw merges. If you wish to know how this function works, read [this](https://github.com/openai/tiktoken/issues/60) and [this](https://github.com/karpathy/minbpe/issues/11#issuecomment-1950805306). Basically, under some conditions it is enough to only store the parent nodes (and their rank) and get rid of the precise details of which children merged up to any parent.
+2. The GPT-4 tokenizer for some reason permutes its raw bytes. It stores this permutation in the first 256 elements of the mergeable ranks, so you can recover this byte shuffle relatively simply as `byte_shuffle = {i: enc._mergeable_ranks[bytes([i])] for i in range(256)}`. In both your encode and decode, you'll have to shuffle bytes around accordingly. If you're stuck, reference the `minbpe/gpt4.py` file for hints.
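
Both issues can be sketched on a toy table, so that nothing needs to be downloaded. This is a hedged illustration of the idea, not a copy of `minbpe/gpt4.py`: `toy_ranks` is a made-up stand-in for `enc._mergeable_ranks`, and the helper names `bpe`, `shuffle_bytes` and `unshuffle_bytes` are assumptions. The trick for recovering merges is to re-run BPE on a token's own bytes using only merges ranked below that token; what remains is the pair that produced it.

```python
def bpe(ranks, token, max_rank):
    """Re-run BPE on one token's bytes, using only merges ranked below max_rank."""
    parts = [bytes([b]) for b in token]
    while True:
        best_i, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_i, best_rank = i, rank
        if best_rank is None or best_rank >= max_rank:
            break
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts


def recover_merges(ranks):
    """Map each multi-byte token back to the pair of token ids that formed it."""
    merges = {}
    for token, rank in ranks.items():
        if len(token) == 1:
            continue  # single bytes are leaves, not merges
        left, right = bpe(ranks, token, max_rank=rank)
        merges[(ranks[left], ranks[right])] = rank
    return merges


# Toy stand-in for enc._mergeable_ranks: 256 single bytes under an arbitrary
# permutation (an assumption for illustration), followed by three merged tokens.
toy_ranks = {bytes([i]): (i * 7 + 3) % 256 for i in range(256)}
toy_ranks.update({b"aa": 256, b"aaa": 257, b"aaab": 258})

merges = recover_merges(toy_ranks)

# The byte shuffle is read off the single-byte entries, as in the text above.
byte_shuffle = {i: toy_ranks[bytes([i])] for i in range(256)}
inverse_byte_shuffle = {v: k for k, v in byte_shuffle.items()}


def shuffle_bytes(data):
    """Encode side: permute raw UTF-8 bytes before applying merges."""
    return bytes(byte_shuffle[b] for b in data)


def unshuffle_bytes(data):
    """Decode side: undo the permutation before decoding as UTF-8."""
    return bytes(inverse_byte_shuffle[b] for b in data)
```

Note that the recovered merges are expressed in shuffled ids, which is why encode has to shuffle the raw bytes before merging and decode has to unshuffle the concatenated token bytes before turning them back into text.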
+
+### Step 4
+
+(Optional, irritating, not obviously useful) Add the ability to handle special tokens. You'll then be able to match the output of tiktoken even when special tokens are present, e.g.:
+
+```
+import tiktoken
+enc = tiktoken.get_encoding("cl100k_base") # this is the GPT-4 tokenizer
+ids = enc.encode("<|endoftext|>hello world", allowed_special="all")
+```
+
+Without `allowed_special`, tiktoken raises an error when the text contains a special token.
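
One way to approach Step 4 is to split the text on the special tokens first and only run ordinary BPE on the pieces in between. The sketch below assumes this approach. `encode_ordinary` is a hypothetical stand-in that returns raw UTF-8 bytes, so swap in your own tokenizer's chunk encoding. The id 100257 is the one `cl100k_base` assigns to `<|endoftext|>`.

```python
import re

SPECIAL_TOKENS = {"<|endoftext|>": 100257}


def encode_ordinary(text):
    """Stand-in for your real BPE encode: here just the raw UTF-8 bytes."""
    return list(text.encode("utf-8"))


def encode(text, special_tokens=SPECIAL_TOKENS):
    if not special_tokens:
        return encode_ordinary(text)
    # The capturing group makes re.split keep the special tokens in the output.
    pattern = "(" + "|".join(re.escape(s) for s in special_tokens) + ")"
    ids = []
    for part in re.split(pattern, text):
        if part in special_tokens:
            ids.append(special_tokens[part])  # one id, no BPE applied
        elif part:
            ids.extend(encode_ordinary(part))
    return ids


def decode(ids, special_tokens=SPECIAL_TOKENS):
    inverse = {v: k for k, v in special_tokens.items()}
    out = []
    for i in ids:
        if i in inverse:
            out.append(inverse[i].encode("utf-8"))
        else:
            out.append(bytes([i]))  # stand-in vocab: every id is a single byte
    return b"".join(out).decode("utf-8", errors="replace")
```

With the stand-in encoder, `encode("<|endoftext|>hello world")` starts with `100257` followed by the bytes of `hello world`, and `decode` reverses it.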
+
+### Step 5
+
+If you've made it this far, you're now a pro at LLM Tokenization! Sadly, you're not exactly done yet because a lot of LLMs outside of OpenAI (e.g. Llama, Mistral) use [sentencepiece](https://github.com/google/sentencepiece) instead. The primary difference is that sentencepiece runs BPE directly on Unicode code points instead of on UTF-8 encoded bytes. Feel free to explore sentencepiece on your own (good luck, it's not too pretty). As a stretch goal, if you really have time on your hands, rewrite your BPE to operate on Unicode code points and match the Llama 2 tokenizer.
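
To make the difference concrete, here is a small sketch of the two starting sequences a BPE could merge over. It only shows the initial units and is not sentencepiece itself. The function names are hypothetical.

```python
def initial_ids_bytes(text):
    """Byte-level BPE starts from UTF-8 bytes: ids always fall in 0..255."""
    return list(text.encode("utf-8"))


def initial_ids_code_points(text):
    """Code-point-level BPE starts from Unicode code points: one id per character."""
    return [ord(ch) for ch in text]


text = "안녕 😉"
byte_ids = initial_ids_bytes(text)
code_point_ids = initial_ids_code_points(text)
```

For this string the byte view has 11 initial units and the code point view has 4. The trade-off is that the byte alphabet is fixed at 256 symbols, while a code point alphabet depends on which characters appear in the training text.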