bunch of changes that just make sense
karpathy committed Feb 16, 2024
1 parent 1c0520f commit 43ffc2f
Showing 5 changed files with 123 additions and 117 deletions.
26 changes: 10 additions & 16 deletions README.md
@@ -2,27 +2,21 @@

Minimal, clean, educational code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
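
For a concrete sense of what "byte-level" means, here is a small standalone sketch (plain Python, not part of this repo) of the raw UTF-8 byte stream that the tokenizer actually operates on:

```
# not part of the repo: just shows the UTF-8 bytes that byte-level BPE starts from
text = "hello 😉"
text_bytes = text.encode("utf-8")  # ASCII characters take 1 byte each, the emoji takes 4
ids = list(text_bytes)             # before any merges, token ids are just the byte values 0..255
print(ids)                         # [104, 101, 108, 108, 111, 32, 240, 159, 152, 137]
print(bytes(ids).decode("utf-8"))  # decodes right back to "hello 😉"
```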

This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https://github.com/openai/gpt-2) from OpenAI. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.
This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https://github.com/openai/gpt-2) from OpenAI. [Sennrich et al. 2015](https://arxiv.org/abs/1508.07909) is cited as the original reference for the use of BPE in NLP applications. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.

There are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The two tokenizers are:
There are two primary Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The two tokenizers are:

1. [bpe_basic.py](bpe_basic.py): The simplest implementation of the BPE algorithm that runs directly on text.
2. [bpe_regex.py](bpe_regex.py): This implementation further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
1. [bpe_basic.py](bpe_basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
2. [bpe_regex.py](bpe_regex.py): Implements the `RegexTokenizer`, which further splits the input text by a regex pattern, a preprocessing stage that separates the input into categories (think: letters, numbers, punctuation) before tokenization; a simplified sketch of this splitting follows the list below. This ensures that no merges will happen across category boundaries. This approach was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
3. [bpe_gpt4.py](bpe_gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (2, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate and likely historical 1-byte token permutations. Note that the parity is not fully complete yet because we do not handle special tokens.
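
To make the splitting step of the `RegexTokenizer` concrete, here is a rough sketch using a deliberately simplified pattern (not the exact `GPT4_SPLIT_PATTERN` defined in [bpe_regex.py](bpe_regex.py)). Because BPE then runs on each chunk independently and the results are concatenated, merges can never cross these chunk boundaries:

```
# illustrative only: a simplified split pattern, NOT the exact GPT4_SPLIT_PATTERN from bpe_regex.py
import regex as re  # the third-party `regex` module supports \p{...} unicode categories

SIMPLE_SPLIT_PATTERN = r"\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+"
chunks = re.findall(SIMPLE_SPLIT_PATTERN, "Hello world123, how's it going?!")
print(chunks)
# ['Hello', ' ', 'world', '123', ',', ' ', 'how', "'", 's', ' ', 'it', ' ', 'going', '?!']
```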

Finally, the script [train.py](train.py) trains both of these tokenizers on the input text [taylorswift.txt](taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [taylorswift.txt](taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
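
A minimal sketch of that kind of training run, assuming the module and class names listed above and omitting the part that writes the vocab to disk (the actual [train.py](train.py) may differ in details):

```
# a sketch only: assumes the class/module names above; the real train.py may differ
from bpe_basic import BasicTokenizer
from bpe_regex import RegexTokenizer

text = open("taylorswift.txt", "r", encoding="utf-8").read()

for TokenizerClass in [BasicTokenizer, RegexTokenizer]:
    tokenizer = TokenizerClass()
    tokenizer.train(text, 512)  # vocab size picked for illustration: 256 byte tokens + 256 merges
    ids = tokenizer.encode(text)
    assert tokenizer.decode(ids) == text  # encoding then decoding should round-trip exactly
```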

### reproducing tiktoken GPT-4
## todos

The correctness of this code is also established by exactly reproducing the [tiktoken](https://github.com/openai/tiktoken) library, and its encoding/decoding with the GPT-4 tokenizer. In particular, we can take the `_mergeable_ranks` from the GPT4 tokenizer:

```
enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
```

And use them to construct a `RegexTokenizer` that will exactly reproduce the tokenization of GPT4. Run and step through the file [test_gpt4.py](test_gpt4.py) for details.

Note that the parity is not complete because we do not handle special tokens.
- handle special tokens (?)
- save and load Tokenizers to/from disk
- video coming soon ;)

## License

4 changes: 2 additions & 2 deletions bpe_basic.py
@@ -39,7 +39,7 @@ def merge(ids, pair, idx):
return newids


class Tokenizer:
class BasicTokenizer:

def __init__(self):
# by default, we have a vocab size of 256 (all bytes) and no merges
@@ -126,7 +126,7 @@ def encode(self, text):
"""

text = "aaabdaaabac"
tokenizer = Tokenizer()
tokenizer = BasicTokenizer()

# we do 3 merges
tokenizer.train(text, 256 + 3)
105 changes: 105 additions & 0 deletions bpe_gpt4.py
@@ -0,0 +1,105 @@
"""
Implements the GPT-4 Tokenizer with a light wrapper around the RegexTokenizer.
"""

import tiktoken
from bpe_regex import RegexTokenizer


def bpe(mergeable_ranks, token, max_rank):
# helper function used in recover_merges() to reconstruct the merge forest
parts = [bytes([b]) for b in token]
while True:
min_idx = None
min_rank = None
for i, pair in enumerate(zip(parts[:-1], parts[1:])):
rank = mergeable_ranks.get(pair[0] + pair[1])
if rank is not None and (min_rank is None or rank < min_rank):
min_idx = i
min_rank = rank
if min_rank is None or (max_rank is not None and min_rank >= max_rank):
break
assert min_idx is not None
parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
return parts
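
# illustrative note (not in the original file): for a token t with rank R, calling
# bpe(mergeable_ranks, t, max_rank=R) replays only the merges of rank < R on t's bytes,
# which reconstructs exactly the two parts that were merged to form t. e.g. for t = b"ab",
# parts starts as [b"a", b"b"] and the loop stops immediately, since merging them would
# have rank R itself. recover_merges() below relies on this to recover each pair.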


def recover_merges(mergeable_ranks):
# the `merges` are already the byte sequences in their merged state.
# so we have to recover the original pairings. We can do this by doing
# a small BPE training run on all the tokens, in their order.
# also see https://github.com/openai/tiktoken/issues/60
merges = {}
for token, rank in mergeable_ranks.items():
if len(token) == 1:
continue # skip raw bytes
pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
assert len(pair) == 2
# recover the integer ranks of the pair
ix0 = mergeable_ranks[pair[0]]
ix1 = mergeable_ranks[pair[1]]
merges[(ix0, ix1)] = rank

return merges


class GPT4Tokenizer(RegexTokenizer):
"""Lightweight wrapper on RegexTokenizer that matches GPT-4's tokenizer."""

def __init__(self):
super().__init__()
# get the official tokenizer and its merges
enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
# the merges are those of gpt4, but we have to recover them
self.merges = recover_merges(mergeable_ranks)
# reconstruct the vocab from the merges
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in self.merges.items():
vocab[idx] = vocab[p0] + vocab[p1]
self.vocab = vocab
# now here is another tricky part.
# for some reason, the tokens corresponding to individual bytes
are permuted in a different order. This is completely nonsensical
and probably historical, so we have to deal with it here.
self.byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}

def _encode_chunk(self, text_bytes):
# before we start processing bytes, we have to permute them
text_bytes = bytes(self.byte_shuffle[b] for b in text_bytes)
ids = super()._encode_chunk(text_bytes)
return ids

def decode(self, ids):
# we have to un-permute the bytes before we decode
text_bytes = b"".join(self.vocab[idx] for idx in ids)
text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
text = text_bytes.decode("utf-8", errors="replace")
return text

if __name__ == "__main__":
# let's take it for a spin!

# tiktoken
enc = tiktoken.get_encoding("cl100k_base")
# vs.
tokenizer = GPT4Tokenizer()
# fight!

text = "hello world!!!? (안녕하세요!) lol123 😉"
print(text)
print(enc.encode(text)) # tiktoken
print(tokenizer.encode(text)) # ours
print(tokenizer.decode(tokenizer.encode(text))) # ours back to text

# two quick tests: equality (to tiktoken) and identity
print("OK" if enc.encode(text) == tokenizer.encode(text) else "FAIL")
print("OK" if text == tokenizer.decode(tokenizer.encode(text)) else "FAIL")

# let's also tokenize all of taylor swift, a bigger document just to make sure
text = open("taylorswift.txt", "r", encoding="utf-8").read()
t1 = enc.encode(text) # tiktoken
t2 = tokenizer.encode(text) # ours
print("OK" if t1 == t2 else "FAIL")
print("OK" if text == tokenizer.decode(tokenizer.encode(text)) else "FAIL")
20 changes: 6 additions & 14 deletions bpe_regex.py
@@ -48,16 +48,13 @@ def merge(ids, pair, idx):
return newids


class Tokenizer:
class RegexTokenizer:

def __init__(self):
# default to vocab size of 256 (all bytes), no merges and gpt-4 pattern
self.merges = {}
self.vocab = {idx: bytes([idx]) for idx in range(256)}
self.pattern = re.compile(GPT4_SPLIT_PATTERN)
# ugly optional part, only needed for GPT-4 tiktoken compatibility
self.byte_shuffle = None
self.inverse_byte_shuffle = None

def train(self, text, vocab_size, verbose=False):
assert vocab_size >= 256
@@ -100,17 +97,11 @@ def train(self, text, vocab_size, verbose=False):
def decode(self, ids):
# given ids (list of integers), return Python string
text_bytes = b"".join(self.vocab[idx] for idx in ids)
if self.inverse_byte_shuffle is not None:
text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
text = text_bytes.decode("utf-8", errors="replace")
return text

def _encode_chunk(self, text):
# given a string text, return the token ids
text_bytes = text.encode("utf-8") # raw bytes
# needed to repro GPT-4 because OpenAI shuffles its 1-byte tokens order
if self.byte_shuffle is not None:
text_bytes = bytes(self.byte_shuffle[b] for b in text_bytes)
def _encode_chunk(self, text_bytes):
# return the token ids
# let's begin. first, convert all bytes to integers in range 0..255
ids = list(text_bytes)
while len(ids) >= 2:
@@ -134,7 +125,8 @@ def encode(self, text):
# all chunks of text are encoded separately, then results are joined
ids = []
for chunk in text_chunks:
chunk_ids = self._encode_chunk(chunk)
chunk_bytes = chunk.encode("utf-8") # raw bytes
chunk_ids = self._encode_chunk(chunk_bytes)
ids.extend(chunk_ids)
return ids

@@ -163,7 +155,7 @@ def encode(self, text):
"""

text = "aaabdaaabac"
tokenizer = Tokenizer()
tokenizer = RegexTokenizer()

# we do 3 merges
tokenizer.train(text, 256 + 3)
85 changes: 0 additions & 85 deletions test_gpt4.py

This file was deleted.
