bunch of changes that just make sense
karpathy committed Feb 16, 2024
1 parent 1c0520f commit 43ffc2f
Showing 5 changed files with 123 additions and 117 deletions.
26 changes: 10 additions & 16 deletions README.md
@@ -2,27 +2,21 @@

Minimal, clean, educational code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on UTF-8 encoded strings.
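
For a concrete sense of what "byte-level" means, here is a small standalone sketch (plain Python, not part of this repo) of the raw UTF-8 byte stream that the tokenizer actually operates on:

```
# not part of the repo: just shows the UTF-8 bytes that byte-level BPE starts from
text = "hello 😉"
text_bytes = text.encode("utf-8")  # ASCII characters take 1 byte each, the emoji takes 4
ids = list(text_bytes)             # before any merges, token ids are just the byte values 0..255
print(ids)                         # [104, 101, 108, 108, 111, 32, 240, 159, 152, 137]
print(bytes(ids).decode("utf-8"))  # decodes right back to "hello 😉"
```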

This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https://github.com/openai/gpt-2) from OpenAI. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.
This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) and the associated GPT-2 [code release](https://github.com/openai/gpt-2) from OpenAI. [Sennrich et al. 2015](https://arxiv.org/abs/1508.07909) is cited as the original reference for the use of BPE in NLP applications. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.

There are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The two tokenizers are:
There are two primary Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The two tokenizers are:

1. [bpe_basic.py](bpe_basic.py): The simplest implementation of the BPE algorithm that runs directly on text.
2. [bpe_regex.py](bpe_regex.py): This implementation further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
1. [bpe_basic.py](bpe_basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
2. [bpe_regex.py](bpe_regex.py): Implements the `RegexTokenizer`, which further splits the input text by a regex pattern, a preprocessing stage that separates the input into categories (think: letters, numbers, punctuation) before tokenization; a simplified sketch of this splitting follows the list below. This ensures that no merges will happen across category boundaries. This approach was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
3. [bpe_gpt4.py](bpe_gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (2, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and the handling of some unfortunate and likely historical 1-byte token permutations. Note that the parity is not fully complete yet because we do not handle special tokens.
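
To make the splitting step of the `RegexTokenizer` concrete, here is a rough sketch using a deliberately simplified pattern (not the exact `GPT4_SPLIT_PATTERN` defined in [bpe_regex.py](bpe_regex.py)). Because BPE then runs on each chunk independently and the results are concatenated, merges can never cross these chunk boundaries:

```
# illustrative only: a simplified split pattern, NOT the exact GPT4_SPLIT_PATTERN from bpe_regex.py
import regex as re  # the third-party `regex` module supports \p{...} unicode categories

SIMPLE_SPLIT_PATTERN = r"\p{L}+|\p{N}+|[^\s\p{L}\p{N}]+|\s+"
chunks = re.findall(SIMPLE_SPLIT_PATTERN, "Hello world123, how's it going?!")
print(chunks)
# ['Hello', ' ', 'world', '123', ',', ' ', 'how', "'", 's', ' ', 'it', ' ', 'going', '?!']
```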

Finally, the script [train.py](train.py) trains both of these tokenizers on the input text [taylorswift.txt](taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [taylorswift.txt](taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
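
A minimal sketch of that kind of training run, assuming the module and class names listed above and omitting the part that writes the vocab to disk (the actual [train.py](train.py) may differ in details):

```
# a sketch only: assumes the class/module names above; the real train.py may differ
from bpe_basic import BasicTokenizer
from bpe_regex import RegexTokenizer

text = open("taylorswift.txt", "r", encoding="utf-8").read()

for TokenizerClass in [BasicTokenizer, RegexTokenizer]:
    tokenizer = TokenizerClass()
    tokenizer.train(text, 512)  # vocab size picked for illustration: 256 byte tokens + 256 merges
    ids = tokenizer.encode(text)
    assert tokenizer.decode(ids) == text  # encoding then decoding should round-trip exactly
```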

### reproducing tiktoken GPT-4
## todos

The correctness of this code is also established by exactly reproducing the [tiktoken](https://github.com/openai/tiktoken) library, and its encoding/decoding with the GPT-4 tokenizer. In particular, we can take the `_mergeable_ranks` from the GPT4 tokenizer:

```
enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
```

And use them to construct a `RegexTokenizer` that will exactly reproduce the tokenization of GPT4. Run and step through the file [test_gpt4.py](test_gpt4.py) for details.

Note that the parity is not complete because we do not handle special tokens.
- handle special tokens (?)
- save and load Tokenizers to/from disk
- video coming soon ;)

## License

4 changes: 2 additions & 2 deletions bpe_basic.py
@@ -39,7 +39,7 @@ def merge(ids, pair, idx):
return newids


class Tokenizer:
class BasicTokenizer:

def __init__(self):
# by default, we have a vocab size of 256 (all bytes) and no merges
@@ -126,7 +126,7 @@ def encode(self, text):
"""

text = "aaabdaaabac"
tokenizer = Tokenizer()
tokenizer = BasicTokenizer()

# we do 3 merges
tokenizer.train(text, 256 + 3)
105 changes: 105 additions & 0 deletions bpe_gpt4.py
@@ -0,0 +1,105 @@
"""
Implements the GPT-4 Tokenizer with a light wrapper around the RegexTokenizer.
"""

import tiktoken
from bpe_regex import RegexTokenizer


def bpe(mergeable_ranks, token, max_rank):
# helper function used in recover_merges() to reconstruct the merge forest
parts = [bytes([b]) for b in token]
while True:
min_idx = None
min_rank = None
for i, pair in enumerate(zip(parts[:-1], parts[1:])):
rank = mergeable_ranks.get(pair[0] + pair[1])
if rank is not None and (min_rank is None or rank < min_rank):
min_idx = i
min_rank = rank
if min_rank is None or (max_rank is not None and min_rank >= max_rank):
break
assert min_idx is not None
parts = parts[:min_idx] + [parts[min_idx] + parts[min_idx + 1]] + parts[min_idx + 2:]
return parts
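
# illustrative note (not in the original file): for a token t with rank R, calling
# bpe(mergeable_ranks, t, max_rank=R) replays only the merges of rank < R on t's bytes,
# which reconstructs exactly the two parts that were merged to form t. e.g. for t = b"ab",
# parts starts as [b"a", b"b"] and the loop stops immediately, since merging them would
# have rank R itself. recover_merges() below relies on this to recover each pair.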


def recover_merges(mergeable_ranks):
# the `merges` are already the byte sequences in their merged state.
# so we have to recover the original pairings. We can do this by doing
# a small BPE training run on all the tokens, in their order.
# also see https://github.com/openai/tiktoken/issues/60
merges = {}
for token, rank in mergeable_ranks.items():
if len(token) == 1:
continue # skip raw bytes
pair = tuple(bpe(mergeable_ranks, token, max_rank=rank))
assert len(pair) == 2
# recover the integer ranks of the pair
ix0 = mergeable_ranks[pair[0]]
ix1 = mergeable_ranks[pair[1]]
merges[(ix0, ix1)] = rank

return merges


class GPT4Tokenizer(RegexTokenizer):
"""Lightweight wrapper on RegexTokenizer that matches GPT-4's tokenizer."""

def __init__(self):
super().__init__()
# get the official tokenizer and its merges
enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks
# the merges are those of gpt4, but we have to recover them
self.merges = recover_merges(mergeable_ranks)
# reconstruct the vocab from the merges
vocab = {idx: bytes([idx]) for idx in range(256)}
for (p0, p1), idx in self.merges.items():
vocab[idx] = vocab[p0] + vocab[p1]
self.vocab = vocab
# now here is another tricky part.
# for some reason, the tokens corresponding to individual bytes
are permuted in a different order. This is completely nonsensical
and probably historical, so we have to deal with it here.
self.byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}

def _encode_chunk(self, text_bytes):
# before we start processing bytes, we have to permute them
text_bytes = bytes(self.byte_shuffle[b] for b in text_bytes)
ids = super()._encode_chunk(text_bytes)
return ids

def decode(self, ids):
# we have to un-permute the bytes before we decode
text_bytes = b"".join(self.vocab[idx] for idx in ids)
text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
text = text_bytes.decode("utf-8", errors="replace")
return text

if __name__ == "__main__":
# let's take it for a spin!

# tiktoken
enc = tiktoken.get_encoding("cl100k_base")
# vs.
tokenizer = GPT4Tokenizer()
# fight!

text = "hello world!!!? (안녕하세요!) lol123 😉"
print(text)
print(enc.encode(text)) # tiktoken
print(tokenizer.encode(text)) # ours
print(tokenizer.decode(tokenizer.encode(text))) # ours back to text

# two quick tests: equality (to tiktoken) and identity
print("OK" if enc.encode(text) == tokenizer.encode(text) else "FAIL")
print("OK" if text == tokenizer.decode(tokenizer.encode(text)) else "FAIL")

# let's also tokenize all of taylor swift, a bigger document just to make sure
text = open("taylorswift.txt", "r", encoding="utf-8").read()
t1 = enc.encode(text) # tiktoken
t2 = tokenizer.encode(text) # ours
print("OK" if t1 == t2 else "FAIL")
print("OK" if text == tokenizer.decode(tokenizer.encode(text)) else "FAIL")
20 changes: 6 additions & 14 deletions bpe_regex.py
@@ -48,16 +48,13 @@ def merge(ids, pair, idx):
return newids


class Tokenizer:
class RegexTokenizer:

def __init__(self):
# default to vocab size of 256 (all bytes), no merges and gpt-4 pattern
self.merges = {}
self.vocab = {idx: bytes([idx]) for idx in range(256)}
self.pattern = re.compile(GPT4_SPLIT_PATTERN)
# ugly optional part, only needed for GPT-4 tiktoken compatibility
self.byte_shuffle = None
self.inverse_byte_shuffle = None

def train(self, text, vocab_size, verbose=False):
assert vocab_size >= 256
@@ -100,17 +97,11 @@ def train(self, text, vocab_size, verbose=False):
def decode(self, ids):
# given ids (list of integers), return Python string
text_bytes = b"".join(self.vocab[idx] for idx in ids)
if self.inverse_byte_shuffle is not None:
text_bytes = bytes(self.inverse_byte_shuffle[b] for b in text_bytes)
text = text_bytes.decode("utf-8", errors="replace")
return text

def _encode_chunk(self, text):
# given a string text, return the token ids
text_bytes = text.encode("utf-8") # raw bytes
# needed to repro GPT-4 because OpenAI shuffles its 1-byte tokens order
if self.byte_shuffle is not None:
text_bytes = bytes(self.byte_shuffle[b] for b in text_bytes)
def _encode_chunk(self, text_bytes):
# return the token ids
# let's begin. first, convert all bytes to integers in range 0..255
ids = list(text_bytes)
while len(ids) >= 2:
@@ -134,7 +125,8 @@ def encode(self, text):
# all chunks of text are encoded separately, then results are joined
ids = []
for chunk in text_chunks:
chunk_ids = self._encode_chunk(chunk)
chunk_bytes = chunk.encode("utf-8") # raw bytes
chunk_ids = self._encode_chunk(chunk_bytes)
ids.extend(chunk_ids)
return ids

@@ -163,7 +155,7 @@ def encode(self, text):
"""

text = "aaabdaaabac"
tokenizer = Tokenizer()
tokenizer = RegexTokenizer()

# we do 3 merges
tokenizer.train(text, 256 + 3)
85 changes: 0 additions & 85 deletions test_gpt4.py

This file was deleted.
