major refactor to make this a sensible package with nice tests etc. we almost look legitimate now
karpathy committed Feb 18, 2024
1 parent d278867 commit ac3ca85
Showing 10 changed files with 28 additions and 21 deletions.
25 changes: 15 additions & 10 deletions README.md
@@ -6,19 +6,19 @@ This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksyw

There are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The files of the repo are as follows:

1. [bpe_base.py](bpe_base.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save/load functionality, and there are also a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.
2. [bpe_basic.py](bpe_basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
3. [bpe_regex.py](bpe_regex.py): Implements the `RegexTokenizer` that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
4. [bpe_gpt4.py](bpe_gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and around some unfortunate (and likely historical?) 1-byte token permutations. Note that the parity is not fully complete yet because we do not handle special tokens.
1. [minbpe/base.py](minbpe/base.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save/load functionality, and there are also a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.
2. [minbpe/basic.py](minbpe/basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
3. [minbpe/regex.py](minbpe/regex.py): Implements the `RegexTokenizer` that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries (see the short sketch just after this list). This was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
4. [minbpe/gpt4.py](minbpe/gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and around some unfortunate (and likely historical?) 1-byte token permutations. Note that the parity is not fully complete yet because we do not handle special tokens.
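To make the category split in item 3 concrete, here is a minimal sketch of the pre-splitting step. The pattern below is a simplified stand-in, not the exact GPT-2/GPT-4 pattern from [minbpe/regex.py](minbpe/regex.py); the point is only that BPE then runs inside each chunk, so merges can never cross a chunk boundary.

```python
import regex as re  # the third-party `regex` module, which supports \p{...} classes

# simplified illustrative pattern: runs of letters, numbers, punctuation, whitespace
SIMPLE_SPLIT_PATTERN = r" ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+"

chunks = re.findall(SIMPLE_SPLIT_PATTERN, "Hello world 123!!!")
print(chunks)  # ['Hello', ' world', ' 123', '!!!']; BPE merges are applied per chunk
```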

Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [taylorswift.txt](taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [tests/taylorswift.txt](tests/taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.

## usage

All of the files above are very short and thoroughly commented, and also contain a usage example at the bottom of the file. As a quick example, following along with the [Wikipedia article on BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding), we can reproduce it as follows:

```python
from bpe_basic import BasicTokenizer
from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
@@ -30,7 +30,7 @@ tokenizer.save("toy")
# writes two files: toy.model (for loading) and toy.vocab (for viewing)
```

The result above is exactly as expected; please see the bottom of [bpe_basic](bpe_basic.py) for more details. To use the `GPT4Tokenizer`, here is a simple example and comparison to [tiktoken](https://github.com/openai/tiktoken):
The result above is exactly as expected; please see the bottom of [minbpe/basic.py](minbpe/basic.py) for more details. To use the `GPT4Tokenizer`, here is a simple example and comparison to [tiktoken](https://github.com/openai/tiktoken):

```python
text = "hello123!!!? (안녕하세요!) 😉"
@@ -42,7 +42,7 @@ print(enc.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]

# ours
from bpe_gpt4 import GPT4Tokenizer
from minbpe import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
print(tokenizer.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
@@ -52,11 +52,16 @@ print(tokenizer.encode(text))

## tests

The unit tests use pytest. First `pip install pytest` if you haven't already, then `pytest .` to run.
We use the pytest library for tests. All of them are located in the `tests/` directory. First `pip install pytest` if you haven't already, then:

```bash
$ pytest .
```

to run the tests.

## todos

- move the files into minbpe directory / make a nice small package?
- write more optimized versions, both in Python and/or C/Rust?
- handle special tokens? think through...
- video coming soon ;)
4 changes: 4 additions & 0 deletions minbpe/__init__.py
@@ -0,0 +1,4 @@
from .base import Tokenizer
from .basic import BasicTokenizer
from .regex import RegexTokenizer
from .gpt4 import GPT4Tokenizer
bpe_base.py → minbpe/base.py
File renamed without changes.
2 changes: 1 addition & 1 deletion bpe_basic.py → minbpe/basic.py
@@ -9,7 +9,7 @@
- Does not handle any special tokens.
"""

from bpe_base import Tokenizer, get_stats, merge
from .base import Tokenizer, get_stats, merge


class BasicTokenizer(Tokenizer):
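For context on the two helpers imported here, below is a plausible sketch of what `get_stats` and `merge` do, together with the greedy loop that `BasicTokenizer.train` presumably builds around them, shown on the README's toy string. This is an illustration of the algorithm; the repo's actual implementations live in minbpe/base.py and minbpe/basic.py and may differ in details.

```python
def get_stats(ids):
    # count how often each consecutive pair of token ids occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token id `idx`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# greedy BPE training on raw bytes, as in the README's toy example
ids = list("aaabdaaabac".encode("utf-8"))  # raw bytes are the base tokens 0..255
merges = {}
for i in range(3):                         # vocab_size - 256 = 3 merges
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)       # most frequent adjacent pair
    ids = merge(ids, pair, 256 + i)        # mint a new token id for it
    merges[pair] = 256 + i
print(ids, merges)
```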
2 changes: 1 addition & 1 deletion bpe_gpt4.py → minbpe/gpt4.py
@@ -3,7 +3,7 @@
"""

import tiktoken
from bpe_regex import RegexTokenizer
from .regex import RegexTokenizer


def bpe(mergeable_ranks, token, max_rank):
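The `bpe` helper whose signature appears above looks like the core of the merge-recovery trick the README mentions: for every multi-byte token of rank `r` in tiktoken's rank table, re-run byte-level BPE over that token's bytes while only allowing merges of rank lower than `r`; exactly two parts remain, and they are the pair whose merge created the token. Below is a hedged, self-contained sketch of that idea on a made-up rank table; the repo's actual gpt4.py, and the way it pulls the rank table out of tiktoken, may differ.

```python
def bpe(mergeable_ranks, token, max_rank):
    # run byte-level BPE over `token`, but only apply merges of rank < max_rank
    parts = [bytes([b]) for b in token]
    while True:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_rank is None or best_rank >= max_rank:
            break
        parts = parts[:best_idx] + [parts[best_idx] + parts[best_idx + 1]] + parts[best_idx + 2:]
    return parts

def recover_merges(mergeable_ranks):
    # for each multi-byte token, find the two existing tokens whose merge created it
    merges = {}
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # single bytes are the base vocabulary, not merges
        left, right = bpe(mergeable_ranks, token, max_rank=rank)
        merges[(mergeable_ranks[left], mergeable_ranks[right])] = rank
    return merges

# tiny made-up rank table standing in for a real tiktoken one
toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
print(recover_merges(toy_ranks))  # {(0, 1): 3, (3, 2): 4}
```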
2 changes: 1 addition & 1 deletion bpe_regex.py → minbpe/regex.py
@@ -11,7 +11,7 @@
"""

import regex as re
from bpe_base import Tokenizer, get_stats, merge
from .base import Tokenizer, get_stats, merge


# the main GPT text split patterns, see
Empty file added tests/__init__.py
Empty file.
taylorswift.txt → tests/taylorswift.txt
File renamed without changes.
8 changes: 4 additions & 4 deletions test_tokenizer.py → tests/test_tokenizer.py
@@ -2,16 +2,15 @@
import tiktoken
import os

from bpe_basic import BasicTokenizer
from bpe_gpt4 import GPT4Tokenizer
from bpe_regex import RegexTokenizer
from minbpe import BasicTokenizer, RegexTokenizer, GPT4Tokenizer

# a few strings to test the tokenizers on
taylorswift_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), "taylorswift.txt")
test_strings = [
"", # empty string
"?", # single character
"hello world!!!? (안녕하세요!) lol123 😉", # fun small string
open("taylorswift.txt", "r", encoding="utf-8").read(), # big string
open(taylorswift_file, "r", encoding="utf-8").read(), # big string
]

# test encode/decode identity for a few different strings
@@ -82,6 +81,7 @@ def test_save_load():
# verify that save/load work as expected
ids = tokenizer.encode(text)

# TODO use a proper temporary directory for I/O things below
# save the tokenizer
tokenizer.save("test_tokenizer_tmp")

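Regarding the TODO above about using a proper temporary directory: pytest's built-in `tmp_path` fixture is one way to do it. The sketch below assumes the base class exposes a `load` counterpart to `save` that reads the `.model` file; the actual test and API may differ.

```python
from minbpe import BasicTokenizer

def test_save_load_roundtrip(tmp_path):
    # tmp_path is a pathlib.Path to a per-test temporary directory managed by pytest
    text = "hello world!!!? (안녕하세요!) lol123 😉"
    tokenizer = BasicTokenizer()
    tokenizer.train(text, 256 + 8)
    ids = tokenizer.encode(text)

    prefix = str(tmp_path / "toy")
    tokenizer.save(prefix)            # writes toy.model and toy.vocab inside tmp_path

    reloaded = BasicTokenizer()
    reloaded.load(prefix + ".model")  # assumed API: load the .model file written by save
    assert reloaded.decode(ids) == text
```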
6 changes: 2 additions & 4 deletions train.py
@@ -3,12 +3,10 @@
The whole thing runs in ~25 seconds on my laptop.
"""

# feel free to use either
from bpe_regex import RegexTokenizer
from bpe_basic import BasicTokenizer
from minbpe import BasicTokenizer, RegexTokenizer

# open some text and train a vocab of 512 tokens
text = open("taylorswift.txt", "r", encoding="utf-8").read()
text = open("tests/taylorswift.txt", "r", encoding="utf-8").read()

for TokenizerClass, name in zip([BasicTokenizer, RegexTokenizer], ["basic", "regex"]):

