major refactor to make this a sensible package with nice tests etc. we almost look legitimate now
karpathy committed Feb 18, 2024
1 parent d278867 commit ac3ca85
Showing 10 changed files with 28 additions and 21 deletions.
25 changes: 15 additions & 10 deletions README.md
@@ -6,19 +6,19 @@ This algorithm was popularized for LLMs by the [GPT-2 paper](https://d4mucfpksyw

There are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The files of the repo are as follows:

1. [bpe_base.py](bpe_base.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save/load functionality, and there are also a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.
2. [bpe_basic.py](bpe_basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
3. [bpe_regex.py](bpe_regex.py): Implements the `RegexTokenizer` that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
4. [bpe_gpt4.py](bpe_gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and around some unfortunate (and likely historical?) 1-byte token permutations. Note that the parity is not fully complete yet because we do not handle special tokens.
1. [minbpe/base.py](minbpe/base.py): Implements the `Tokenizer` class, which is the base class. It contains the `train`, `encode`, and `decode` stubs, save/load functionality, and there are also a few common utility functions. This class is not meant to be used directly, but rather to be inherited from.
2. [minbpe/basic.py](minbpe/basic.py): Implements the `BasicTokenizer`, the simplest implementation of the BPE algorithm that runs directly on text.
3. [minbpe/regex.py](minbpe/regex.py): Implements the `RegexTokenizer` that further splits the input text by a regex pattern, which is a preprocessing stage that splits up the input text by categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries (see the short sketch just after this list). This was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
4. [minbpe/gpt4.py](minbpe/gpt4.py): Implements the `GPT4Tokenizer`. This class is a light wrapper around the `RegexTokenizer` (3, above) that exactly reproduces the tokenization of GPT-4 in the [tiktoken](https://github.com/openai/tiktoken) library. The wrapping handles some details around recovering the exact merges in the tokenizer, and around some unfortunate (and likely historical?) 1-byte token permutations. Note that the parity is not fully complete yet because we do not handle special tokens.
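To make the category split in item 3 concrete, here is a minimal sketch of the pre-splitting step. The pattern below is a simplified stand-in, not the exact GPT-2/GPT-4 pattern from [minbpe/regex.py](minbpe/regex.py); the point is only that BPE then runs inside each chunk, so merges can never cross a chunk boundary.

```python
import regex as re  # the third-party `regex` module, which supports \p{...} classes

# simplified illustrative pattern: runs of letters, numbers, punctuation, whitespace
SIMPLE_SPLIT_PATTERN = r" ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+"

chunks = re.findall(SIMPLE_SPLIT_PATTERN, "Hello world 123!!!")
print(chunks)  # ['Hello', ' world', ' 123', '!!!']; BPE merges are applied per chunk
```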

Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [taylorswift.txt](taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [tests/taylorswift.txt](tests/taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.

## usage

All of the files above are very short and thoroughly commented, and also contain a usage example at the bottom of the file. As a quick example, following along with the [Wikipedia article on BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding), we can reproduce it as follows:

```python
from bpe_basic import BasicTokenizer
from minbpe import BasicTokenizer
tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
@@ -30,7 +30,7 @@ tokenizer.save("toy")
# writes two files: toy.model (for loading) and toy.vocab (for viewing)
```

The result above is exactly as expected; please see the bottom of [bpe_basic](bpe_basic.py) for more details. To use the `GPT4Tokenizer`, here is a simple example and comparison to [tiktoken](https://github.com/openai/tiktoken):
The result above is exactly as expected; please see the bottom of [minbpe/basic.py](minbpe/basic.py) for more details. To use the `GPT4Tokenizer`, here is a simple example and comparison to [tiktoken](https://github.com/openai/tiktoken):

```python
text = "hello123!!!? (안녕하세요!) 😉"
@@ -42,7 +42,7 @@ print(enc.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]

# ours
from bpe_gpt4 import GPT4Tokenizer
from minbpe import GPT4Tokenizer
tokenizer = GPT4Tokenizer()
print(tokenizer.encode(text))
# [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
@@ -52,11 +52,16 @@ print(tokenizer.encode(text))

## tests

The unit tests use pytest. First `pip install pytest` if you haven't already, then `pytest .` to run.
We use the pytest library for tests. All of them are located in the `tests/` directory. First `pip install pytest` if you haven't already, then:

```bash
$ pytest .
```

to run the tests.

## todos

- move the files into minbpe directory / make a nice small package?
- write more optimized versions, both in Python and/or C/Rust?
- handle special tokens? think through...
- video coming soon ;)
4 changes: 4 additions & 0 deletions minbpe/__init__.py
@@ -0,0 +1,4 @@
from .base import Tokenizer
from .basic import BasicTokenizer
from .regex import RegexTokenizer
from .gpt4 import GPT4Tokenizer
bpe_base.py → minbpe/base.py
File renamed without changes.
2 changes: 1 addition & 1 deletion bpe_basic.py → minbpe/basic.py
@@ -9,7 +9,7 @@
- Does not handle any special tokens.
"""

from bpe_base import Tokenizer, get_stats, merge
from .base import Tokenizer, get_stats, merge


class BasicTokenizer(Tokenizer):
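For context on the two helpers imported here, below is a plausible sketch of what `get_stats` and `merge` do, together with the greedy loop that `BasicTokenizer.train` presumably builds around them, shown on the README's toy string. This is an illustration of the algorithm; the repo's actual implementations live in minbpe/base.py and minbpe/basic.py and may differ in details.

```python
def get_stats(ids):
    # count how often each consecutive pair of token ids occurs
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # replace every occurrence of `pair` in `ids` with the new token id `idx`
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# greedy BPE training on raw bytes, as in the README's toy example
ids = list("aaabdaaabac".encode("utf-8"))  # raw bytes are the base tokens 0..255
merges = {}
for i in range(3):                         # vocab_size - 256 = 3 merges
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)       # most frequent adjacent pair
    ids = merge(ids, pair, 256 + i)        # mint a new token id for it
    merges[pair] = 256 + i
print(ids, merges)
```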
2 changes: 1 addition & 1 deletion bpe_gpt4.py → minbpe/gpt4.py
@@ -3,7 +3,7 @@
"""

import tiktoken
from bpe_regex import RegexTokenizer
from .regex import RegexTokenizer


def bpe(mergeable_ranks, token, max_rank):
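The `bpe` helper whose signature appears above looks like the core of the merge-recovery trick the README mentions: for every multi-byte token of rank `r` in tiktoken's rank table, re-run byte-level BPE over that token's bytes while only allowing merges of rank lower than `r`; exactly two parts remain, and they are the pair whose merge created the token. Below is a hedged, self-contained sketch of that idea on a made-up rank table; the repo's actual gpt4.py, and the way it pulls the rank table out of tiktoken, may differ.

```python
def bpe(mergeable_ranks, token, max_rank):
    # run byte-level BPE over `token`, but only apply merges of rank < max_rank
    parts = [bytes([b]) for b in token]
    while True:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = mergeable_ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_rank is None or best_rank >= max_rank:
            break
        parts = parts[:best_idx] + [parts[best_idx] + parts[best_idx + 1]] + parts[best_idx + 2:]
    return parts

def recover_merges(mergeable_ranks):
    # for each multi-byte token, find the two existing tokens whose merge created it
    merges = {}
    for token, rank in mergeable_ranks.items():
        if len(token) == 1:
            continue  # single bytes are the base vocabulary, not merges
        left, right = bpe(mergeable_ranks, token, max_rank=rank)
        merges[(mergeable_ranks[left], mergeable_ranks[right])] = rank
    return merges

# tiny made-up rank table standing in for a real tiktoken one
toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
print(recover_merges(toy_ranks))  # {(0, 1): 3, (3, 2): 4}
```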
2 changes: 1 addition & 1 deletion bpe_regex.py → minbpe/regex.py
@@ -11,7 +11,7 @@
"""

import regex as re
from bpe_base import Tokenizer, get_stats, merge
from .base import Tokenizer, get_stats, merge


# the main GPT text split patterns, see
Empty file added tests/__init__.py
Empty file.
taylorswift.txt → tests/taylorswift.txt
File renamed without changes.
8 changes: 4 additions & 4 deletions test_tokenizer.py → tests/test_tokenizer.py
@@ -2,16 +2,15 @@
import tiktoken
import os

from bpe_basic import BasicTokenizer
from bpe_gpt4 import GPT4Tokenizer
from bpe_regex import RegexTokenizer
from minbpe import BasicTokenizer, RegexTokenizer, GPT4Tokenizer

# a few strings to test the tokenizers on
taylorswift_file = os.path.join(os.path.dirname(os.path.abspath(__file__)), "taylorswift.txt")
test_strings = [
"", # empty string
"?", # single character
"hello world!!!? (안녕하세요!) lol123 😉", # fun small string
open("taylorswift.txt", "r", encoding="utf-8").read(), # big string
open(taylorswift_file, "r", encoding="utf-8").read(), # big string
]

# test encode/decode identity for a few different strings
@@ -82,6 +81,7 @@ def test_save_load():
# verify that save/load work as expected
ids = tokenizer.encode(text)

# TODO use a proper temporary directory for I/O things below
# save the tokenizer
tokenizer.save("test_tokenizer_tmp")

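Regarding the TODO above about using a proper temporary directory: pytest's built-in `tmp_path` fixture is one way to do it. The sketch below assumes the base class exposes a `load` counterpart to `save` that reads the `.model` file; the actual test and API may differ.

```python
from minbpe import BasicTokenizer

def test_save_load_roundtrip(tmp_path):
    # tmp_path is a pathlib.Path to a per-test temporary directory managed by pytest
    text = "hello world!!!? (안녕하세요!) lol123 😉"
    tokenizer = BasicTokenizer()
    tokenizer.train(text, 256 + 8)
    ids = tokenizer.encode(text)

    prefix = str(tmp_path / "toy")
    tokenizer.save(prefix)            # writes toy.model and toy.vocab inside tmp_path

    reloaded = BasicTokenizer()
    reloaded.load(prefix + ".model")  # assumed API: load the .model file written by save
    assert reloaded.decode(ids) == text
```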
6 changes: 2 additions & 4 deletions train.py
@@ -3,12 +3,10 @@
The whole thing runs in ~25 seconds on my laptop.
"""

# feel free to use either
from bpe_regex import RegexTokenizer
from bpe_basic import BasicTokenizer
from minbpe import BasicTokenizer, RegexTokenizer

# open some text and train a vocab of 512 tokens
text = open("taylorswift.txt", "r", encoding="utf-8").read()
text = open("tests/taylorswift.txt", "r", encoding="utf-8").read()

for TokenizerClass, name in zip([BasicTokenizer, RegexTokenizer], ["basic", "regex"]):

