
Commit 4ccf644

refactor the readme file and adjust the API on how special tokens are handled
1 parent 37b63c2 commit 4ccf644

4 files changed

Lines changed: 67 additions & 10 deletions


README.md

Lines changed: 53 additions & 5 deletions
@@ -13,9 +13,11 @@ There are two Tokenizers in this repository, both of which can perform the 3 pri
 
 Finally, the script [train.py](train.py) trains the two major tokenizers on the input text [tests/taylorswift.txt](tests/taylorswift.txt) (this is the Wikipedia entry for her kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
 
-## usage
+All of the files above are very short and thoroughly commented, and also contain a usage example on the bottom of the file.
 
-All of the files above are very short and thoroughly commented, and also contain a usage example on the bottom of the file. As a quick example, following along the [Wikipedia article on BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding), we can reproduce it as follows:
+## quick start
+
+As the simplest example, we can reproduce the [Wikipedia article on BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) as follows:
 
 ```python
 from minbpe import BasicTokenizer
@@ -30,7 +32,11 @@ tokenizer.save("toy")
 # writes two files: toy.model (for loading) and toy.vocab (for viewing)
 ```
 
-The result above is exactly as expected, please see bottom of [minbpe/basic.py](minbpe/basic.py) for more details. To use the `GPT4Tokenizer`, simple example and comparison to [tiktoken](https://github.com/openai/tiktoken):
+According to Wikipedia, running BPE on the input string "aaabdaaabac" for 3 merges results in the string "XdXac", where X=ZY, Y=ab, and Z=aa. The tricky thing to note is that minbpe always allocates the 256 individual bytes as tokens, and then merges bytes as needed from there. So for us a=97, b=98, c=99, d=100 (their [ASCII](https://www.asciitable.com) values). Then when (a,a) is merged to Z, Z will become 256. Likewise Y will become 257 and X 258. So we start with the 256 bytes, do 3 merges, and arrive at the result above, with the expected output of [258, 100, 258, 97, 99].
+
+## inference: GPT-4 comparison
+
+We can verify that the `RegexTokenizer` has feature parity with the GPT-4 tokenizer from [tiktoken](https://github.com/openai/tiktoken) as follows:
 
 ```python
 text = "hello123!!!? (안녕하세요!) 😉"
@@ -48,7 +54,49 @@ print(tokenizer.encode(text))
 # [15339, 4513, 12340, 30, 320, 31495, 230, 75265, 243, 92245, 16715, 57037]
 ```
 
-(you'll have to `pip install tiktoken` to run).
+(you'll have to `pip install tiktoken` to run). Under the hood, the `GPT4Tokenizer` is just a light wrapper around `RegexTokenizer`, passing in the merges and the special tokens of GPT-4.
+
+## training
+
+Unlike tiktoken, this code allows you to train your own tokenizer. In principle and to my knowledge, if you train the `RegexTokenizer` on a large dataset with a vocabulary size of 100K, you would reproduce the GPT-4 tokenizer.
+
+There are two paths you can follow. First, you can decide that you don't want the complexity of splitting and preprocessing text with regex patterns, and you also don't care for special tokens. In that case, reach for the `BasicTokenizer`. You can train it, and then encode and decode, for example, as follows:
+
+```python
+from minbpe import BasicTokenizer
+tokenizer = BasicTokenizer()
+tokenizer.train(very_long_training_string, vocab_size=4096)
+tokenizer.encode("hello world") # string -> tokens
+tokenizer.decode([1000, 2000, 3000]) # tokens -> string
+tokenizer.save("mymodel") # writes mymodel.model and mymodel.vocab
+tokenizer.load("mymodel.model") # loads the model back, the vocab is just for vis
+```
+
+If you instead want to follow along with what OpenAI did for their text tokenizer, it's a good idea to adopt their approach of using a regex pattern to split the text by categories. The GPT-4 pattern is the default with the `RegexTokenizer`, so you'd simply do something like:
+
+```python
+from minbpe import RegexTokenizer
+tokenizer = RegexTokenizer()
+tokenizer.train(very_long_training_string, vocab_size=32768)
+tokenizer.encode("hello world") # string -> tokens
+tokenizer.decode([1000, 2000, 3000]) # tokens -> string
+tokenizer.save("tok32k") # writes tok32k.model and tok32k.vocab
+tokenizer.load("tok32k.model") # loads the model back from disk
+```
+
+Where, of course, you'd want to change around the vocabulary size depending on the size of your dataset.
+
+**Special tokens**. Finally, you might wish to add special tokens to your tokenizer. Register these using the `register_special_tokens` function. For example, if you train with a vocab_size of 32768, then the first 256 tokens are raw byte tokens, the next 32768-256 are merge tokens, and after those you can add the special tokens. The last "real" merge token will have an id of 32767 (vocab_size - 1), so your first special token should come right after that, with an id of exactly 32768. So:
+
+```python
+from minbpe import RegexTokenizer
+tokenizer = RegexTokenizer()
+tokenizer.train(very_long_training_string, vocab_size=32768)
+tokenizer.register_special_tokens({"<|endoftext|>": 32768})
+tokenizer.encode("<|endoftext|>hello world")
+```
+
+You can of course add more tokens after that as well, as you like. Finally, I'd like to stress that I tried hard to keep the code itself clean, readable and hackable. You should not feel scared to read the code and understand how it works. The tests are also a nice place to look for more usage examples. That reminds me:
 
 ## tests
 
@@ -64,7 +112,7 @@ to run the tests. (-v is verbose, slightly prettier).
 
 - write a more optimized Python version that could run over large files and big vocabs
 - write an even more optimized C or Rust version (think through)
-- rename GPT4Tokenizer to GPTTokenizer and support GPT-2 as well?
+- rename GPT4Tokenizer to GPTTokenizer and support GPT-2/GPT-3/GPT-3.5 as well?
 - write a LlamaTokenizer similar to GPT4Tokenizer (i.e. attempt sentencepiece equivalent)
 - video coming soon ;)
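The quick start walkthrough can likewise be sanity-checked in a few lines, reusing the README's own `BasicTokenizer` example:

```python
from minbpe import BasicTokenizer

# 256 raw byte tokens + 3 merges on the Wikipedia example string
tokenizer = BasicTokenizer()
tokenizer.train("aaabdaaabac", vocab_size=256 + 3)
ids = tokenizer.encode("aaabdaaabac")
print(ids)  # expected per the walkthrough: [258, 100, 258, 97, 99]
assert tokenizer.decode(ids) == "aaabdaaabac"  # decode inverts encode
```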

minbpe/gpt4.py

Lines changed: 3 additions & 1 deletion
@@ -57,7 +57,7 @@ class GPT4Tokenizer(RegexTokenizer):
     """Lightweight wrapper on RegexTokenizer that matches GPT-4's tokenizer."""
 
     def __init__(self):
-        super().__init__(pattern=GPT4_SPLIT_PATTERN, special_tokens=GPT4_SPECIAL_TOKENS)
+        super().__init__(pattern=GPT4_SPLIT_PATTERN)
         # get the official tokenizer and its merges
         enc = tiktoken.get_encoding("cl100k_base")
         mergeable_ranks = enc._mergeable_ranks
@@ -74,6 +74,8 @@ def __init__(self):
         # and probably historical, but therefore we have to deal with it here.
         self.byte_shuffle = {i: mergeable_ranks[bytes([i])] for i in range(256)}
         self.inverse_byte_shuffle = {v: k for k, v in self.byte_shuffle.items()}
+        # finally register the special tokens
+        self.register_special_tokens(GPT4_SPECIAL_TOKENS)
 
     def _encode_chunk(self, text_bytes):
         # before we start processing bytes, we have to permute them
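The hunk cuts off inside `_encode_chunk`. For readers following the byte-shuffle discussion, a sketch of what the permutation plausibly looks like, as it might appear inside `GPT4Tokenizer` (an assumption based on the `byte_shuffle` table built in `__init__` above, not the file's verbatim code):

```python
    def _encode_chunk(self, text_bytes):
        # sketch: map raw bytes into GPT-4's shuffled byte order first,
        # then run the ordinary BPE merges inherited from RegexTokenizer
        text_bytes = bytes(self.byte_shuffle[b] for b in text_bytes)
        return super()._encode_chunk(text_bytes)
```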

minbpe/regex.py

Lines changed: 9 additions & 3 deletions
@@ -21,17 +21,17 @@
 
 class RegexTokenizer(Tokenizer):
 
-    def __init__(self, pattern=None, special_tokens=None):
+    def __init__(self, pattern=None):
         """
         - pattern: optional string to override the default (GPT-4 split pattern)
         - special_tokens: str -> int dictionary of special tokens
           example: {'<|endoftext|>': 100257}
         """
         super().__init__()
         self.pattern = GPT4_SPLIT_PATTERN if pattern is None else pattern
-        self.special_tokens = {} if special_tokens is None else special_tokens
         self.compiled_pattern = re.compile(self.pattern)
-        self.inverse_special_tokens = {v: k for k, v in self.special_tokens.items()}
+        self.special_tokens = {}
+        self.inverse_special_tokens = {}
 
     def train(self, text, vocab_size, verbose=False):
         assert vocab_size >= 256
@@ -69,6 +69,12 @@ def train(self, text, vocab_size, verbose=False):
         self.merges = merges # used in encode()
         self.vocab = vocab # used in decode()
 
+    def register_special_tokens(self, special_tokens):
+        # special_tokens is a dictionary of str -> int
+        # example: {"<|endoftext|>": 100257}
+        self.special_tokens = special_tokens
+        self.inverse_special_tokens = {v: k for k, v in special_tokens.items()}
+
     def decode(self, ids):
         # given ids (list of integers), return Python string
         part_bytes = []
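The hunk ends just as `decode` begins. A sketch of how the loop might consult the `inverse_special_tokens` mapping registered above, as the method could appear inside `RegexTokenizer` (an illustration under that assumption, not the file's verbatim code):

```python
    def decode(self, ids):
        # given ids (list of integers), return Python string
        part_bytes = []
        for idx in ids:
            if idx in self.vocab:
                # ordinary token: look up its bytes in the trained vocab
                part_bytes.append(self.vocab[idx])
            elif idx in self.inverse_special_tokens:
                # special token: recover the registered string, re-encode as bytes
                part_bytes.append(self.inverse_special_tokens[idx].encode("utf-8"))
            else:
                raise ValueError(f"invalid token id: {idx}")
        return b"".join(part_bytes).decode("utf-8", errors="replace")
```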

tests/test_tokenizer.py

Lines changed: 2 additions & 1 deletion
@@ -111,8 +111,9 @@ def test_save_load(special_tokens):
     # take a bit more complex piece of text and train the tokenizer, chosen at random
     text = llama_text
     # create a Tokenizer and do 64 merges
-    tokenizer = RegexTokenizer(special_tokens=special_tokens)
+    tokenizer = RegexTokenizer()
     tokenizer.train(text, 256 + 64)
+    tokenizer.register_special_tokens(special_tokens)
     # verify that decode(encode(x)) == x
     assert tokenizer.decode(tokenizer.encode(text)) == text
     # verify that save/load work as expected
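The same migration applies to any downstream caller of `RegexTokenizer`: special tokens no longer go through the constructor. A hypothetical before/after (`register_special_tokens` only stores the mapping, so it can be called before or after training):

```python
from minbpe import RegexTokenizer

# before this commit, special tokens were a constructor argument:
#   tokenizer = RegexTokenizer(special_tokens={"<|endoftext|>": 100257})

# after this commit: construct first, then register explicitly
tokenizer = RegexTokenizer()
tokenizer.register_special_tokens({"<|endoftext|>": 100257})
```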
