Minimal, clean, educational code for the (byte-level) Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The BPE algorithm is "byte-level" because it runs on the raw bytes of UTF-8 encoded text.
This algorithm was popularized for LLMs by the GPT-2 paper and the associated GPT-2 code release from OpenAI. Today, all modern LLMs (e.g. GPT, Llama, Mistral) use this algorithm to train their tokenizers.
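To make the core idea concrete, here is a minimal, illustrative sketch of a single BPE merge step (this is not the code in this repository, and the helper names are made up for the example): count the most frequent adjacent pair of token ids and replace every occurrence of it with a newly minted token id.

```python
# Illustrative sketch of one byte-level BPE merge step; not this repository's code.
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))  # raw bytes become the initial token ids (0..255)

def get_pair_counts(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):  # count every adjacent pair of token ids
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)  # replace the matched pair with the new token id
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

counts = get_pair_counts(ids)
top_pair = max(counts, key=counts.get)  # most frequent adjacent pair
ids = merge(ids, top_pair, 256)         # 256 is the first id beyond the raw bytes
print(top_pair, ids)
```

A full training run simply repeats this step until the desired vocabulary size is reached, recording each merged pair in order so that encode and decode can replay the merges later.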
There are two Tokenizers in this repository, both of which can perform the 3 primary functions of a Tokenizer: 1) train the tokenizer vocabulary and merges on a given text, 2) encode from text to tokens, 3) decode from tokens to text. The two tokenizers are (see the usage sketch after this list):
- bpe_basic.py: The simplest implementation of the BPE algorithm that runs directly on text.
- bpe_regex.py: This implementation additionally splits the input text with a regex pattern, a preprocessing stage that separates the text into categories (think: letters, numbers, punctuation) before tokenization. This ensures that no merges will happen across category boundaries. This approach was introduced in the GPT-2 paper and continues to be in use as of GPT-4.
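As a rough usage sketch, both tokenizers might be driven as below. The class names (BasicTokenizer, RegexTokenizer) and the train/encode/decode signatures are assumptions for illustration; check bpe_basic.py and bpe_regex.py for the actual interface.

```python
# Hypothetical usage sketch; class and method names are assumptions,
# see bpe_basic.py and bpe_regex.py for the actual interface.
from bpe_basic import BasicTokenizer
from bpe_regex import RegexTokenizer

text = open("taylorswift.txt", "r", encoding="utf-8").read()

for Tokenizer in [BasicTokenizer, RegexTokenizer]:
    tok = Tokenizer()
    tok.train(text, vocab_size=512)           # 256 byte tokens + 256 learned merges
    ids = tok.encode("hello world")           # text -> token ids
    assert tok.decode(ids) == "hello world"   # token ids -> text, lossless roundtrip
```

The only behavioral difference between the two is that the regex version first chunks the text by category, so learned merges never span, say, a letter/punctuation boundary.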
Finally, the script train.py trains both of these tokenizers on the input text taylorswift.txt (this is the Wikipedia entry for her, kek) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
The correctness of this code is also established by exactly reproducing the tiktoken library's encoding and decoding with the GPT-4 tokenizer. In particular, we can take the _mergeable_ranks from the GPT-4 tokenizer:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    mergeable_ranks = enc._mergeable_ranks

and use them to construct a RegexTokenizer that exactly reproduces the tokenization of GPT-4. Run and step through the file test_gpt4.py for details.
Note that the parity is not complete because we do not handle special tokens.
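Conceptually, the parity check looks something like the sketch below. The tiktoken calls are real; the final comparison against our RegexTokenizer is only indicated in a comment, since the construction from the merge ranks lives in test_gpt4.py.

```python
# Sketch of the kind of check test_gpt4.py performs; the tiktoken calls are real,
# the comparison against our RegexTokenizer is only indicated in comments.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
mergeable_ranks = enc._mergeable_ranks    # dict mapping token bytes -> token id (rank)

text = "hello world!!!? (some edge cases) 123 tokens"
reference_ids = enc.encode(text)          # ground-truth GPT-4 tokenization
assert enc.decode(reference_ids) == text  # tiktoken round-trips losslessly

# test_gpt4.py then builds a RegexTokenizer from mergeable_ranks and asserts that
# its encode() output matches reference_ids exactly (special tokens excluded).
```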
License: MIT