- Diving into the HuggingFace tokenizer
- A minimal implementation
- Step-by-step walkthrough
- Next Chapter
Well, let's first think about state: what information does a tokenizer need to save?
Before we dive in, it's helpful to - you won't believe this - actually check out the saved tokenizers for different models. For example, here's the GPT-2 tokenizer. That one is saved in the older format, so we can also take a look at, for example, Falcon's tokenizer. Make sure to scroll through the large `tokenizer.json` files to get an idea of what's in there.
Let's consider a BPE tokenizer. In HuggingFace, you can save a tokenizer by calling the `save_pretrained` method (there's a small snippet after the list below if you want to generate these files yourself). Typically, you will see the following files for a BPE tokenizer:
- [DEPR] `added_tokens.json`: Part of the older format for saving HF tokenizers. It's a little hard to figure out what this is for, since we have an "added_tokens" entry in the `tokenizer.json` file itself. Further, this doesn't actually have all the AddedTokens of your tokenizer (which includes special tokens for some tokenizers like DeBERTa and Llama).
- [DEPR] `merges.txt`: Saved in the older format for BPE tokenizers. Contains a list of BPE merge rules to be used while encoding a text sequence.
- `special_tokens_map.json`: A dictionary of special token attribute names ("bos_token", etc.), their values ("<BOS>"), and some metadata. What makes special tokens so special? These are commonly used tokens that are not a part of the corpus but have certain important designations (BOS = beginning of sequence, EOS = end of sequence, etc.). All of these special tokens are accessible as attributes of the tokenizer directly, i.e. you can call `tokenizer.eos_token` for any HF tokenizer, since they all subclass the `SpecialTokensMixin` class. Maintaining such additional information is a good idea for obvious reasons: none of these are actually a part of your training corpus. You'd also want to add certain special tokens by default when you encode a piece of text (EOS, or BOS+EOS, etc.). This is the post-processing step, covered in chapter-6. In 🤗 Tokenizers, you can also add `additional_special_tokens`, which can be tokens you use in the model's prompt templates (like `[INSTR]`, etc.).
- `tokenizer_config.json`: Some tokenizer-specific config parameters, such as the maximum sequence length the model was trained on (`model_max_length`), some information on special tokens, etc.
- `tokenizer.json`: Some notable entries:
  - `add_bos_token`: State for whether to add the BOS token by default when you call the tokenizer. Caveats on this later.
  - `added_tokens`: A list of new tokens added via `tokenizer.add_tokens` / tokens in `additional_special_tokens`. When you call `tokenizer.add_tokens`, the new token is, by default, maintained as an `AddedToken` object, and not just a string. The difference is that an `AddedToken` can have special behaviour: you might match both `<ADD>` and ` <ADD>` (note the left whitespace) to be the same token, specify whether the token should be matched in a normalized version of the text, etc.
  - `model`: Information about the tokenizer architecture/algorithm ("type" -> "BPE", for example). Also includes the vocabulary (mapping tokens -> token ids), and additional state such as merge rules for BPE. Each merge rule is really just a tuple of tokens to merge; 🤗 stores this tuple as one space-separated string, e.g. "i am".
  - `normalizer`: The normalizer to use before segmentation. `null` for GPT-2 and Falcon.
- [DEPR] `vocab.json`: Saved in the older format. Contains a dictionary mapping tokens to token ids. This information is now stored in `tokenizer.json`.
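To make this concrete, here's a quick way to dump these files yourself (a minimal sketch; the exact set of files you get depends on the tokenizer and the `transformers` version):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.save_pretrained("./gpt2-tok")
# Writes files like tokenizer.json, tokenizer_config.json, special_tokens_map.json,
# and (for the legacy format) vocab.json and merges.txt into ./gpt2-tok
```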
Let's take a look at how a HF tokenizer stores the vocabulary, added tokens, etc., along with the different functionality it provides (as always, these are tightly coupled). For simplicity, I am only going to look into the slow tokenizers, implemented in Python, as opposed to the fast tokenizers implemented in Rust, as I basically haven't learnt Rust yet (my apologies to the Cargo cult). Here's what the initialization looks like:
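The actual HF source has many more options, but the state being set up boils down to the vocabulary, the merge ranks, and a `Trie` for added tokens. Here's a simplified sketch (the `SketchGPT2Tokenizer` class below is mine, not HF's):

```python
import json

from transformers.tokenization_utils import Trie


class SketchGPT2Tokenizer:
    """A simplified sketch of the state the slow GPT2Tokenizer sets up at init time."""

    def __init__(self, vocab_file: str, merges_file: str):
        # token -> token_id mapping (the "encoder")
        with open(vocab_file, encoding="utf-8") as f:
            self.encoder = json.load(f)
        # token_id -> token mapping (the "decoder")
        self.decoder = {v: k for k, v in self.encoder.items()}
        # Merge rules in the order they were learned:
        # earlier merge => lower rank => higher priority when encoding
        with open(merges_file, encoding="utf-8") as f:
            lines = [l for l in f.read().splitlines() if l and not l.startswith("#")]
        self.bpe_ranks = {tuple(line.split()): rank for rank, line in enumerate(lines)}
        # Prefix tree holding added/special tokens, used to split them out
        # before the BPE algorithm ever sees the text
        self.tokens_trie = Trie()
```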
So are all the tokens stored in a prefix tree/Trie? No! This is only for `added_tokens`. For example, with GPT-2, this trie will only store one token by default: `<|endoftext|>`. For some custom tokenizers like ByT5, the number of added tokens is in the hundreds, and so using a Trie makes a difference. This becomes useful when you are customizing your tokenizer by adding new tokens with the `tokenizer.add_tokens` method. (Reference). The `added_tokens` Trie has two methods:
- `trie.add(word)`: Adds a word to the prefix tree.
- `trie.split(text)`: Splits a string into chunks, separated at the boundaries of tokens in the trie. Ex: `This is <|myspecialtoken|>` -> `["This is ", "<|myspecialtoken|>"]`
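As a quick illustration (the `Trie` class lives in `transformers.tokenization_utils`, so you can play with it directly):

```python
from transformers.tokenization_utils import Trie

trie = Trie()
trie.add("<|myspecialtoken|>")
# Split at the boundaries of any token stored in the trie
print(trie.split("This is <|myspecialtoken|> in a sentence"))
# ['This is ', '<|myspecialtoken|>', ' in a sentence']
```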
To look at the other attributes/data structures stored, we'd need to move away from the parent class and actually go to the model-specific tokenizer. Here, this is `GPT2Tokenizer`. Some of the attributes are:
- `encoder`: The vocabulary, keeping token -> token_id mappings
- `decoder`: Inverse of the `encoder`, keeping token_id -> token mappings
- `bpe_ranks`: Mapping between a merge rule `token_1 token_2` and its priority/rank. Merges which happened earlier in training have a lower rank, and thus a higher priority, i.e. these merges should be applied first while tokenizing a string.
There are some more details here, but left for later. Let's quickly go over the summary for important methods first.
Okay, so what happens when you actually call `tokenizer(text)`? An example with `gpt2`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False) # get the slow tokenizer
print(tokenizer("The slow tokenizer")) # Output: {'input_ids': [464, 3105, 11241, 7509], 'attention_mask': [1, 1, 1, 1]}
```
You can see that the result is in fact a dictionary. `input_ids` are the token ids for the input sequence. If you decode the above sequence to get the actual tokens, you get `['The', ' slow', ' token', 'izer']`. Let's look at what happens inside the `__call__` method to get this result. The slow tokenizer class `PreTrainedTokenizer` derives the `__call__` method from the parent class `PreTrainedTokenizerBase`, in which `__call__` basically parses input arguments to make a call to the `encode_plus` function. HuggingFace tokenizers have two methods for encoding: `.encode()`, which gives you just a list of input_ids, and `.encode_plus()`, which returns a dictionary with some additional information (`attention_mask`, `token_type_ids` to mark sequence boundaries, etc.). The `encode_plus` implementation for the slow tokenizer (in reality, this is `_encode_plus`) is as follows:
- Normalize and pre-tokenize the input text. With GPT-2, the pre-tokenization involves breaking up the text on whitespace, contractions, punctuation, etc.
- Tokenize the input string/strings to get a list of tokens for each input string. This is handled by the `.tokenize()` method. (Segmentation)
- Convert tokens to token ids using the `.convert_tokens_to_ids()` method. (Numericalization)
- Send the token ids and other kwargs to `.prepare_for_model()`, which finally returns a dictionary with `attention_mask` and other keys if needed.
This is the simple explanation. There's one important detail though: when you have `added_tokens` or special tokens, there are no merge rules for these tokens! And you can't make up ad-hoc merge rules without messing up the tokenization of other strings. So we need to handle this in the pre-tokenization step: along with splitting on whitespace, punctuation, etc., we also split at the boundaries of `added_tokens` (that's exactly what the Trie from earlier is for).
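For example, after adding a new token, pre-tokenization splits it out as one piece and the BPE merges never touch it (a small sketch; the exact handling of surrounding whitespace depends on the `AddedToken` settings):

```python
tokenizer.add_tokens(["<|myspecialtoken|>"])
print(tokenizer.tokenize("This is <|myspecialtoken|>"))
# The added token survives as a single piece, e.g. ['This', 'Ġis', 'Ġ', '<|myspecialtoken|>']
```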
When you run `tok.decode(token_ids)`, there are three operations:
- Convert ids to tokens using the `id_to_token` mapping from `tok.bpe`.
- Join all the tokens.
- Replace the special unicode symbols from byte-level BPE with normal characters (e.g. Ġ back to a space).
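With the HF gpt2 tokenizer from earlier, the round trip looks like this:

```python
ids = [464, 3105, 11241, 7509]
print(tokenizer.convert_ids_to_tokens(ids))  # ['The', 'Ġslow', 'Ġtoken', 'izer']
print(tokenizer.decode(ids))                 # 'The slow tokenizer'
```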
This folder contains two `.py` files:
- `bpe.py`: Implements a simple `BPE` class that tokenizes a string according to GPT-2's byte-level BPE algorithm (a simple change to standard BPE).
- `minimal_hf_tok.py`: Implements `MySlowTokenizer`, a <100 line implementation of the basic features of HuggingFace's `GPT2Tokenizer` (the slow version).
Head over to walkthrough.ipynb for details on:
- Implementing the merging algorithm for `BPE`
- Implementing the different methods for encoding, decoding, added tokens, etc. in `MySlowTokenizer` to match `GPT2Tokenizer`.
In the next chapter, we'll be going over the challenges with tokenizing different types of data: numbers, other languages, etc.