Commit

Store/read csv instead of json, and clean up train.py
alexandermorgan committed Aug 5, 2024
1 parent 93cc4bf commit c3fe367
Showing 4 changed files with 229,129 additions and 96 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -11,13 +11,13 @@ data = 'tests/1GB_of_FineWeb-Edu_10B_sample_freq_cutoff_10.json'
tokenizer.train(data, 50304, verbose=True)
```

The example above runs in less than a minute on an old laptop, not bad for a pure python implementation! The purpose of this repo is to be easy to use and tinker with. I hope it helps people think up and quickly try out new tokenization approaches.
The example above runs in less than a minute on an old laptop, not bad for a pure python implementation! The purpose of this repo is to be easy to use and tinker with. I hope it helps people think up and quickly try out new tokenization approaches. More details on this below.

## Origin Story

This repo is a fork of Andrej Karpathy's excellent introduction to the BPE used for LLM tokenization. If you're new to the subject but haven't reviewed Karpathy's resources, definitely start there. He has a [2-hour video lecture](https://www.youtube.com/watch?v=zduSFxRajkE) (and [text version of lecture](https://github.com/karpathy/minbpe/blob/master/lecture.md)), accompanying [minbpe github repo](https://github.com/karpathy/minbpe), and [colab notebook](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbEtrVFZtbHhpLUtxWE5aeVNIaUlSNkhpWHdVUXxBQ3Jtc0tuWE9pbHBPZmF2anlYeTZfdTlVXzYyTmREeDNEejZMYnctNk96UnFuMjZBTUVHemkyWjdlWEhYSE56LUNsVFJrakNXeng3NEQxREkwLUFlQWpKa1JHd3JfX3k5dU5TVWFoQzNnWU9XY0lPUElUTUtydw&q=https%3A%2F%2Fsummer-heart-0930.chufeiyun1688.workers.dev%3A443%2Fhttps%2Fcolab.research.google.com%2Fdrive%2F1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L%3Fusp%3Dsharing&v=zduSFxRajkE). I highly recommend all three. Tokenization is deceptively simple, so a deep dive into the topic is definitely worth it even if you can understand the basics with a 60-second intro.
This repo is a fork of Andrej Karpathy's excellent introduction to the BPE used for LLM tokenization. If you're new to the subject but haven't reviewed Karpathy's resources, definitely start there. He has a [2-hour video lecture](https://www.youtube.com/watch?v=zduSFxRajkE) (and [text version of lecture](https://github.com/karpathy/minbpe/blob/master/lecture.md)), accompanying [minbpe github repo](https://github.com/karpathy/minbpe), and [colab notebook](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbEtrVFZtbHhpLUtxWE5aeVNIaUlSNkhpWHdVUXxBQ3Jtc0tuWE9pbHBPZmF2anlYeTZfdTlVXzYyTmREeDNEejZMYnctNk96UnFuMjZBTUVHemkyWjdlWEhYSE56LUNsVFJrakNXeng3NEQxREkwLUFlQWpKa1JHd3JfX3k5dU5TVWFoQzNnWU9XY0lPUElUTUtydw&q=https%3A%2F%2Fsummer-heart-0930.chufeiyun1688.workers.dev%3A443%2Fhttps%2Fcolab.research.google.com%2Fdrive%2F1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L%3Fusp%3Dsharing&v=zduSFxRajkE). Tokenization is deceptively simple, so a deep dive into the topic is definitely worth it even if you can understand the basics with a 60-second intro.

This BatchBPE repo began as a PR for Karpath's minbpe but developed to the point where the objective changed. Instead of minbpe's pedagogic purpose, BatchBPE aims to be as practical and easy to modify as possible. The goal is to make it easy for people to try out new tokenization ideas even if they're working with limited compute, memory, or hard-disk resources. A lot of making tokenization more accessible boils down to compute and memory optimizations. Using BatchBPE's fastest combination of settings (described below), you can train a GPT2-sized vocabulary (~50k tokens) on 1GB's worth of text in well under a minute on a 4-year old laptop. So yes, trying out new tokenization ideas can happen very quickly. Equally importantly, the repo is in entirely in python to make it easier for the greatest number of people to try out new ideas.
This BatchBPE repo began as a PR for Karpathy's minbpe but developed to the point where the objective changed. Instead of minbpe's pedagogic purpose, BatchBPE aims to be as practical and easy to modify as possible. The goal is to make it easy for people to try out new tokenization ideas even if they're working with limited compute, memory, or hard-disk resources. A lot of making tokenization more accessible boils down to compute and memory optimizations. Using BatchBPE's fastest combination of settings (described below), you can train a GPT2-sized vocabulary (~50k tokens) on 1 GB's worth of training text in well under a minute on an old laptop. So trying out new tokenization ideas can happen very quickly. Equally importantly, the repo is written entirely in python to make it easier for the greatest number of people to try out new ideas.

## Two Available Tokenizers

@@ -27,7 +27,7 @@ There are two Tokenizers in this repository, both of which can perform the 3 pri
1. [batchbpe/batch.py](batchbpe/batch.py): Implements the `BatchTokenizer` which includes a `train` method and the `get_stats` and `merge_batch` functions needed to train a new token vocabulary from input text. It inherits all the essentials from the `Tokenizer`.
2. [batchbpe/quick.py](batchbpe/quick.py): Implements the `QuickTokenizer` which is a small speed optimization of the `BatchTokenizer`. It runs ~8% faster by disregarding the issue of overcounting potential merges in sequences of repeated characters (e.g. "aaaaa" counts as only 2 possible "a"-"a" merges in the `BatchTokenizer`, but 4 in the `QuickTokenizer`), and by combining the `get_stats` and `merge_batch` functions into a single function. More importantly, the `QuickTokenizer` serves as a demonstration of how to implement your own new tokenizer that inherits from `Tokenizer` to test out a new tokenization approach or idea.
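
To make that counting difference concrete, here is a minimal sketch (not the repo's actual `get_stats` code; the helper names are made up) contrasting the two ways of counting candidate pairs in a run of repeated characters:

```python
def count_pairs_quick(ids):
    # QuickTokenizer-style counting: every adjacent pair counts, even when
    # occurrences overlap inside a run of identical ids
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def count_pairs_batch(ids):
    # BatchTokenizer-style counting: after counting a pair of identical ids,
    # skip past it, so a run only contributes the merges that could actually happen
    counts = {}
    i = 0
    while i < len(ids) - 1:
        pair = (ids[i], ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
        i += 2 if ids[i] == ids[i + 1] else 1
    return counts

ids = list("aaaaa")
print(count_pairs_quick(ids))  # {('a', 'a'): 4}
print(count_pairs_batch(ids))  # {('a', 'a'): 2}
```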

Finally, the script [train.py](train.py) trains the three major tokenizers on the input text [tests/taylorswift.txt](tests/taylorswift.txt) (this is the Wikipedia entry for her) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
Finally, the script [train.py](train.py) is a variant of the example above: it trains a BatchTokenizer and a QuickTokenizer, each with a 10K-token vocabulary. This script runs in under 50 seconds on my bottom-of-the-line Apple silicon laptop (2020 MacBook Air, M1, 8 GB memory).
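
For reference, here is a rough sketch of what such a script might look like (the import path, dataset filename, and `save` call are assumptions, not copied from train.py):

```python
from batchbpe import BatchTokenizer, QuickTokenizer  # assumes both classes are exported at the package level

# assumed csv variant of the compressed dataset used in the first example
data = 'tests/1GB_of_FineWeb-Edu_10B_sample_freq_cutoff_10.csv'

for TokenizerClass in (BatchTokenizer, QuickTokenizer):
    tokenizer = TokenizerClass()                 # default settings
    tokenizer.train(data, 10_000, verbose=True)  # 10K-token vocabulary
    tokenizer.save(TokenizerClass.__name__.lower())  # hypothetical: minbpe-style save, if present
```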

## Installation

@@ -135,7 +135,7 @@ tokenizer.train(path, 50304, max_batch_size=1)

### Compress Text into Dictionary Representation

Whatever dataset you pass gets compressed into a dictionary mapping the text chunks (done according to the regex pattern you use) to their counts in the dataset. This is a considerable compression of datasets that is possible given this approach to byte-level merging only within text chunks, not between them. If you like, you can download the entire FineWeb-Edu 10B sample in this dictionary format as a 2-column [csv file I put on HuggingFace](https://huggingface.co/datasets/alexandermorgan/FineWeb-Edu_10B_sample_2_column_word_counts/tree/main). This shrinks the dataset size from \~27 GB spread across 14 parquet files (which are already compressed) to one 240 MB csv file. This is a great speed optimization since you only process each text chunk in the dataset once per batch (weighted for its frequency count), and even more importantly it means that you can work with very large datasets on a basic laptop. As mentioned in above, you can use the `store_dict=True` parameter to save this dictionary representation of your passed dataset as a 2-column csv file.
Whatever dataset you pass gets compressed into a dictionary mapping text chunks (split according to the regex pattern you use) to their counts in the dataset. This is a considerable compression, and it is possible because byte-level merges only happen within text chunks, never across them. If you like, you can download the entire FineWeb-Edu 10B sample in this dictionary format as a 2-column [csv file I put on HuggingFace](https://huggingface.co/datasets/alexandermorgan/FineWeb-Edu_10B_sample_2_column_word_counts/tree/main). This shrinks the dataset from ~27 GB (50-60 GB uncompressed) spread across 14 parquet files to one 240 MB csv file. It is a great speed optimization since each unique text chunk is processed only once per batch (weighted by its frequency count), and even more importantly it means you can work with very large datasets on a basic laptop. You can use the `store_dict=True` parameter to save this dictionary representation of your dataset as a 2-column csv file. The last key-value pair will be the split pattern you used, stored with a count of 0 since the pattern itself does not appear in the dataset. The first example in this README is able to process "1 GB's worth" of text because it loads a csv version of the dataset in which words that appear fewer than 10 times were removed. The resulting csv file is only 3 MB and keeps only about 10% of the unique text chunks while still retaining about 99% of the total text chunk occurrences.
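
Here is a minimal sketch of the idea (illustrative only; the split pattern is assumed to be minbpe's GPT-4 pattern, and this is not the library's internal code): split the text into chunks, keep chunk-to-count pairs, and round-trip them through a 2-column csv whose last row stores the split pattern with a count of 0.

```python
import csv
from collections import Counter

import regex as re  # the `regex` module is needed for the possessive quantifiers in the pattern

# GPT-4-style split pattern as used in minbpe (assumed; substitute whichever pattern you train with)
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

text = "the cat sat on the mat, then the cat slept"
counts = Counter(re.findall(SPLIT_PATTERN, text))  # e.g. ' the' -> 2, ' cat' -> 2, 'the' -> 1, ...

# write the 2-column csv, mirroring the store_dict=True format described above
with open('dataset-dict.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text_chunk', 'count'])
    for chunk, n in counts.items():
        writer.writerow([chunk, n])
    writer.writerow([SPLIT_PATTERN, 0])  # last row: the split pattern with a count of 0

# read it back into a chunk -> count dictionary
with open('dataset-dict.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    restored = {chunk: int(n) for chunk, n in reader}

assert restored.popitem() == (SPLIT_PATTERN, 0)  # the sentinel row comes out last
```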

### Frequency Cutoff

30 changes: 17 additions & 13 deletions batchbpe/base.py
@@ -13,9 +13,8 @@
from joblib import Parallel, delayed, cpu_count
import time
import os
import json
import regex as re

import csv


# the main GPT text split patterns, see
@@ -135,9 +134,11 @@ def _import_data(self, data):
# convert to ChunkedArray, dict, or str of text to parse
if isinstance(item, Dataset):
item = item.data['text']
elif isinstance(item, str) and item.endswith('.json'): # json file from previous data load
elif isinstance(item, str) and item.endswith('.csv'): # csv file from previous data load
with open(item, 'r') as f:
item = json.load(f)
reader = csv.reader(f)
next(reader)
item = {k: int(v) for k, v in reader}
elif isinstance(item, str):
if item.startswith('https://') or item.startswith('http://'):
item = requests.get(item).text # if it's a url, assume it's to a text file
@@ -148,10 +148,10 @@ def _import_data(self, data):
if isinstance(item, dict):
last_item = item.popitem()
if last_item[1] != 0:
print(f'Warning: the json file passed does not seem to have been made by this tokenizer.')
print(f'Warning: the csv file or dictionary passed does not seem to have been made by this tokenizer.')
item[last_item[0]] = last_item[1]
elif last_item[0] != self.pattern:
print(f'Warning: the dictionary or json file passed did not use the same split pattern.')
print(f'Warning: the dictionary or csv file passed did not use the same split pattern.')
ids.update(item)
elif isinstance(item, str): # assume the string is the text itself
ids.update(re.findall(self.compiled_pattern, item))
for _dict in item:
ids.update(re.findall(self.compiled_pattern, _dict['text']))

if self.store_dict: # store dict compression of dataset to a json file if requested
if self.store_dict: # store dict compression of dataset to a csv file if requested
ids[self.pattern] = 0 # store the pattern used to split the text as the last key
formatted_time = time.strftime('%Y-%m-%d-%H:%M', time.localtime())
filename = f'{formatted_time}-dataset-dict.json'
formatted_time = time.strftime('%Y-%m-%d-%H_%M', time.localtime())
filename = f'{formatted_time}-dataset-dict.csv'
try:
with open(filename, 'w') as f:
json.dump(ids, f)
with open(filename, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['text_chunk', 'count'])
for key, value in ids.items():
writer.writerow([key, value])
print(f"Stored dictionary of {len(ids)} keys to {filename}")
except:
print('Failed to store dictionary of dataset, continuing on to training...')
print('Failed to store dictionary of dataset.')
del ids[self.pattern] # remove the pattern key from the ids dict

ids = self._id_dict_to_list(ids)
@@ -257,7 +261,7 @@ def load(self, model_file):
with open(model_file, 'r', encoding="utf-8") as f:
# read the version
version = f.readline().strip()
assert version == "minbpe v1"
assert version == "BatchBPE v1"
# read the pattern
self.pattern = f.readline().strip()
# read the special tokens
