Commit

Store/read csv instead of json, and clean up train.py
alexandermorgan committed Aug 5, 2024
1 parent 93cc4bf commit c3fe367
Showing 4 changed files with 229,129 additions and 96 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -11,13 +11,13 @@ data = 'tests/1GB_of_FineWeb-Edu_10B_sample_freq_cutoff_10.json'
tokenizer.train(data, 50304, verbose=True)
```

The example above runs in less than a minute on an old laptop, not bad for a pure python implementation! The purpose of this repo is to be easy to use and tinker with. I hope it helps people think up and quickly try out new tokenization approaches.
The example above runs in less than a minute on an old laptop, not bad for a pure python implementation! The purpose of this repo is to be easy to use and tinker with. I hope it helps people think up and quickly try out new tokenization approaches. More details on this below.

## Origin Story

This repo is a fork of Andrej Karpathy's excellent introduction to the BPE used for LLM tokenization. If you're new to the subject but haven't reviewed Karpathy's resources, definitely start there. He has a [2-hour video lecture](https://www.youtube.com/watch?v=zduSFxRajkE) (and [text version of lecture](https://github.com/karpathy/minbpe/blob/master/lecture.md)), accompanying [minbpe github repo](https://github.com/karpathy/minbpe), and [colab notebook](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbEtrVFZtbHhpLUtxWE5aeVNIaUlSNkhpWHdVUXxBQ3Jtc0tuWE9pbHBPZmF2anlYeTZfdTlVXzYyTmREeDNEejZMYnctNk96UnFuMjZBTUVHemkyWjdlWEhYSE56LUNsVFJrakNXeng3NEQxREkwLUFlQWpKa1JHd3JfX3k5dU5TVWFoQzNnWU9XY0lPUElUTUtydw&q=https%3A%2F%2Fsummer-heart-0930.chufeiyun1688.workers.dev%3A443%2Fhttps%2Fcolab.research.google.com%2Fdrive%2F1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L%3Fusp%3Dsharing&v=zduSFxRajkE). I highly recommend all three. Tokenization is deceptively simple, so a deep dive into the topic is definitely worth it even if you can understand the basics with a 60-second intro.
This repo is a fork of Andrej Karpathy's excellent introduction to the BPE used for LLM tokenization. If you're new to the subject but haven't reviewed Karpathy's resources, definitely start there. He has a [2-hour video lecture](https://www.youtube.com/watch?v=zduSFxRajkE) (and [text version of lecture](https://github.com/karpathy/minbpe/blob/master/lecture.md)), accompanying [minbpe github repo](https://github.com/karpathy/minbpe), and [colab notebook](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqbEtrVFZtbHhpLUtxWE5aeVNIaUlSNkhpWHdVUXxBQ3Jtc0tuWE9pbHBPZmF2anlYeTZfdTlVXzYyTmREeDNEejZMYnctNk96UnFuMjZBTUVHemkyWjdlWEhYSE56LUNsVFJrakNXeng3NEQxREkwLUFlQWpKa1JHd3JfX3k5dU5TVWFoQzNnWU9XY0lPUElUTUtydw&q=https%3A%2F%2Fsummer-heart-0930.chufeiyun1688.workers.dev%3A443%2Fhttps%2Fcolab.research.google.com%2Fdrive%2F1y0KnCFZvGVf_odSfcNAws6kcDD7HsI0L%3Fusp%3Dsharing&v=zduSFxRajkE). Tokenization is deceptively simple, so a deep dive into the topic is definitely worth it even if you can understand the basics with a 60-second intro.

This BatchBPE repo began as a PR for Karpath's minbpe but developed to the point where the objective changed. Instead of minbpe's pedagogic purpose, BatchBPE aims to be as practical and easy to modify as possible. The goal is to make it easy for people to try out new tokenization ideas even if they're working with limited compute, memory, or hard-disk resources. A lot of making tokenization more accessible boils down to compute and memory optimizations. Using BatchBPE's fastest combination of settings (described below), you can train a GPT2-sized vocabulary (~50k tokens) on 1GB's worth of text in well under a minute on a 4-year old laptop. So yes, trying out new tokenization ideas can happen very quickly. Equally importantly, the repo is in entirely in python to make it easier for the greatest number of people to try out new ideas.
This BatchBPE repo began as a PR for Karpathy's minbpe but developed to the point where the objective changed. Instead of minbpe's pedagogic purpose, BatchBPE aims to be as practical and easy to modify as possible. The goal is to make it easy for people to try out new tokenization ideas even if they're working with limited compute, memory, or hard-disk resources. A lot of making tokenization more accessible boils down to compute and memory optimizations. Using BatchBPE's fastest combination of settings (described below), you can train a GPT2-sized vocabulary (~50k tokens) on 1 GB's worth of training text in well under a minute on an old laptop. So trying out new tokenization ideas can happen very quickly. Equally importantly, the repo is written entirely in python to make it easier for the greatest number of people to try out new ideas.

## Two Available Tokenizers

@@ -27,7 +27,7 @@ There are two Tokenizers in this repository, both of which can perform the 3 pri
1. [batchbpe/batch.py](batchbpe/batch.py): Implements the `BatchTokenizer` which includes a `train` method and the `get_stats` and `merge_batch` functions needed to train a new token vocabulary from input text. It inherits all the essentials from the `Tokenizer`.
2. [batchbpe/quick.py](batchbpe/quick.py): Implements the `QuickTokenizer` which is a small speed optimization of the `BatchTokenizer`. It runs ~8% faster by disregarding the issue of overcounting potential merges in sequences of repeated characters (e.g. "aaaaa" counts as only 2 possible "a"-"a" merges in the `BatchTokenizer`, but 4 in the `QuickTokenizer`), and by combining the `get_stats` and `merge_batch` functions into a single function. More importantly, the `QuickTokenizer` serves as a demonstration of how to implement your own new tokenizer that inherits from `Tokenizer` to test out a new tokenization approach or idea.
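
To make that counting difference concrete, here is a minimal sketch (not the repo's actual `get_stats` code; the helper names are made up) contrasting the two ways of counting candidate pairs in a run of repeated characters:

```python
def count_pairs_quick(ids):
    # QuickTokenizer-style counting: every adjacent pair counts, even when
    # occurrences overlap inside a run of identical ids
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def count_pairs_batch(ids):
    # BatchTokenizer-style counting: after counting a pair of identical ids,
    # skip past it, so a run only contributes the merges that could actually happen
    counts = {}
    i = 0
    while i < len(ids) - 1:
        pair = (ids[i], ids[i + 1])
        counts[pair] = counts.get(pair, 0) + 1
        i += 2 if ids[i] == ids[i + 1] else 1
    return counts

ids = list("aaaaa")
print(count_pairs_quick(ids))  # {('a', 'a'): 4}
print(count_pairs_batch(ids))  # {('a', 'a'): 2}
```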

Finally, the script [train.py](train.py) trains the three major tokenizers on the input text [tests/taylorswift.txt](tests/taylorswift.txt) (this is the Wikipedia entry for her) and saves the vocab to disk for visualization. This script runs in about 25 seconds on my (M1) MacBook.
Finally, the script [train.py](train.py) is a variant of the example above: it trains a BatchTokenizer and a QuickTokenizer, each with a 10K-token vocabulary. This script runs in under 50 seconds on my bottom-of-the-line Apple silicon laptop (2020 MacBook Air, M1, 8 GB memory).
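
For reference, here is a rough sketch of what such a script might look like (the import path, dataset filename, and `save` call are assumptions, not copied from train.py):

```python
from batchbpe import BatchTokenizer, QuickTokenizer  # assumes both classes are exported at the package level

# assumed csv variant of the compressed dataset used in the first example
data = 'tests/1GB_of_FineWeb-Edu_10B_sample_freq_cutoff_10.csv'

for TokenizerClass in (BatchTokenizer, QuickTokenizer):
    tokenizer = TokenizerClass()                 # default settings
    tokenizer.train(data, 10_000, verbose=True)  # 10K-token vocabulary
    tokenizer.save(TokenizerClass.__name__.lower())  # hypothetical: minbpe-style save, if present
```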

## Installation

@@ -135,7 +135,7 @@ tokenizer.train(path, 50304, max_batch_size=1)

### Compress Text into Dictionary Representation

Whatever dataset you pass gets compressed into a dictionary mapping the text chunks (done according to the regex pattern you use) to their counts in the dataset. This is a considerable compression of datasets that is possible given this approach to byte-level merging only within text chunks, not between them. If you like, you can download the entire FineWeb-Edu 10B sample in this dictionary format as a 2-column [csv file I put on HuggingFace](https://huggingface.co/datasets/alexandermorgan/FineWeb-Edu_10B_sample_2_column_word_counts/tree/main). This shrinks the dataset size from \~27 GB spread across 14 parquet files (which are already compressed) to one 240 MB csv file. This is a great speed optimization since you only process each text chunk in the dataset once per batch (weighted for its frequency count), and even more importantly it means that you can work with very large datasets on a basic laptop. As mentioned in above, you can use the `store_dict=True` parameter to save this dictionary representation of your passed dataset as a 2-column csv file.
Whatever dataset you pass gets compressed into a dictionary mapping text chunks (split according to the regex pattern you use) to their counts in the dataset. This is a considerable compression, and it is possible because byte-level merges only happen within text chunks, never across them. If you like, you can download the entire FineWeb-Edu 10B sample in this dictionary format as a 2-column [csv file I put on HuggingFace](https://huggingface.co/datasets/alexandermorgan/FineWeb-Edu_10B_sample_2_column_word_counts/tree/main). This shrinks the dataset from ~27 GB (50-60 GB uncompressed) spread across 14 parquet files to one 240 MB csv file. It is a great speed optimization since each unique text chunk is processed only once per batch (weighted by its frequency count), and even more importantly it means you can work with very large datasets on a basic laptop. You can use the `store_dict=True` parameter to save this dictionary representation of your dataset as a 2-column csv file. The last key-value pair will be the split pattern you used, stored with a count of 0 since the pattern itself does not appear in the dataset. The first example in this README is able to process "1 GB's worth" of text because it loads a csv version of the dataset in which words that appear fewer than 10 times were removed. The resulting csv file is only 3 MB and keeps only about 10% of the unique text chunks while still retaining about 99% of the total text chunk occurrences.
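
Here is a minimal sketch of the idea (illustrative only; the split pattern is assumed to be minbpe's GPT-4 pattern, and this is not the library's internal code): split the text into chunks, keep chunk-to-count pairs, and round-trip them through a 2-column csv whose last row stores the split pattern with a count of 0.

```python
import csv
from collections import Counter

import regex as re  # the `regex` module is needed for the possessive quantifiers in the pattern

# GPT-4-style split pattern as used in minbpe (assumed; substitute whichever pattern you train with)
SPLIT_PATTERN = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""

text = "the cat sat on the mat, then the cat slept"
counts = Counter(re.findall(SPLIT_PATTERN, text))  # e.g. ' the' -> 2, ' cat' -> 2, 'the' -> 1, ...

# write the 2-column csv, mirroring the store_dict=True format described above
with open('dataset-dict.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['text_chunk', 'count'])
    for chunk, n in counts.items():
        writer.writerow([chunk, n])
    writer.writerow([SPLIT_PATTERN, 0])  # last row: the split pattern with a count of 0

# read it back into a chunk -> count dictionary
with open('dataset-dict.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    restored = {chunk: int(n) for chunk, n in reader}

assert restored.popitem() == (SPLIT_PATTERN, 0)  # the sentinel row comes out last
```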

### Frequency Cutoff

30 changes: 17 additions & 13 deletions batchbpe/base.py
@@ -13,9 +13,8 @@
from joblib import Parallel, delayed, cpu_count
import time
import os
import json
import regex as re

import csv


# the main GPT text split patterns, see
@@ -135,9 +134,11 @@ def _import_data(self, data):
# convert to ChunkedArray, dict, or str of text to parse
if isinstance(item, Dataset):
item = item.data['text']
elif isinstance(item, str) and item.endswith('.json'): # json file from previous data load
elif isinstance(item, str) and item.endswith('.csv'): # csv file from previous data load
with open(item, 'r') as f:
item = json.load(f)
reader = csv.reader(f)
next(reader)
item = {k: int(v) for k, v in reader}
elif isinstance(item, str):
if item.startswith('https://') or item.startswith('http://'):
item = requests.get(item).text # if it's a url, assume it's to a text file
@@ -148,10 +148,10 @@ def _import_data(self, data):
if isinstance(item, dict):
last_item = item.popitem()
if last_item[1] != 0:
print(f'Warning: the json file passed does not seem to have been made by this tokenizer.')
print(f'Warning: the csv file or dictionary passed does not seem to have been made by this tokenizer.')
item[last_item[0]] = last_item[1]
elif last_item[0] != self.pattern:
print(f'Warning: the dictionary or json file passed did not use the same split pattern.')
print(f'Warning: the dictionary or csv file passed did not use the same split pattern.')
ids.update(item)
elif isinstance(item, str): # assume the string is the text itself
ids.update(re.findall(self.compiled_pattern, item))
for _dict in item:
ids.update(re.findall(self.compiled_pattern, _dict['text']))

if self.store_dict: # store dict compression of dataset to a json file if requested
if self.store_dict: # store dict compression of dataset to a csv file if requested
ids[self.pattern] = 0 # store the pattern used to split the text as the last key
formatted_time = time.strftime('%Y-%m-%d-%H:%M', time.localtime())
filename = f'{formatted_time}-dataset-dict.json'
formatted_time = time.strftime('%Y-%m-%d-%H_%M', time.localtime())
filename = f'{formatted_time}-dataset-dict.csv'
try:
with open(filename, 'w') as f:
json.dump(ids, f)
with open(filename, 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['text_chunk', 'count'])
for key, value in ids.items():
writer.writerow([key, value])
print(f"Stored dictionary of {len(ids)} keys to {filename}")
except:
print('Failed to store dictionary of dataset, continuing on to training...')
print('Failed to store dictionary of dataset.')
del ids[self.pattern] # remove the pattern key from the ids dict

ids = self._id_dict_to_list(ids)
@@ -257,7 +261,7 @@ def load(self, model_file):
with open(model_file, 'r', encoding="utf-8") as f:
# read the version
version = f.readline().strip()
assert version == "minbpe v1"
assert version == "BatchBPE v1"
# read the pattern
self.pattern = f.readline().strip()
# read the special tokens
