
Python API with C extensions for faster training and encoding #85

@benarnav

Description

I've been working on a C implementation based on the GPT-2 paper: bytephase

It provides a Python API with C extensions that significantly speed up training and encoding. The project has a thorough README and complete docstrings for all class methods.
bytephase would be a good addition given the second point in the todo section: "write an even more optimized C or Rust version (think through)"

I'd like to add bytephase to the community extensions section of the README, encouraging more developers to review and contribute to this implementation and possibly build more features onto it (e.g., GPT-4 support, loading other pretrained tokenizers).

Example usage:

from bytephase import Tokenizer

# Initialize and train
tokenizer = Tokenizer()
# OR select a custom regex pattern, defaults to the GPT2 pattern
custom_pattern = r'\w+|\s+|[^\w\s]+'
tokenizer = Tokenizer(pattern=custom_pattern)

tokenizer.train("path/to/your_data.txt", vocab_size=10000)

# Encode
encoded = tokenizer.encode("Hello, world!")
# [1869, 574, 111, 44, 1560, 33]

# Decode
decoded = tokenizer.decode(encoded)
# "Hello, world!"

# Save and load
tokenizer.save("saved_tokenizer")
tokenizer.load("saved_tokenizer.bpe")
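For context on what the C extension is accelerating, here is a minimal pure-Python sketch of the BPE training loop (the repeated count-pairs-then-merge step that dominates training time). This is an illustrative sketch only, not bytephase's actual implementation; `train_bpe` and its return values are hypothetical names for this example.

```python
from collections import Counter

def train_bpe(data: bytes, num_merges: int):
    # Illustrative BPE sketch, not bytephase's real implementation.
    # Start from raw byte values (0..255); new tokens get ids from 256 up.
    ids = list(data)
    merges = {}
    next_id = 256
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair occurs more than once; stop early
        merges[(a, b)] = next_id
        # Replace every occurrence of the winning pair with the new id.
        out, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == (a, b):
                out.append(next_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
        next_id += 1
    return merges, ids

# 11 input bytes compress to 5 tokens after 3 merges.
merges, ids = train_bpe(b"aaabdaaabac", 3)
```

Each iteration re-counts all adjacent pairs and rewrites the whole sequence, which is exactly the kind of tight loop that benefits from being moved into C.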
