-
logos
Create ridiculously fast Lexers (usage sketch after this list)
-
tokenizers
today's most used tokenizers, with a focus on performance and versatility (usage sketch after this list)
-
xmlparser
Pull-based, zero-allocation XML parser
-
text-splitter
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.
-
charabia
Detect the language, tokenize the text, and normalize the tokens (usage sketch after this list)
-
html5gum
A WHATWG-compliant HTML5 tokenizer and tag soup parser
-
htmlparser
Pull-based, zero-allocation HTML parser
-
llm-tokenizer
LLM tokenizer library with caching and chat template support
-
lindera-tantivy
Lindera Tokenizer for Tantivy
-
scnr2
Scanner/Lexer with regex patterns and multiple modes
-
erl_tokenize
Erlang source code tokenizer
-
azul-simplecss
A very simple CSS 2.1 tokenizer with CSS nesting support
-
elyze
An extensible, general-purpose parser framework that can parse any type of data without allocation
-
vaporetto
Pointwise prediction-based tokenizer
-
normy
Ultra-fast, zero-copy text normalization for Rust NLP pipelines & tokenizers
-
bpe
Fast byte-pair encoding implementation
-
splintr
Fast Rust BPE tokenizer with Python bindings
-
octofhir-fhirpath-parser
Parser and tokenizer for FHIRPath expressions
-
tokstream-cli
CLI token stream simulator using Hugging Face tokenizers
-
logos-codegen
Create ridiculously fast Lexers
-
vibrato
Viterbi-based accelerated tokenizer
-
noa-parser
Noa parser is an extensible, general-purpose parser framework that can parse any type of data without allocation
-
libsimple
Rust bindings to simple, a SQLite3 fts5 tokenizer which supports Chinese and PinYin
-
smoltok-core
Byte-Pair Encoding tokenizer implementation in Rust
-
llm-utl
Convert code repositories into LLM-friendly prompts with smart chunking and filtering
-
wordchipper
HPC Rust LLM Tokenizer Library
-
vibrato-rkyv
Vibrato: Viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading
-
bbpe
Binary byte pair encoding (BPE) trainer and CLI compatible with Hugging Face tokenizers
-
bleuscore
A fast BLEU score calculator
-
scnr2_generate
Scanner/Lexer with regex patterns and multiple modes
-
lexerus
annotated lexer
-
tantivy-stemmers
A collection of Tantivy stemmer tokenizers
-
unobtanium-segmenter
A text segmentation toolbox for search applications inspired by charabia and tantivy
-
scnr
Scanner/Lexer with regex patterns and multiple modes
-
nlpo3
Thai natural language processing library, with Python and Node bindings
-
rust_tokenizers
High performance tokenizers for Rust
-
gemini-tokenizer
Authoritative Gemini tokenizer for Rust, ported from the official Google Python GenAI SDK
-
libsql-sqlite3-parser
SQL parser (as understood by SQLite) (libsql fork)
-
svgrtypes
SVG types parser
-
rust-forth-tokenizer
A Forth tokenizer written in Rust
-
tokstream-core
Core tokenizer streaming engine for tokstream
-
logos-cli
Create ridiculously fast Lexers
-
trustformers-tokenizers
Tokenizers for TrustformeRS
-
lexers
Tools for tokenizing and scanning
-
avila-tokenizers
The most complete tokenizer library in Rust - BPE, WordPiece, Unigram, with native support for GPT, BERT, Llama, Claude
-
bundle_repo
Pack a local or remote Git repository to XML for LLM consumption
-
mecab-ko
Korean morphological analyzer - a pure Rust implementation of MeCab-Ko
-
bpe-openai
Prebuilt fast byte-pair encoders for OpenAI
-
tokenx-rs
Fast token count estimation for LLMs at 96% accuracy without a full tokenizer
-
lindera-tokenizer
A morphological analysis library
-
language-tokenizer
Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, and Thai.
-
html5tokenizer
An HTML5 tokenizer with code span support
-
mipl
Minimal Imperative Parsing Library
-
tekken-rs
Mistral Tekken tokenizer with audio support
-
mecab-ko-core
Core engine for Korean morphological analysis - lattice, Viterbi, and tokenizer
-
segtok
Sentence segmentation and word tokenization tools
-
punkt
sentence tokenizer
-
alkale
LL(1) lexer library for Rust
-
fuzzy-pickles
A low-level parser of Rust source code with high-level visitor implementations
-
kitoken
Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
-
indent_tokenizer
Generate tokens based on indentation
-
vaporetto_rules
Rule-based filters for Vaporetto
-
unitoken
Fast BPE tokenizer/trainer with a Rust core and Python bindings
-
sql-script-parser
Iterates over SQL statements in a SQL script
-
tokenizers-enfer
today's most used tokenizers, with a focus on performance and versatility
-
tokenmonster
Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)
-
giron
ECMAScript parser which outputs ESTree JSON
-
code-splitter
Split code into semantic chunks using tree-sitter
-
blex
A lightweight lexing framework
-
tantivy-czech-stemmer
Czech stemmer as Tantivy tokenizer
-
tokenizer
Thai text tokenizer
-
punkt_n
Punkt sentence tokenizer
-
javascript_lexer
JavaScript lexer
-
elizaos-plugin-local-embedding
Local text embedding and tokenization plugin for elizaOS - Rust implementation
-
tinytoken
Tokenizes text into words, numbers, symbols, and more, with customizable parsing options
-
regex-lexer
A regex-based lexer (tokenizer)
-
lexariel
Lexical analyzer for the Asmodeus language
-
smoltoken
A fast library for Byte Pair Encoding (BPE) tokenization
-
tele_tokenizer
A CSS tokenizer
-
tokengeex
efficient tokenizer for code based on UnigramLM and TokenMonster
-
uscan
A universal source code scanner
-
regex-lexer-lalrpop
A regex-based lexer (tokenizer)
-
char-lex
Create easy enum-based lexers
-
aleph-alpha-tokenizer
A fast implementation of a WordPiece-inspired tokenizer
-
xxcalc
Embeddable or standalone robust floating-point polynomial calculator
-
pgn-lexer
A lexer for chess PGN files. Provides an iterator over the tokens from a byte stream.
-
tokenise
A flexible tokeniser library for parsing text
-
simple-cursor
A super simple character cursor implementation geared towards lexers/tokenizers
-
nipah_tokenizer
A powerful yet simple text tokenizer for your everyday needs!
-
tokeneer
tokenizer crate
-
json-parser
JSON parser
-
blingfire
Wrapper for the BlingFire tokenization library
-
basic_lexer
Basic lexical analyzer for parsing and compiling
-
regex-tokenizer
A regex tokenizer
-
pretok
A string pre-tokenizer for C-like syntaxes
-
scanny
An advanced text scanning library for Rust
-
sana
Create lexers easily
-
gpt_tokenizer
Rust BPE Encoder Decoder (Tokenizer) for GPT-2 / GPT-3
-
rustpotion
Blazingly fast word embeddings with Tokenlearn
-
token
String tokenizer (and sentence splitter). Note: if you would like to use this name for something more appropriate, please send a mail to jaln at itu dot dk.
-
tuker
A small tokenizer/parser library with an emphasis on usability
-
scnr2_macro
Scanner/Lexer with regex patterns and multiple modes
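Below are minimal usage sketches for a few of the crates listed above. First, logos: defining a small lexer with the derive macro, assuming logos 0.13 or later (where the iterator yields `Result` items); the token set and the input string are made up for illustration.

```rust
use logos::Logos;

// A hypothetical token set; real grammars define their own variants.
#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\r\n]+")] // skip whitespace between tokens
enum Token {
    #[token("let")]
    Let,
    #[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
    Ident,
    #[regex(r"[0-9]+")]
    Number,
}

fn main() {
    let mut lex = Token::lexer("let answer 42");
    // Each item is Ok(Token) or Err(()) for unrecognized input.
    while let Some(result) = lex.next() {
        println!("{:?} -> {:?}", result, lex.slice());
    }
}
```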
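Next, the Hugging Face tokenizers crate: loading a serialized tokenizer and encoding a string. This assumes a tokenizer.json exported from a model is available locally; the path is an assumption.

```rust
use tokenizers::Tokenizer;

fn main() -> tokenizers::Result<()> {
    // Load a serialized tokenizer definition (path is hypothetical).
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode without adding special tokens.
    let encoding = tokenizer.encode("Hello, world!", false)?;
    println!("tokens: {:?}", encoding.get_tokens());
    println!("ids:    {:?}", encoding.get_ids());
    Ok(())
}
```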
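Finally, charabia (the tokenization library used by Meilisearch): a sketch of segmenting and normalizing a string via the `Tokenize` extension trait, with default features enabled.

```rust
use charabia::Tokenize;

fn main() {
    let text = "The quick (\"brown\") fox can't jump 32.3 feet, right?";
    // tokenize() detects the script/language, segments, and normalizes;
    // lemma() returns the normalized form of each token.
    for token in text.tokenize() {
        println!("{:?}", token.lemma());
    }
}
```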