-
tokenizers
today's most used tokenizers, with a focus on performance and versatility
-
svgtypes
SVG types parser
-
xmlparser
Pull-based, zero-allocation XML parser
-
markdown
CommonMark compliant markdown parser in Rust with ASTs and extensions
-
charabia
detect the language, tokenize the text and normalize the tokens
-
html5gum
A WHATWG-compliant HTML5 tokenizer and tag soup parser
-
sqlite3-parser
SQL parser (as understood by SQLite)
-
htmlparser
Pull-based, zero-allocation HTML parser
-
styx-tokenizer
Tokenizer for the Styx configuration language
-
llm-tokenizer
LLM tokenizer library with caching and chat template support
-
lindera-tantivy
Lindera Tokenizer for Tantivy
-
sentencepiece
Binding for the sentencepiece tokenizer
-
azul-simplecss
A very simple CSS 2.1 tokenizer with CSS nesting support
-
vaporetto
pointwise prediction based tokenizer
-
erl_tokenize
Erlang source code tokenizer
-
bpe
Fast byte-pair encoding implementation
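Byte-pair encoding, which `bpe` (and later entries such as `quicktok` and `rustbpe`) implements, is easy to sketch: repeatedly count adjacent symbol pairs and merge the most frequent one into a new symbol. A minimal illustration in plain Rust — `merge_most_frequent` is a hypothetical helper for this sketch, not the API of any crate listed here:

```rust
use std::collections::HashMap;

/// One BPE training step: find the most frequent adjacent pair in `seq`
/// and merge every non-overlapping occurrence into the symbol `next_id`.
fn merge_most_frequent(seq: &[u32], next_id: u32) -> (Vec<u32>, Option<(u32, u32)>) {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in seq.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    // Most frequent pair; ties broken toward the smaller pair for determinism.
    let best = counts
        .into_iter()
        .max_by_key(|&(pair, count)| (count, std::cmp::Reverse(pair)));
    let Some((pair, count)) = best else { return (seq.to_vec(), None) };
    if count < 2 {
        return (seq.to_vec(), None); // nothing worth merging
    }
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(next_id); // merged symbol
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    (out, Some(pair))
}

fn main() {
    // "aaabdaaabac" as byte ids; the most frequent pair is (a, a).
    let seq: Vec<u32> = "aaabdaaabac".bytes().map(|b| b as u32).collect();
    let (merged, pair) = merge_most_frequent(&seq, 256);
    assert_eq!(pair, Some((97, 97))); // 'a' = 97
    assert_eq!(merged.len(), 9);      // 11 symbols, two "aa" pairs collapsed
}
```

Real trainers iterate this step until a target vocabulary size is reached, and keep the merge order as the tokenizer's merge table.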
-
momoa
A JSON parsing library suitable for static analysis
-
octofhir-fhirpath-parser
Parser and tokenizer for FHIRPath expressions
-
splintr
Fast Rust BPE tokenizer with Python bindings
-
vibrato
viterbi-based accelerated tokenizer
-
tokstream-cli
CLI token stream simulator using Hugging Face tokenizers
-
toktrie_hf_tokenizers
HuggingFace tokenizers library support for toktrie and llguidance
-
pred-recdec
Predicated Recursive Descent Parsing with BNF and impure hooks
-
rwkv-tokenizer
A fast RWKV Tokenizer
-
kanpyo-dict
Dictionary Library for Kanpyo
-
huggingface/tokenizers-python
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
-
vibrato-rkyv
Vibrato: viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading
-
pinyinchch
A utility library for converting pinyin to Chinese characters
-
bleuscore
A fast bleu score calculator
-
lindera-python
A morphological analysis library and command-line interface
-
unscanny
Painless string scanning
-
jayce
tokenizer 🌌
-
toak-ocr
OCR engine with Apple Vision framework support for macOS
-
tergo-formatter
Formatter for tergo
-
tergo-parser
Parser for tergo
-
tergo-tokenizer
R language tokenizer
-
europa
A lightweight AI utilities library for Rust
-
unobtanium-segmenter
A text segmentation toolbox for search applications inspired by charabia and tantivy
-
go-brrr
Token-efficient code analysis for LLMs - Rust implementation
-
nlpo3
Thai natural language processing library, with Python and Node bindings
-
sqlite-simple-tokenizer
A run-time loadable SQLite FTS5 extension that supports Chinese and pinyin word segmentation and search
-
rust_tokenizers
High performance tokenizers for Rust
-
mkcontext
Provides functionality for creating context
-
libsql-sqlite3-parser
SQL parser (as understood by SQLite) (libsql fork)
-
cuttle
A large language model inference engine in Rust
-
turso_sqlite3_parser
SQL parser (as understood by SQLite)
-
tokstream-core
Core tokenizer streaming engine for tokstream
-
tantivy-tokenizer-api
Tokenizer API of tantivy
-
text-tokenizer
Custom text tokenizer
-
rust-forth-tokenizer
A Forth tokenizer written in Rust
-
trustformers-tokenizers
Tokenizers for TrustformeRS
-
scanlex
lexical scanner for parsing text into tokens
-
avila-tokenizers
The most complete tokenizer library in Rust - BPE, WordPiece, Unigram, with native support for GPT, BERT, Llama, Claude
-
syn_derive
Derive macros for syn::Parse and quote::ToTokens
-
lexers
Tools for tokenizing and scanning
-
sqlite-jieba-tokenizer
A run-time loadable SQLite FTS5 extension that supports Chinese and English word segmentation and search
-
tokenx-rs
Fast token count estimation for LLMs at 96% accuracy without a full tokenizer
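Token-count estimation without a full tokenizer usually rests on heuristics such as "roughly four characters per token" for English prose. How `tokenx-rs` reaches its claimed 96% accuracy is not described here; the sketch below shows only the folk heuristic, with a hypothetical function name:

```rust
/// Very rough token estimate: ~4 characters per token for prose, with a
/// floor of one token per whitespace-separated word. A folk heuristic for
/// illustration only -- NOT the algorithm tokenx-rs uses.
fn estimate_tokens(text: &str) -> usize {
    let chars = text.chars().count();
    let words = text.split_whitespace().count();
    (chars / 4).max(words)
}

fn main() {
    // 44 chars / 4 = 11; 9 words -> estimate is 11.
    let n = estimate_tokens("The quick brown fox jumps over the lazy dog.");
    assert_eq!(n, 11);
}
```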
-
udled
Tokenizer and parser
-
sqlite-charabia-tokenizer
A run-time loadable SQLite FTS5 extension that supports Chinese and English word segmentation and search
-
limbo_sqlite3_parser
SQL parser (as understood by SQLite)
-
flat-cli
Flatten codebases into AI-friendly format
-
lindera-tokenizer
A morphological analysis library
-
language-tokenizer
Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, Thai etc.
-
makepad-live-tokenizer
Makepad platform live DSL tokenizer
-
rustbpe
A BPE (Byte Pair Encoding) tokenizer written in Rust with Python bindings
-
roketok
A simple way to set up and use a tokenizer. Not recommended for simple tokenizers, as this crate adds a lot of machinery to support many, if not all, kinds of tokenizers
-
axonml-text
Text processing utilities for the Axonml ML framework
-
toktrie_hf_downloader
HuggingFace Hub download library support for toktrie and llguidance
-
mipl
Minimal Imperative Parsing Library
-
tekken-rs
Mistral Tekken tokenizer with audio support
-
mecab-ko-core
Korean morphological analysis core engine: lattice, Viterbi, tokenizer
-
quicktok
Minimal, fast, multi-threaded implementation of the Byte Pair Encoding (BPE) for LLM tokenization
-
token-dict
basic dictionary based tokenization
-
punkt
sentence tokenizer
-
nenyr
The initial version of the Nenyr parser delivers robust foundational capabilities for interpreting Nenyr syntax. It intelligently processes central, layout, and module contexts, handling complex variable…
-
marukov
markov chain text generator
-
segtok
Sentence segmentation and word tokenization tools
-
toktrie_tiktoken
HuggingFace tokenizers library support for toktrie and llguidance
-
rusqlite-ext
Rusqlite extension for building the FTS5 tokenizer
-
fuzzy-pickles
A low-level parser of Rust source code with high-level visitor implementations
-
divvunspell
Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support
-
notmecab
Tokenizes text with MeCab dictionaries. Not a MeCab wrapper.
-
derive-finite-automaton
Procedural macro for generating finite automata
-
indent_tokenizer
Generate tokens based on indentation
-
s-expression
S-expression parser
-
lang_pt
A parser tool for generating recursive descent top-down parsers
-
langbox
A framework for building compilers and interpreters
-
kohaku
tokenizer
-
tokenizer-lib
Tokenization utilities for building parsers in Rust
-
unitoken
Fast BPE tokenizer/trainer with a Rust core and Python bindings
-
kitoken
Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
-
pinyinchch-type
A utility library for converting pinyin to Chinese characters
-
lens
Unified Lens query language
-
bpe-tokenizer
A BPE Tokenizer library
-
ragit-korean
korean tokenizer for ragit
-
divvunspell-bin
Spellchecker for ZHFST/BHFST spellers, with case handling and tokenization support
-
vaporetto_rules
Rule-base filters for Vaporetto
-
vaporetto_tantivy
Vaporetto Tokenizer for Tantivy
-
rten-text
Text tokenization and other ML pre/post-processing functions
-
chinese_segmenter
Tokenize Chinese sentences using a dictionary-driven largest first matching approach
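Largest-first (greedy longest-match) dictionary segmentation, as described for `chinese_segmenter`, can be sketched in a few lines: at each position, take the longest dictionary entry that matches, falling back to a single character. The `segment` function below is a hypothetical illustration, not the crate's API:

```rust
use std::collections::HashSet;

/// Greedy largest-first matching: at each position, take the longest
/// dictionary word that matches; fall back to a single char otherwise.
fn segment(text: &str, dict: &HashSet<&str>, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let upper = (chars.len() - i).min(max_len);
        let mut matched = None;
        // Try the longest candidate first, shrinking until a word matches.
        for len in (1..=upper).rev() {
            let cand: String = chars[i..i + len].iter().collect();
            if dict.contains(cand.as_str()) {
                matched = Some((cand, len));
                break;
            }
        }
        match matched {
            Some((word, len)) => { out.push(word); i += len; }
            None => { out.push(chars[i].to_string()); i += 1; }
        }
    }
    out
}

fn main() {
    let dict: HashSet<&str> = ["中国", "人民", "中", "国"].into_iter().collect();
    let segs = segment("中国人民", &dict, 4);
    assert_eq!(segs, vec!["中国", "人民"]); // longest matches win over single chars
}
```

The well-known weakness of this approach is that a greedy longest match can commit to the wrong word early; lattice/Viterbi tokenizers like those in several entries above exist to avoid exactly that.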
-
palate_polyglot_tokenizer
A generic programming language tokenizer
-
tokenizers-enfer
today's most used tokenizers, with a focus on performance and versatility
-
tokenmonster
Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)
-
sentencepiece-sys
Binding for the sentencepiece tokenizer
-
mt_mtc
Tokenizer and parser for the Minot language
-
skimmer
streams reader
-
ellie_tokenizer
Tokenizer for ellie language
-
irg-kvariants
wrapper around kvariant from hfhchan/irg
-
vtext
NLP with Rust
-
instant-clip-tokenizer
Fast text tokenizer for the CLIP neural network
-
alith-models
Load and Download LLM Models, Metadata, and Tokenizers
-
rtf-grimoire
A Rich Text File (RTF) document tokenizer. Useful for writing RTF parsers.
-
tokenizer
Thai text tokenizer
-
izihawa-tantivy-tokenizer-api
Tokenizer API of tantivy
-
wordpieces
Split tokens into word pieces
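WordPiece splitting, as used by BERT-style models, greedily takes the longest vocabulary piece from the front of a word and marks non-initial pieces with a `##` prefix. A sketch under those assumptions — `wordpiece` here is a hypothetical function, not the `wordpieces` crate's API:

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece split: repeatedly take the longest
/// vocabulary piece from the front; continuation pieces carry "##".
/// Returns None if the word cannot be covered by the vocabulary.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let mut found = None;
        while end > start {
            // Only slice on char boundaries (relevant for non-ASCII input).
            if word.is_char_boundary(end) {
                let mut cand = word[start..end].to_string();
                if start > 0 { cand = format!("##{cand}"); }
                if vocab.contains(cand.as_str()) {
                    found = Some((cand, end));
                    break;
                }
            }
            end -= 1;
        }
        let (piece, next) = found?; // no piece matched -> unknown word
        pieces.push(piece);
        start = next;
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<&str> = ["un", "##aff", "##able", "##a"].into_iter().collect();
    assert_eq!(
        wordpiece("unaffable", &vocab),
        Some(vec!["un".into(), "##aff".into(), "##able".into()])
    );
    assert_eq!(wordpiece("xyz", &vocab), None); // would map to [UNK] in practice
}
```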
-
punkt_n
Punkt sentence tokenizer
-
daft-functions-tokenize
Tokenization functions for the Daft project
-
mako
main Sidekick AI data processing library
-
gtars-tokenizers
Genomic region tokenizers for machine learning in Rust
-
cang-jie
A Chinese tokenizer for tantivy
-
rustpostal
Rust bindings to libpostal
-
sixel-tokenizer
A tokenizer for serialized Sixel bytes
-
toresy
term rewriting system based on tokenization
-
specmc-base
common code for parsing Minecraft specification
-
tinytoken
tokenizing text into words, numbers, symbols, and more, with customizable parsing options
-
sentencepiece-model
SentencePiece model parser generated from the SentencePiece protobuf definition
-
simple-tokenizer
A tiny no_std tokenizer with line & column tracking
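Line-and-column tracking of the kind `simple-tokenizer` advertises amounts to bumping a counter per character and resetting the column on newlines. A tiny sketch with hypothetical types, not the crate's API:

```rust
/// 1-based line and column position, advanced one char at a time,
/// as a tokenizer reporting error locations might do.
struct Pos { line: usize, col: usize }

fn advance(pos: &mut Pos, ch: char) {
    if ch == '\n' {
        pos.line += 1;
        pos.col = 1; // new line starts at column 1
    } else {
        pos.col += 1;
    }
}

fn main() {
    let mut pos = Pos { line: 1, col: 1 };
    for ch in "ab\ncd".chars() {
        advance(&mut pos, ch);
    }
    assert_eq!((pos.line, pos.col), (2, 3)); // past 'd' on the second line
}
```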
-
yes-lang
Scripting Language
-
rust_transformers
High performance tokenizers for Rust
-
alpino-tokenizer
Wrapper around the Alpino tokenizer for Dutch
-
syntaxdot-tokenizers
Subword tokenizers
-
neca-cmd
command tokenizer used by my Twitch chat bot
-
boost_tokenizer
Boost C++ library boost_tokenizer packaged using Zanbil
-
procedural-masquarade
Incorrect spelling for procedural-masquerade
-
tele_tokenizer
A CSS tokenizer
-
tiniestsegmenter
Compact Japanese segmenter
-
ccgen
generate manually maintained C (and C++) headers
-
tokengeex
efficient tokenizer for code based on UnigramLM and TokenMonster
-
data_vault
Data Vault is a modular, pragmatic, credit card vault for Rust
-
paltoquet
rule-based general-purpose tokenizers
-
uscan
A universal source code scanner
-
sentence
tokenizes English language sentences for use in TTS applications
-
earl-lang-syntax
tokenizer and parser for the language Earl
-
tinysegmenter
Compact Japanese tokenizer
-
aleph-alpha-tokenizer
A fast implementation of a wordpiece-inspired tokenizer
-
caddyfile
working with Caddy's Caddyfile format
-
castle_tokenizer
Castle Tokenizer: tokenizer
-
strizer
minimal and fast library for text tokenization
-
colorblast
Syntax highlighting library for various programming languages, markup languages and various other formats
-
regex-bnf
A deterministic parser for a BNF inspired syntax with regular expressions
-
indentation_flattener
From indented input, generate plain output with indentation PUSH and POP codes
-
nipah_tokenizer
A powerful yet simple text tokenizer for your everyday needs!
-
json-parser
JSON parser
-
gtokenizers
tokenizing genomic data with an emphasis on region set data
-
basic_lexer
Basic lexical analyzer for parsing and compiling
-
tokeneer
tokenizer crate
-
xtoken
Iterator based no_std XML Tokenizer using memchr
-
blingfire
Wrapper for the BlingFire tokenization library
-
rust-lexer
A compiler that generates a Lexer using DFAs (inspired by flex)
-
llm_models
Load and download LLM models, metadata, and tokenizers
-
reflex
A minimal flex-like lexer
-
morsels_lang_ascii
Basic ASCII tokenizer for morsels
-
regex-tokenizer
A regex tokenizer
-
alpino-tokenize
Wrapper around the Alpino tokenizer for Dutch
-
pretok
A string pre-tokenizer for C-like syntaxes
-
scanny
An advanced text scanning library for Rust
-
crossandra
A straightforward tokenization library for seamless text processing
-
morsels_lang_chinese
Chinese tokenizer for morsels
-
text-scanner
A UTF-8 char-oriented, zero-copy, text and code scanning library
-
quote-data
A tokenization Library for Rust
-
tokenate
Does some of the grunt work of writing a tokenizer
-
tokenize_dir
Tokenize file names in directories to access files in a composable way
-
liendl_tokenizer
BPE tokenizer for Rust
-
saku
efficient rule-based Japanese Sentence Tokenizer
-
sylt-tokenizer
Tokenizer for the Sylt programming language
-
tkn-cli
TKN: Quick Tokenizing in the terminal
-
token-iter
Simplifies writing tokenizers
-
any-lexer
Lexers for various programming languages and formats
-
rustpotion
Blazingly fast word embeddings with Tokenlearn
-
rs_html_parser_tokenizer
Rs Html Parser Tokenizer
-
hemtt-tokens
A token library for hemtt
-
tocken
Clustering algorithms
-
khmercut
A blazingly fast Khmer word segmentation tool written in Rust
-
rs_html_parser
Rs Html Parser
-
polyglot_tokenizer
A generic programming language tokenizer
-
brack-tokenizer
The tokenizer for the Brack programming language
-
cssparser-macros
Procedural macros for cssparser