-
tokenizers
today's most used tokenizers, with a focus on performance and versatility
-
svgtypes
SVG types parser
-
xmlparser
Pull-based, zero-allocation XML parser
-
markdown
CommonMark compliant markdown parser in Rust with ASTs and extensions
-
charabia
detect the language, tokenize the text and normalize the tokens
-
html5gum
A WHATWG-compliant HTML5 tokenizer and tag soup parser
-
sqlite3-parser
SQL parser (as understood by SQLite)
-
htmlparser
Pull-based, zero-allocation HTML parser
-
styx-tokenizer
Tokenizer for the Styx configuration language
-
llm-tokenizer
LLM tokenizer library with caching and chat template support
-
lindera-tantivy
Lindera Tokenizer for Tantivy
-
sentencepiece
Binding for the sentencepiece tokenizer
-
azul-simplecss
A very simple CSS 2.1 tokenizer with CSS nesting support
-
vaporetto
pointwise prediction based tokenizer
-
erl_tokenize
Erlang source code tokenizer
-
bpe
Fast byte-pair encoding implementation
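Byte-pair encoding, which `bpe` (and later entries such as `quicktok` and `rustbpe`) implements, is easy to sketch: repeatedly count adjacent symbol pairs and merge the most frequent one into a new symbol. A minimal illustration in plain Rust — `merge_most_frequent` is a hypothetical helper for this sketch, not the API of any crate listed here:

```rust
use std::collections::HashMap;

/// One BPE training step: find the most frequent adjacent pair in `seq`
/// and merge every non-overlapping occurrence into the symbol `next_id`.
fn merge_most_frequent(seq: &[u32], next_id: u32) -> (Vec<u32>, Option<(u32, u32)>) {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in seq.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    // Most frequent pair; ties broken toward the smaller pair for determinism.
    let best = counts
        .into_iter()
        .max_by_key(|&(pair, count)| (count, std::cmp::Reverse(pair)));
    let Some((pair, count)) = best else { return (seq.to_vec(), None) };
    if count < 2 {
        return (seq.to_vec(), None); // nothing worth merging
    }
    let mut out = Vec::with_capacity(seq.len());
    let mut i = 0;
    while i < seq.len() {
        if i + 1 < seq.len() && (seq[i], seq[i + 1]) == pair {
            out.push(next_id); // merged symbol
            i += 2;
        } else {
            out.push(seq[i]);
            i += 1;
        }
    }
    (out, Some(pair))
}

fn main() {
    // "aaabdaaabac" as byte ids; the most frequent pair is (a, a).
    let seq: Vec<u32> = "aaabdaaabac".bytes().map(|b| b as u32).collect();
    let (merged, pair) = merge_most_frequent(&seq, 256);
    assert_eq!(pair, Some((97, 97))); // 'a' = 97
    assert_eq!(merged.len(), 9);      // 11 symbols, two "aa" pairs collapsed
}
```

Real trainers iterate this step until a target vocabulary size is reached, and keep the merge order as the tokenizer's merge table.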
-
momoa
A JSON parsing library suitable for static analysis
-
octofhir-fhirpath-parser
Parser and tokenizer for FHIRPath expressions
-
splintr
Fast Rust BPE tokenizer with Python bindings
-
vibrato
viterbi-based accelerated tokenizer
-
tokstream-cli
CLI token stream simulator using Hugging Face tokenizers
-
toktrie_hf_tokenizers
HuggingFace tokenizers library support for toktrie and llguidance
-
pred-recdec
Predicated Recursive Descent Parsing with BNF and impure hooks
-
rwkv-tokenizer
A fast RWKV Tokenizer
-
kanpyo-dict
Dictionary Library for Kanpyo
-
huggingface/tokenizers-python
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
-
vibrato-rkyv
Vibrato: viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading
-
pinyinchch
A utility library for converting pinyin to Chinese characters
-
bleuscore
A fast bleu score calculator
-
lindera-python
A morphological analysis library and command-line interface
-
unscanny
Painless string scanning
-
jayce
tokenizer 🌌
-
toak-ocr
OCR engine with Apple Vision framework support for macOS
-
tergo-formatter
Formatter for tergo
-
tergo-parser
Parser for tergo
-
tergo-tokenizer
R language tokenizer
-
europa
A lightweight AI utilities library for Rust
-
unobtanium-segmenter
A text segmentation toolbox for search applications inspired by charabia and tantivy
-
go-brrr
Token-efficient code analysis for LLMs - Rust implementation
-
nlpo3
Thai natural language processing library, with Python and Node bindings
-
sqlite-simple-tokenizer
A run-time loadable SQLite FTS5 extension that supports Chinese and pinyin word segmentation and search
-
rust_tokenizers
High performance tokenizers for Rust
-
mkcontext
Provides functionality for creating context
-
libsql-sqlite3-parser
SQL parser (as understood by SQLite) (libsql fork)
-
cuttle
A large language model inference engine in Rust
-
turso_sqlite3_parser
SQL parser (as understood by SQLite)
-
tokstream-core
Core tokenizer streaming engine for tokstream
-
tantivy-tokenizer-api
Tokenizer API of tantivy
-
text-tokenizer
Custom text tokenizer
-
rust-forth-tokenizer
A Forth tokenizer written in Rust
-
trustformers-tokenizers
Tokenizers for TrustformeRS
-
scanlex
lexical scanner for parsing text into tokens
-
avila-tokenizers
The most complete tokenizer library in Rust - BPE, WordPiece, Unigram, with native support for GPT, BERT, Llama, Claude
-
syn_derive
Derive macros for syn::Parse and quote::ToTokens
-
lexers
Tools for tokenizing and scanning
-
sqlite-jieba-tokenizer
A run-time loadable SQLite FTS5 extension that supports Chinese and English word segmentation and search
-
tokenx-rs
Fast token count estimation for LLMs at 96% accuracy without a full tokenizer
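Token-count estimation without a full tokenizer usually rests on heuristics such as "roughly four characters per token" for English prose. How `tokenx-rs` reaches its claimed 96% accuracy is not described here; the sketch below shows only the folk heuristic, with a hypothetical function name:

```rust
/// Very rough token estimate: ~4 characters per token for prose, with a
/// floor of one token per whitespace-separated word. A folk heuristic for
/// illustration only -- NOT the algorithm tokenx-rs uses.
fn estimate_tokens(text: &str) -> usize {
    let chars = text.chars().count();
    let words = text.split_whitespace().count();
    (chars / 4).max(words)
}

fn main() {
    // 44 chars / 4 = 11; 9 words -> estimate is 11.
    let n = estimate_tokens("The quick brown fox jumps over the lazy dog.");
    assert_eq!(n, 11);
}
```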
-
udled
Tokenizer and parser
-
sqlite-charabia-tokenizer
A run-time loadable SQLite FTS5 extension that supports Chinese and English word segmentation and search
-
limbo_sqlite3_parser
SQL parser (as understood by SQLite)
-
flat-cli
Flatten codebases into AI-friendly format
-
lindera-tokenizer
A morphological analysis library
-
language-tokenizer
Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, Thai etc.
-
makepad-live-tokenizer
Makepad platform live DSL tokenizer
-
rustbpe
A BPE (Byte Pair Encoding) tokenizer written in Rust with Python bindings
-
roketok
A simple way to set up and use a tokenizer. Not recommended for simple tokenizers, as this crate adds a lot of machinery to support many, if not all, kinds of tokenizers
-
axonml-text
Text processing utilities for the Axonml ML framework
-
toktrie_hf_downloader
HuggingFace Hub download library support for toktrie and llguidance
-
mipl
Minimal Imperative Parsing Library
-
tekken-rs
Mistral Tekken tokenizer with audio support
-
mecab-ko-core
Korean morphological analysis core engine: lattice, Viterbi, tokenizer
-
quicktok
Minimal, fast, multi-threaded implementation of the Byte Pair Encoding (BPE) for LLM tokenization
-
token-dict
basic dictionary based tokenization
-
punkt
sentence tokenizer
-
nenyr
The initial version of the Nenyr parser delivers robust foundational capabilities for interpreting Nenyr syntax. It intelligently processes central, layout, and module contexts, handling complex variable…
-
marukov
markov chain text generator
-
segtok
Sentence segmentation and word tokenization tools
-
toktrie_tiktoken
HuggingFace tokenizers library support for toktrie and llguidance
-
rusqlite-ext
Rusqlite extension for building the FTS5 tokenizer
-
fuzzy-pickles
A low-level parser of Rust source code with high-level visitor implementations
-
divvunspell
Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support
-
notmecab
Tokenizes text with MeCab dictionaries. Not a MeCab wrapper.
-
derive-finite-automaton
Procedural macro for generating finite automata
-
indent_tokenizer
Generate tokens based on indentation
-
s-expression
S-expression parser
-
lang_pt
A parser tool for generating recursive descent top-down parsers
-
langbox
A framework for building compilers and interpreters
-
kohaku
tokenizer
-
tokenizer-lib
Tokenization utilities for building parsers in Rust
-
unitoken
Fast BPE tokenizer/trainer with a Rust core and Python bindings
-
kitoken
Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization
-
pinyinchch-type
A utility library for converting pinyin to Chinese characters
-
lens
Unified Lens query language
-
bpe-tokenizer
A BPE Tokenizer library
-
ragit-korean
korean tokenizer for ragit
-
divvunspell-bin
Spellchecker for ZHFST/BHFST spellers, with case handling and tokenization support
-
vaporetto_rules
Rule-base filters for Vaporetto
-
vaporetto_tantivy
Vaporetto Tokenizer for Tantivy
-
rten-text
Text tokenization and other ML pre/post-processing functions
-
chinese_segmenter
Tokenize Chinese sentences using a dictionary-driven largest first matching approach
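Largest-first (greedy longest-match) dictionary segmentation, as described for `chinese_segmenter`, can be sketched in a few lines: at each position, take the longest dictionary entry that matches, falling back to a single character. The `segment` function below is a hypothetical illustration, not the crate's API:

```rust
use std::collections::HashSet;

/// Greedy largest-first matching: at each position, take the longest
/// dictionary word that matches; fall back to a single char otherwise.
fn segment(text: &str, dict: &HashSet<&str>, max_len: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let upper = (chars.len() - i).min(max_len);
        let mut matched = None;
        // Try the longest candidate first, shrinking until a word matches.
        for len in (1..=upper).rev() {
            let cand: String = chars[i..i + len].iter().collect();
            if dict.contains(cand.as_str()) {
                matched = Some((cand, len));
                break;
            }
        }
        match matched {
            Some((word, len)) => { out.push(word); i += len; }
            None => { out.push(chars[i].to_string()); i += 1; }
        }
    }
    out
}

fn main() {
    let dict: HashSet<&str> = ["中国", "人民", "中", "国"].into_iter().collect();
    let segs = segment("中国人民", &dict, 4);
    assert_eq!(segs, vec!["中国", "人民"]); // longest matches win over single chars
}
```

The well-known weakness of this approach is that a greedy longest match can commit to the wrong word early; lattice/Viterbi tokenizers like those in several entries above exist to avoid exactly that.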
-
palate_polyglot_tokenizer
A generic programming language tokenizer
-
tokenizers-enfer
today's most used tokenizers, with a focus on performance and versatility
-
tokenmonster
Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)
-
sentencepiece-sys
Binding for the sentencepiece tokenizer
-
mt_mtc
Tokenizer and parser for the Minot language
-
skimmer
streams reader
-
ellie_tokenizer
Tokenizer for ellie language
-
irg-kvariants
wrapper around kvariant from hfhchan/irg
-
vtext
NLP with Rust
-
instant-clip-tokenizer
Fast text tokenizer for the CLIP neural network
-
alith-models
Load and Download LLM Models, Metadata, and Tokenizers
-
rtf-grimoire
A Rich Text File (RTF) document tokenizer. Useful for writing RTF parsers.
-
tokenizer
Thai text tokenizer
-
izihawa-tantivy-tokenizer-api
Tokenizer API of tantivy
-
wordpieces
Split tokens into word pieces
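WordPiece splitting, as used by BERT-style models, greedily takes the longest vocabulary piece from the front of a word and marks non-initial pieces with a `##` prefix. A sketch under those assumptions — `wordpiece` here is a hypothetical function, not the `wordpieces` crate's API:

```rust
use std::collections::HashSet;

/// Greedy longest-match WordPiece split: repeatedly take the longest
/// vocabulary piece from the front; continuation pieces carry "##".
/// Returns None if the word cannot be covered by the vocabulary.
fn wordpiece(word: &str, vocab: &HashSet<&str>) -> Option<Vec<String>> {
    let mut pieces = Vec::new();
    let mut start = 0;
    while start < word.len() {
        let mut end = word.len();
        let mut found = None;
        while end > start {
            // Only slice on char boundaries (relevant for non-ASCII input).
            if word.is_char_boundary(end) {
                let mut cand = word[start..end].to_string();
                if start > 0 { cand = format!("##{cand}"); }
                if vocab.contains(cand.as_str()) {
                    found = Some((cand, end));
                    break;
                }
            }
            end -= 1;
        }
        let (piece, next) = found?; // no piece matched -> unknown word
        pieces.push(piece);
        start = next;
    }
    Some(pieces)
}

fn main() {
    let vocab: HashSet<&str> = ["un", "##aff", "##able", "##a"].into_iter().collect();
    assert_eq!(
        wordpiece("unaffable", &vocab),
        Some(vec!["un".into(), "##aff".into(), "##able".into()])
    );
    assert_eq!(wordpiece("xyz", &vocab), None); // would map to [UNK] in practice
}
```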
-
punkt_n
Punkt sentence tokenizer
-
daft-functions-tokenize
Tokenization functions for the Daft project
-
mako
main Sidekick AI data processing library
-
gtars-tokenizers
Genomic region tokenizers for machine learning in Rust
-
cang-jie
A Chinese tokenizer for tantivy
-
rustpostal
Rust bindings to libpostal
-
sixel-tokenizer
A tokenizer for serialized Sixel bytes
-
toresy
term rewriting system based on tokenization
-
specmc-base
common code for parsing Minecraft specification
-
tinytoken
tokenizing text into words, numbers, symbols, and more, with customizable parsing options
-
sentencepiece-model
SentencePiece model parser generated from the SentencePiece protobuf definition
-
simple-tokenizer
A tiny no_std tokenizer with line & column tracking
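Line-and-column tracking of the kind `simple-tokenizer` advertises amounts to bumping a counter per character and resetting the column on newlines. A tiny sketch with hypothetical types, not the crate's API:

```rust
/// 1-based line and column position, advanced one char at a time,
/// as a tokenizer reporting error locations might do.
struct Pos { line: usize, col: usize }

fn advance(pos: &mut Pos, ch: char) {
    if ch == '\n' {
        pos.line += 1;
        pos.col = 1; // new line starts at column 1
    } else {
        pos.col += 1;
    }
}

fn main() {
    let mut pos = Pos { line: 1, col: 1 };
    for ch in "ab\ncd".chars() {
        advance(&mut pos, ch);
    }
    assert_eq!((pos.line, pos.col), (2, 3)); // past 'd' on the second line
}
```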
-
yes-lang
Scripting Language
-
rust_transformers
High performance tokenizers for Rust
-
alpino-tokenizer
Wrapper around the Alpino tokenizer for Dutch
-
syntaxdot-tokenizers
Subword tokenizers
-
neca-cmd
command tokenizer used by my Twitch chat bot
-
boost_tokenizer
Boost C++ library boost_tokenizer packaged using Zanbil
-
procedural-masquarade
Incorrect spelling for procedural-masquerade
-
tele_tokenizer
A CSS tokenizer
-
tiniestsegmenter
Compact Japanese segmenter
-
ccgen
generate manually maintained C (and C++) headers
-
tokengeex
efficient tokenizer for code based on UnigramLM and TokenMonster
-
data_vault
Data Vault is a modular, pragmatic, credit card vault for Rust
-
paltoquet
rule-based general-purpose tokenizers
-
uscan
A universal source code scanner
-
sentence
tokenizes English language sentences for use in TTS applications
-
earl-lang-syntax
tokenizer and parser for the language Earl
-
tinysegmenter
Compact Japanese tokenizer
-
aleph-alpha-tokenizer
A fast implementation of a wordpiece-inspired tokenizer
-
caddyfile
working with Caddy's Caddyfile format
-
castle_tokenizer
Castle Tokenizer: tokenizer
-
strizer
minimal and fast library for text tokenization
-
colorblast
Syntax highlighting library for various programming languages, markup languages and various other formats
-
regex-bnf
A deterministic parser for a BNF inspired syntax with regular expressions
-
indentation_flattener
From indented input, generate plain output with indentation PUSH and POP codes
-
nipah_tokenizer
A powerful yet simple text tokenizer for your everyday needs!
-
json-parser
JSON parser
-
gtokenizers
tokenizing genomic data with an emphasis on region set data
-
basic_lexer
Basic lexical analyzer for parsing and compiling
-
tokeneer
tokenizer crate
-
xtoken
Iterator based no_std XML Tokenizer using memchr
-
blingfire
Wrapper for the BlingFire tokenization library
-
rust-lexer
A compiler that generates a Lexer using DFAs (inspired by flex)
-
llm_models
Load and download LLM models, metadata, and tokenizers
-
reflex
A minimal flex-like lexer
-
morsels_lang_ascii
Basic ASCII tokenizer for morsels
-
regex-tokenizer
A regex tokenizer
-
alpino-tokenize
Wrapper around the Alpino tokenizer for Dutch
-
pretok
A string pre-tokenizer for C-like syntaxes
-
scanny
An advanced text scanning library for Rust
-
crossandra
A straightforward tokenization library for seamless text processing
-
morsels_lang_chinese
Chinese tokenizer for morsels
-
text-scanner
A UTF-8 char-oriented, zero-copy, text and code scanning library
-
quote-data
A tokenization Library for Rust
-
tokenate
Does some of the grunt work of writing a tokenizer
-
tokenize_dir
Tokenize file names in directories to access files in a composable way
-
liendl_tokenizer
BPE tokenizer for Rust
-
saku
efficient rule-based Japanese Sentence Tokenizer
-
sylt-tokenizer
Tokenizer for the Sylt programming language
-
tkn-cli
TKN: Quick Tokenizing in the terminal
-
token-iter
Simplifies writing tokenizers
-
any-lexer
Lexers for various programming languages and formats
-
rustpotion
Blazingly fast word embeddings with Tokenlearn
-
rs_html_parser_tokenizer
Rs Html Parser Tokenizer
-
hemtt-tokens
A token library for hemtt
-
tocken
Clustering algorithms
-
khmercut
A blazingly fast Khmer word segmentation tool written in Rust
-
rs_html_parser
Rs Html Parser
-
polyglot_tokenizer
A generic programming language tokenizer
-
brack-tokenizer
The tokenizer for the Brack programming language
-
cssparser-macros
Procedural macros for cssparser