#tokenize

  1. tokenizers

    today's most used tokenizers, with a focus on performance and versatility

    v0.22.2 993K #tokenize #hugging-face #word-piece #bpe #tokenizer
  2. svgtypes

    SVG types parser

    v0.16.1 1.3M #svg-parser #tokenize #svg
  3. xmlparser

    Pull-based, zero-allocation XML parser

    v0.13.6 5.4M #tokenize #xml #tokenizer
  4. markdown

    CommonMark compliant markdown parser in Rust with ASTs and extensions

    v1.0.0 262K #markdown-parser #render-markdown #common-mark #tokenize
  5. charabia

    detect the language, tokenize the text and normalize the tokens

    v0.9.9 43K #tokenize #normalize #tokenizer
  6. html5gum

    A WHATWG-compliant HTML5 tokenizer and tag soup parser

    v0.8.3 40K #html-parser #tokenize #whatwg #html5 #html #tokenizer
  7. sqlite3-parser

    SQL parser (as understood by SQLite)

    v0.15.0 107K #sql-parser #tokenize #sql
  8. htmlparser

    Pull-based, zero-allocation HTML parser

    v0.2.1 62K #tokenize #html-parsing #tokenizer
  9. styx-tokenizer

    Tokenizer for the Styx configuration language

    v1.0.0 #styx #configuration-language #tokenize #document
  10. llm-tokenizer

    LLM tokenizer library with caching and chat template support

    v1.1.0 24K #tokenize #hugging-face #llm #tiktoken #chat-template #tokenizer
  11. lindera-tantivy

    Lindera Tokenizer for Tantivy

    v2.0.0 6.6K #tokenize #lindera #tantivy #tokenizer
  12. sentencepiece

    Binding for the sentencepiece tokenizer

    v0.13.1 10K #tokenize #text-tokenizer #bindings
  13. azul-simplecss

    A very simple CSS 2.1 tokenizer with CSS nesting support

    v0.2.0 7.0K #tokenize #css-parser #css #nested #tokenizer
  14. vaporetto

    pointwise prediction based tokenizer

    v0.6.5 9.4K #tokenize #japanese #analyzer #morphological #tokenizer
  15. erl_tokenize

    Erlang source code tokenizer

    v0.10.0 1.4K #lexer-tokenizer #erlang #lexer #tokenize #tokenizer
  16. bpe

    Fast byte-pair encoding implementation

    v0.2.1 6.9K #tokenize #encoding #algorithm #tokenizer
  17. momoa

    A JSON parsing library suitable for static analysis

    v3.2.5 #ast #json-parser #static-analysis #tokenize
  18. octofhir-fhirpath-parser

    Parser and tokenizer for FHIRPath expressions

    v0.4.20 1.0K #tokenize #parser #fhir #tokenizer #fhirpath
  19. splintr

    Fast Rust BPE tokenizer with Python bindings

    v0.8.0 #tokenize #llm #tiktoken #bpe #gpt #tokenizer
  20. vibrato

    viterbi-based accelerated tokenizer

    v0.5.2 3.3K #tokenize #japanese #tokenizer
  21. tokstream-cli

    CLI token stream simulator using Hugging Face tokenizers

    v0.1.2 #tokenize #streaming #tokenizer #cli
  22. toktrie_hf_tokenizers

    HuggingFace tokenizers library support for toktrie and llguidance

    v1.5.1 70K #structured-output #tokenize #llguidance #toktrie #model #hugging-face #json-schema #context-free-grammar #llama-cpp
  23. pred-recdec

    Predicated Recursive Descent Parsing with BNF and impure hooks

    v0.2.1 #ast #recursion-descent-parser #grammar #bnf #tokenize #ll-parser #recursive-descent #regex #token-stream #lookahead
  24. rwkv-tokenizer

    A fast RWKV Tokenizer

    v0.9.1 900 #rwkv #tokenize #model #testing #world
  25. kanpyo-dict

    Dictionary Library for Kanpyo

    v0.2.0 #kanpyo #japanese #dictionary #tokenize #analyzer
  26. huggingface/tokenizers-python

    💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

    GitHub 0.22.3-dev.0 #tokenize #bert #word-piece #language-model #training #byte-level #bpe #pad #state-of-the-art #py
  27. vibrato-rkyv

    Vibrato: viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading

    v0.7.3 #tokenize #japanese #morphological #analyzer #tokenizer
  28. pinyinchch

    A utility library for converting pinyin to Chinese characters

    v0.2.0 #chinese #pinyin #tokenize #hanzi
  29. bleuscore

    A fast bleu score calculator

    v0.1.6 #tokenize #bleu #deep-learning #tokenizer
  30. lindera-python

    A morphological analysis library and command-line interface

    v2.2.0 #morphological-analysis #tokenize #dictionary #morphological
  31. unscanny

    Painless string scanning

    v0.1.0 1.6M #tokenize #scanning #tokenizing
  32. jayce

    tokenizer 🌌

    v12.1.0 1.1K #tokenize #duo #matching #sync #generic-simd
  33. toak-ocr

    OCR engine with Apple Vision framework support for macOS

    v6.0.0 #macos-framework #tokenize #toak #prompt #repository #ocr #markdown #vision-framework #git #sensitive-information
  34. tergo-formatter

    Formatter for tergo

    v0.2.11 500 #tergo #formatter #tokenize #aqua #directory #latin #code-formatter
  35. tergo-parser

    Parser for tergo

    v0.2.6 340 #tergo #formatting #parser #tokenize #line #latin
  36. tergo-tokenizer

    R language tokenizer

    v0.2.5 300 #tokenize #formatting #tergo #language #line #latin #aqua #code-formatter
  37. europa

    A lightweight AI utilities library for Rust

    v0.0.3 #artificial-intelligence #vector-embedding #vector-search #tokenize #utilities #text-embedding #cosine-similarity #euclidean-distance #semantic-search #vector-database
  38. unobtanium-segmenter

    A text segmentation toolbox for search applications inspired by charabia and tantivy

    v0.5.1 #tokenize #text-segmentation #language #tokenizer
  39. go-brrr

    Token-efficient code analysis for LLMs - Rust implementation

    v0.1.0 #tree-sitter #code-analysis #tokenize #ast #ast-analysis #llm
  40. nlpo3

    Thai natural language processing library, with Python and Node bindings

    v1.4.0 1.0K #tokenize #thai #word-segmentation #tokenizer
  41. sqlite-simple-tokenizer

    A run-time loadable SQLite FTS5 extension that supports Chinese and pinyin word segmentation and search

    v0.5.0 #sqlite-extension #tokenize #pinyin #sqlite #chinese
  42. rust_tokenizers

    High performance tokenizers for Rust

    v8.1.1 6.0K #tokenize #machine-learning #tokenizer
  43. mkcontext

    that provides functionality for creating context

    v0.7.3 800 #glob-pattern #generator #tokenize #context #limit #exclude #chatgpt
  44. libsql-sqlite3-parser

    SQL parser (as understood by SQLite) (libsql fork)

    v0.13.0 45K #sql-parser #tokenize #sql #tokenizer
  45. cuttle

    A large language model inference engine in Rust

    v0.1.1 #inference-engine #language-model #model-inference #tokenize #tensor #text-generation #performance-monitoring #benchmark
  46. turso_sqlite3_parser

    SQL parser (as understood by SQLite)

    v0.2.0-pre.7 1.4K #tokenize #sql-parser #sql #parser
  47. tokstream-core

    Core tokenizer streaming engine for tokstream

    v0.1.2 #tokenize #simulation #streaming #tokenizer
  48. tantivy-tokenizer-api

    Tokenizer API of tantivy

    v0.6.0 543K #tantivy #tokenize #full-text-search #text-indexing #tokenizer-in-charge
  49. text-tokenizer

    Custom text tokenizer

    v0.6.16 #tokenize #text
  50. rust-forth-tokenizer

    A Forth tokenizer written in Rust

    v0.2.1 480 #tokenize #forth #tokenizer
  51. trustformers-tokenizers

    Tokenizers for TrustformeRS

    v0.1.0-alpha.2 #tokenize #word-piece #bpe #tokenizer #nlp-processing
  52. scanlex

    lexical scanner for parsing text into tokens

    v0.1.4 259K #tokenize #input #tokenize-text #scan
  53. avila-tokenizers

    The most complete tokenizer library in Rust - BPE, WordPiece, Unigram, with native support for GPT, BERT, Llama, Claude

    v0.1.0 #tokenize #bert #llm #nlp #gpt #tokenizer
  54. syn_derive

    Derive macros for syn::Parse and quote::ToTokens

    v0.2.0 359K #macro-derive #to-tokens #quote #parser #tokenize #macro-parser
  55. lexers

    Tools for tokenizing and scanning

    v0.1.4 300 #lexer-tokenizer #ebnf #lexer #tokenize #tokenizer
  56. sqlite-jieba-tokenizer

    A run-time loadable SQLite FTS5 extension that supports Chinese and English word segmentation and search

    v0.5.0 #sqlite-extension #tokenize #sqlite #chinese #english
  57. tokenx-rs

    Fast token count estimation for LLMs at 96% accuracy without a full tokenizer

    v0.1.0 #tokenize #llm #claude #gpt #tokenizer
  58. udled

    Tokenizer and parser

    v0.6.2 #tokenize #lexer #parser
  59. sqlite-charabia-tokenizer

    A run-time loadable SQLite FTS5 extension that supports Chinese and English word segmentation and search

    v0.5.0 #sqlite-extension #tokenize #sqlite #charabia
  60. limbo_sqlite3_parser

    SQL parser (as understood by SQLite)

    v0.0.22 550 #sql-parser #tokenize #sql #parser
  61. flat-cli

    Flatten codebases into AI-friendly format

    v0.4.0 #artificial-intelligence #tokenize #flatten
  62. lindera-tokenizer

    A morphological analysis library

    v0.32.3 8.2K #morphological-analysis #tokenize #tokenizer
  63. language-tokenizer

    Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, Thai etc.

    v0.1.0 #tokenize #language #text-tokenizer #tokenizer
  64. makepad-live-tokenizer

    Makepad platform live DSL tokenizer

    v1.0.0 500 #tokenize #makepad #platform #dsl #live #wasm
  65. rustbpe

    A BPE (Byte Pair Encoding) tokenizer written in Rust with Python bindings

    v0.1.0 #python-bindings #tokenize #training #bpe #byte-pair #tiktoken #gpt-4 #regex
  66. roketok

    way to simply set up a tokenizer and use it. Not recommended for simple tokenizers as this crate adds a bunch of stuff to support many if not all kinds of tokenizers

    v0.3.1 490 #tokenize #setup #focused #kinds #not-recommended #warnings
  67. axonml-text

    Text processing utilities for the Axonml ML framework

    v0.2.8 #dataset #tokenize #ngrams #axonml #language-modeling #vocabulary #synthetic #sequence-to-sequence #white-space #unigram
  68. toktrie_hf_downloader

    HuggingFace Hub download library support for toktrie and llguidance

    v1.5.1 #structured-output #llguidance #tokenize #openai #hugging-face #context-free-grammar #json-schema #toktrie #llama #llama-cpp
  69. mipl

    Minimal Imperative Parsing Library

    v0.2.1 #tokenize #token-stream #tokenizer #parser
  70. tekken-rs

    Mistral Tekken tokenizer with audio support

    v0.1.1 320 #tokenize #artificial-intelligence #mistral #audio #nlp #tokenizer
  71. mecab-ko-core

    Core engine for Korean morphological analysis: lattice, Viterbi, tokenizer

    v0.1.0 #tokenize #korean #viterbi #nlp #morphology #tokenizer
  72. quicktok

    Minimal, fast, multi-threaded implementation of the Byte Pair Encoding (BPE) for LLM tokenization

    v0.2.0 #byte-pair #multi-threading #bpe #tokenize #llm
  73. token-dict

    basic dictionary based tokenization

    v1.0.0 #tokenize #dictionary #text-tokenization #split
  74. punkt

    sentence tokenizer

    v1.0.5 #tokenize #sentence #tokenizer
  75. nenyr

    initial version of the Nenyr parser delivers robust foundational capabilities for interpreting Nenyr syntax. It intelligently processes central, layout, and module contexts, handling complex variable…

    v1.0.0-beta.1 370 #domain-specific-language #css #css-framework #themes #tokenize #seamless-integration #breakpoints #context-aware #animation #dsl
  76. marukov

    markov chain text generator

    v0.0.2 #markov-chain #generator #text-generator #text-generation #tokenize #generations
  77. segtok

    Sentence segmentation and word tokenization tools

    v0.1.5 11K #tokenize #split #tokenizer #word
  78. toktrie_tiktoken

    HuggingFace tokenizers library support for toktrie and llguidance

    v1.5.0 #tokenize #structured-output #llguidance #openai #json-schema #hugging-face #context-free-grammar #toktrie #llama #llama-cpp
  79. rusqlite-ext

    Rusqlite extension for building the FTS5 tokenizer

    v0.38.0 #sqlite-extension #tokenize #rusqlite #sqlite
  80. fuzzy-pickles

    A low-level parser of Rust source code with high-level visitor implementations

    v0.1.1 #tokenize #rust #tokenizer
  81. divvunspell

    Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support

    v1.0.0-beta.3 #spell-check #tokenize #zhfst #suggestions #archive #bhfst #memory-map #hfst-ospell #alphabet #fst
  82. notmecab

    tokenizing text with mecab dictionaries. Not a mecab wrapper.

    v0.5.1 #tokenize #mecab #dictionary #clone #version #utf-8 #tokenize-text
  83. derive-finite-automaton

    Procedural macro for generating finite automaton

    v0.3.0 #finite-automata #tokenize #tokenization
  84. indent_tokenizer

    Generate tokens based on indentation

    v0.4.0 #indentation #tokenize #tokenizer
  85. s-expression

    parser

    v0.2.0 #expression-parser #s-expr #zero-copy-parser #tokenize #borrowing #preallocated #numbers-parser #parser-compiler #performance-optimization #interpreter
  86. lang_pt

    A parser tool to generate recursive descent top down parser

    v0.1.2 #top-down-parser #tokenize #recursive-descent #parser
  87. langbox

    framework to build compilers and interpreters

    v0.6.0 440 #lexer #lexer-tokenizer #parser-combinator #tokenize
  88. kohaku

    tokenizer

    v0.1.5 #tokenize #abc #ok
  89. tokenizer-lib

    Tokenization utilities for building parsers in Rust

    v1.6.0 100 #tokenize #parser #tokenization
  90. unitoken

    Fast BPE tokenizer/trainer with a Rust core and Python bindings

    v0.1.1 #tokenize #bpe #nlp #tokenizer
  91. kitoken

    Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization

    v0.10.1 2.5K #tokenize #word-piece #unigram #bpe #tokenizer
  92. pinyinchch-type

    A utility library for converting pinyin to Chinese characters

    v0.2.0 #chinese #pinyin #tokenize #hanzi
  93. lens

    Unified Lens query language

    v0.2.0 #query-language #query-engine #structured-data #expression #tokenize #navigating
  94. bpe-tokenizer

    A BPE Tokenizer library

    v0.1.4 150 #byte-pair #tokenize #bpe #encoding #byte
  95. ragit-korean

    korean tokenizer for ragit

    v0.4.5 #korean #tokenize #ragit #document #convert
  96. divvunspell-bin

    Spellchecker for ZHFST/BHFST spellers, with case handling and tokenization support

    v1.0.0 #spell-check #tokenize #archive #zhfst #divvunspell #bhfst #suggestions #hfst-ospell #morphological-analysis
  97. vaporetto_rules

    Rule-based filters for Vaporetto

    v0.6.5 850 #japanese #tokenize #morphological #analyzer #tokenizer
  98. vaporetto_tantivy

    Vaporetto Tokenizer for Tantivy

    v0.24.0 1.8K #tantivy #tokenize #japanese
  99. rten-text

    Text tokenization and other ML pre/post-processing functions

    v0.24.0 #tokenize #token-id #text-tokenization #text-tokenizer #hugging-face #post-processing #bert #transformer-models #eg #machine-learning
  100. chinese_segmenter

    Tokenize Chinese sentences using a dictionary-driven, largest-first matching approach

    v1.0.1 #chinese #tokenize #hanzi #segment #localization
  101. palate_polyglot_tokenizer

    A generic programming language tokenizer

    v0.2.1 #tokenize #generics #polyglot #programming-language #line-comment #block-comment
  102. tokenizers-enfer

    today's most used tokenizers, with a focus on performance and versatility

    v0.21.1 #tokenize #hugging-face #word-piece #bpe #tokenizer
  103. tokenmonster

    Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)

    v0.1.0 #tokenize #tiktoken #nlp #tokenizer
  104. sentencepiece-sys

    Binding for the sentencepiece tokenizer

    v0.13.1 6.2K #bindings #tokenize #text-tokenizer #dynamic-linking #version #pkg-config #build-script
  105. mt_mtc

    Tokenizer and parser for the Minot language

    v0.5.2 #robotics #tokenize #minot
  106. skimmer

    streams reader

    v0.0.3 #stream-reader #byte-stream #tokenize
  107. ellie_tokenizer

    Tokenizer for ellie language

    v0.7.3 #tokenize #ellie #embedded #item #language
  108. irg-kvariants

    wrapper around kvariant from hfhchan/irg

    v0.1.1 39K #kvariant #kvariants #irg #hfhchan #tokenize
  109. vtext

    NLP with Rust

    v0.2.0 #tf-idf #tokenize #levenshtein #text-processing
  110. instant-clip-tokenizer

    Fast text tokenizer for the CLIP neural network

    v0.1.0 4.5K #neural-network-clip #tokenize #text-tokenizer #model #instant #python-bindings
  111. alith-models

    Load and Download LLM Models, Metadata, and Tokenizers

    v0.4.3 #gguf #model #tokenize #hugging-face #metadata #llm #embedding #artificial-intelligence
  112. rtf-grimoire

    A Rich Text File (RTF) document tokenizer. Useful for writing RTF parsers.

    v0.2.1 #rtf #rich-text #tokenize
  113. tokenizer

    Thai text tokenizer

    v0.1.2 #tokenize #localization #thai #text-tokenizer #tokeniser
  114. izihawa-tantivy-tokenizer-api

    Tokenizer API of tantivy

    v0.25.0 500 #tokenize #tantivy #full-text-search #tokenizer-in-charge #text-indexing #search-engine
  116. wordpieces

    Split tokens into word pieces

    v0.6.1 110 #word-piece #tokenize #piece #wordpiece #word
  117. punkt_n

    Punkt sentence tokenizer

    v1.0.5 #tokenize #sentence #punkt #tokenizer
  118. daft-functions-tokenize

    Tokenization functions for the Daft project

    v0.1.0 #tokenize #daft #functions-for-daft
  119. mako

    main Sidekick AI data processing library

    v0.3.0 #artificial-intelligence #data-processing #tokenize #data-loader #machine-learning #batch-processing #dataflow #sidekick #tokenized
  120. gtars-tokenizers

    Genomic region tokenizers for machine learning in Rust

    v0.5.2 #machine-learning #tokenize #genomics #gtars #region #overlap #genomic-region #genomic-data
  121. cang-jie

    A Chinese tokenizer for tantivy

    v0.18.0 #tokenize #tantivy #chinese #search
  122. rustpostal

    Rust bindings to libpostal

    v0.3.0 #libpostal #address-parser #street-address #expand #tokenize #street-addresses
  123. sixel-tokenizer

    A tokenizer for serialized Sixel bytes

    v0.1.0 3.9K #sixel #tokenize #byte #serialization #events #coordinate-system
  124. toresy

    term rewriting system based on tokenization

    v0.5.0 #tokenize #system #rewriting #rewriting-rules #input #text-output
  125. specmc-base

    common code for parsing Minecraft specification

    v0.1.11 490 #minecraft #tokenize #specification #identifier #base #common-code
  126. tinytoken

    tokenizing text into words, numbers, symbols, and more, with customizable parsing options

    v0.1.4 130 #tokenize #numbers #text-input #tokenizer
  127. sentencepiece-model

    SentencePiece model parser generated from the SentencePiece protobuf definition

    v0.1.4 7.4K #sentence-piece #tokenize #machine-learning
  128. simple-tokenizer

    A tiny no_std tokenizer with line & column tracking

    v0.4.2 250 #tokenize #column #no-alloc
  129. yes-lang

    Scripting Language

    v0.1.0 #scripting #repl #tokenize #infix #prefix #lexer #interpreter #type-safety #multi-line #loc
  130. rust_transformers

    High performance tokenizers for Rust

    v0.2.0 #tokenize #transformer-models #bert #byte-pair #py #bpe #word-piece #integration-tests #rust-nightly
  131. alpino-tokenizer

    Wrapper around the Alpino tokenizer for Dutch

    v0.4.0 #tokenize #finite-state-transducer #dutch #alpino #principles #text-tokenizer
  132. syntaxdot-tokenizers

    Subword tokenizers

    v0.5.0 #syntax-dot #tokenize #sentence-piece #labeling #bert #subword #lemmatization #biaffine-parser #word-piece #morphology
  133. neca-cmd

    command tokenizer used by my Twitch chat bot

    v0.3.0 250 #tokenize #chat-bot #twitch
  134. boost_tokenizer

    Boost C++ library boost_tokenizer packaged using Zanbil

    v0.1.0 #tokenize #boost #zanbil #packaged #io-stream #badge
  135. procedural-masquarade

    Incorrect spelling for procedural-masquerade

    v0.2.0 #css #css-parser #tokenize #detect #level #spelling #incorrect #character-encoding #syntax-tree
  136. tele_tokenizer

    A CSS tokenizer

    v0.2.0 #tokenize #css #telecss #tokenizer
  137. tiniestsegmenter

    Compact Japanese segmenter

    v0.3.0 750 #japanese #tokenize #ngrams
  138. ccgen

    generate manually maintained C (and C++) headers

    v0.2.0 #header #generate #tok #generator #tokenize
  139. tokengeex

    efficient tokenizer for code based on UnigramLM and TokenMonster

    v1.1.0 900 #tokenize #llm #codegeex #tokenizer
  140. data_vault

    Data Vault is a modular, pragmatic, credit card vault for Rust

    v0.3.4 #credit-card #encryption #credits #vault #tokenize #aes-gcm-siv #postgresql #blake3 #redis #encryption-key
  141. paltoquet

    rule-based general-purpose tokenizers

    v0.11.0 340 #tokenize #rule-based
  142. uscan

    A universal source code scanner

    v0.1.3 #tokenize #compiler #tokenizer
  143. sentence

    tokenizes English language sentences for use in TTS applications

    v0.0.2 #tokenize #english
  144. earl-lang-syntax

    tokenizer and parser for the language Earl

    v1.0.0 #tokenize #syntax #s-expr #language-syntax #multi-line #syntax-parser
  145. tinysegmenter

    Compact Japanese tokenizer

    v0.1.1 1.3K #japanese #tokenize #compact
  146. aleph-alpha-tokenizer

    A fast implementation of a wordpiece-inspired tokenizer

    v0.3.1 #tokenize #aleph-alpha #nlp #tokenizer
  147. caddyfile

    working with Caddy's Caddyfile format

    v0.1.1 750 #tokenize #format #caddy #testing #github
  148. castle_tokenizer

    Castle Tokenizer: tokenizer

    v0.20.2 180 #tokenize #castle
  149. strizer

    minimal and fast library for text tokenization

    v0.1.0 #text-tokenization #tokenize #string-tokenizer
  150. colorblast

    Syntax highlighting library for various programming languages, markup languages and various other formats

    v0.0.3 #syntax-highlighting #tokenize #highlighter
  151. regex-bnf

    A deterministic parser for a BNF inspired syntax with regular expressions

    v0.1.2 #regex #bnf #syntax #tokenize #syntax-parser #grammar #grammar-parser #csv-parser
  152. indentation_flattener

    From indented input, generate plain output with indentation PUSH and POP codes

    v0.1.0 #indentation #tokenize #parser
  153. nipah_tokenizer

    A powerful yet simple text tokenizer for your everyday needs!

    v0.1.0 #tokenize #text-tokenizer #nlp #tokenizer
  154. json-parser

    JSON parser

    v1.0.2 #tokenize #json #tokenizer
  155. gtokenizers

    tokenizing genomic data with an emphasis on region set data

    v0.0.18 #genomics #genomic-data #tokenize #region #machine-learning #emphasis
  156. basic_lexer

    Basic lexical analyzer for parsing and compiling

    v0.2.1 #tokenize #lexical-analysis #white-space #tokenizer
  157. tokeneer

    tokenizer crate

    v0.1.0 340 #tokenize #bpe #tokenizer
  158. xtoken

    Iterator based no_std XML Tokenizer using memchr

    v0.1.1 #tokenize #iterator #xml #memchr #byte-slice
  159. blingfire

    Wrapper for the BlingFire tokenization library

    v1.0.0 1.3K #tokenize #machine-learning #tokenizer
  160. rust-lexer

    A compiler that generates a Lexer using DFAs (inspired by flex)

    v0.2.0 #compiler #dfa #generator #tokenize #flex #output-file #input-file
  161. llm_models

    Load and download LLM models, metadata, and tokenizers

    v0.0.3 310 #gguf #model #tokenize #metadata #llm #artificial-intelligence #candle
  162. reflex

    A minimal flex-like lexer

    v0.1.2 #lexer #tokenize #flex-like
  163. morsels_lang_ascii

    Basic ascii tokenizer for morsels

    v0.7.3 #ascii #language #morsels #tokenize #tokenizer-for-morsels
  164. regex-tokenizer

    A regex tokenizer

    v0.1.1 #tokenize #regex #tokenizer
  165. alpino-tokenize

    Wrapper around the Alpino tokenizer for Dutch

    v0.4.0 #tokenize #finite-state-transducer #alpino-tokenizer #dutch #command-line-tool
  166. pretok

    A string pre-tokenizer for C-like syntaxes

    v0.1.0 #lexer-tokenizer #lexer #text #tokenize #tokenizer
  167. scanny

    An advanced text scanning library for Rust

    v0.1.0 #tokenize #lexical-token #tokenizer #parser
  168. crossandra

    A straightforward tokenization library for seamless text processing

    v0.0.2 #tokenize #tokenization-for-seamless #regex #tokenize-text #processing #lexer
  169. morsels_lang_chinese

    Chinese tokenizer for morsels

    v0.7.3 100 #chinese #morsels #language #tokenize #tokenizer-for-morsels
  170. text-scanner

    A UTF-8 char-oriented, zero-copy, text and code scanning library

    v0.0.3 #lexer #tokenize #streaming-parser
  171. quote-data

    A tokenization Library for Rust

    v1.0.0 #proc-macro #tokenize #macro-derive #struct #quote
  172. tokenate

    do some grunt work of writing a tokenizer

    v0.1.0 #tokenize #inner #parse
  173. tokenize_dir

    Tokenize file names in directories to access files in a composable way

    v0.1.0 #filenames #file-access #directory #tokenize #composable
  174. liendl_tokenizer

    BPE tokenizer for Rust

    v0.1.0 #tokenize #training #vocabulary #character #model #tokenize-text #bpe #csv #different-versions #convert-text
  175. saku

    efficient rule-based Japanese Sentence Tokenizer

    v0.1.6 #japanese #tokenize #sentence #split #rule-based
  176. sylt-tokenizer

    Tokenizer for the Sylt programming language

    v0.1.0 #sylt #tokenize #programming-language #rc #dynamically-typed #game
  177. tkn-cli

    TKN: Quick Tokenizing in the terminal

    v0.1.1 #tokenize #cli #productivity #cli-productivity
  178. token-iter

    that simplifies writing tokenizers

    v0.1.0 #tokenize #language #textual #read
  179. any-lexer

    Lexers for various programming languages and formats

    v0.0.3 #tokenize #lexer #streaming-parser
  180. rustpotion

    Blazingly fast word embeddings with Tokenlearn

    v0.3.0 #tokenize #embedding #rag #model2vec #tokenizer
  181. rs_html_parser_tokenizer

    Rs Html Parser Tokenizer

    v0.0.10 #html-parser #tokenize #browser #handle #tags #parser-error #processing-instructions #closing #case-insensitive #notes
  182. hemtt-tokens

    A token library for hemtt

    v1.0.0 #hemtt #tokenize #arma
  183. tocken

    Clustering algorithms

    v0.1.0 150 #machine-learning #tokenize #vector-search #text-tokenizer #text
  184. khmercut

    A blazingly fast Khmer word segmentation tool written in Rust

    v0.1.5 #word-segmentation #tokenize #khmer #tool #cargo-run
  185. rs_html_parser

    Rs Html Parser

    v0.0.10 #html-parser #tokenize #browser #tags #processing-instructions
  186. polyglot_tokenizer

    A generic programming language tokenizer

    v0.2.1 2.7K #tokenize #polyglot #generics #programming-language #line-comment #block-comment
  187. brack-tokenizer

    The tokenizer for the Brack programming language

    v0.1.0 #tokenize #brack
  188. cssparser-macros

    Procedural macros for cssparser

    v0.6.1 1.3M #css-parser #proc-macro #tokenize #byte #level