#tokenizer

  1. logos

    Create ridiculously fast Lexers

    v0.16.1 4.8M #lexer-tokenizer #lexer #no-std #lexical #tokenizer
  2. tokenizers

    Today's most used tokenizers, with a focus on performance and versatility

    v0.22.2 993K #tokenize #hugging-face #word-piece #bpe #tokenizer
  3. xmlparser

    Pull-based, zero-allocation XML parser

    v0.13.6 5.4M #tokenize #xml #tokenizer
  4. text-splitter

    Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from Rust and Python.

    v0.29.3 84K #artificial-intelligence #split #tokenizer
  5. charabia

    detect the language, tokenize the text and normalize the tokens

    v0.9.9 43K #tokenize #normalize #tokenizer
  6. html5gum

    A WHATWG-compliant HTML5 tokenizer and tag soup parser

    v0.8.3 40K #html-parser #tokenize #whatwg #html5 #html #tokenizer
  7. htmlparser

    Pull-based, zero-allocation HTML parser

    v0.2.1 62K #tokenize #html-parsing #tokenizer
  8. llm-tokenizer

    LLM tokenizer library with caching and chat template support

    v1.1.0 24K #tokenize #hugging-face #llm #tiktoken #chat-template #tokenizer
  9. lindera-tantivy

    Lindera Tokenizer for Tantivy

    v2.0.0 6.6K #tokenize #lindera #tantivy #tokenizer
  10. scnr2

    Scanner/Lexer with regex patterns and multiple modes

    v0.5.0 1.4K #lexer-tokenizer #lexer #tokenizer
  11. erl_tokenize

    Erlang source code tokenizer

    v0.10.0 1.4K #lexer-tokenizer #erlang #lexer #tokenize #tokenizer
  12. azul-simplecss

    A very simple CSS 2.1 tokenizer with CSS nesting support

    v0.2.0 7.0K #tokenize #css-parser #css #nested #tokenizer
  13. elyze

    Extensible general-purpose parser framework that can parse any type of data without allocation

    v1.5.5 1.8K #lexer-tokenizer #lexer #tokenizer
  14. vaporetto

    pointwise prediction based tokenizer

    v0.6.5 9.4K #tokenize #japanese #analyzer #morphological #tokenizer
  15. normy

    Ultra-fast, zero-copy text normalization for Rust NLP pipelines & tokenizers

    v0.1.4 #nlp #zero-copy #llm #normalization #tokenizer
  16. bpe

    Fast byte-pair encoding implementation

    v0.2.1 6.9K #tokenize #encoding #algorithm #tokenizer
  17. splintr

    Fast Rust BPE tokenizer with Python bindings

    v0.8.0 #tokenize #llm #tiktoken #bpe #gpt #tokenizer
  18. octofhir-fhirpath-parser

    Parser and tokenizer for FHIRPath expressions

    v0.4.20 1.0K #tokenize #parser #fhir #tokenizer #fhirpath
  19. tokstream-cli

    CLI token stream simulator using Hugging Face tokenizers

    v0.1.2 #tokenize #streaming #tokenizer #cli
  20. logos-codegen

    Create ridiculously fast Lexers

    v0.16.1 3.5M #lexer-tokenizer #lexer #lexical #no-std #tokenizer
  21. vibrato

    viterbi-based accelerated tokenizer

    v0.5.2 3.3K #tokenize #japanese #tokenizer
  22. noa-parser

    Noa parser is an extensible general-purpose parser framework that can parse any type of data without allocation

    v0.7.4 430 #lexer-tokenizer #lexer #tokenizer
  23. libsimple

    Rust bindings to simple, a SQLite3 fts5 tokenizer which supports Chinese and PinYin

    v0.6.1 1.2K #sqlite-extension #fts5 #sqlite #tokenizer
  24. smoltok-core

    Byte-Pair Encoding tokenizer implementation in Rust

    v0.1.1 #bpe #encoding #text-processing #tokenizer
  25. llm-utl

    Convert code repositories into LLM-friendly prompts with smart chunking and filtering

    v0.1.5 #code-analysis #llm #prompt #tokenizer
  26. wordchipper

    HPC Rust LLM Tokenizer Library

    v0.6.2 #bpe #gpt #tokenizer
  27. vibrato-rkyv

    Vibrato: viterbi-based accelerated tokenizer with rkyv support for fast dictionary loading

    v0.7.3 #tokenize #japanese #morphological #analyzer #tokenizer
  28. bbpe

    Binary byte pair encoding (BPE) trainer and CLI compatible with Hugging Face tokenizers

    v0.6.3 #malware #hugging-face #bpe #binary #tokenizer
  29. bleuscore

    A fast bleu score calculator

    v0.1.6 #tokenize #bleu #deep-learning #tokenizer
  30. scnr2_generate

    Scanner/Lexer with regex patterns and multiple modes

    v0.5.0 1.2K #lexer-tokenizer #lexer #tokenizer
  31. lexerus

    annotated lexer

    v0.1.9 #lexer-tokenizer #lexer #tokeniser #tokenizer
  32. tantivy-stemmers

    A collection of Tantivy stemmer tokenizers

    v0.4.0 6.4K #stemming #tantivy #tokenizer
  33. unobtanium-segmenter

    A text segmentation toolbox for search applications inspired by charabia and tantivy

    v0.5.1 #tokenize #text-segmentation #language #tokenizer
  34. scnr

    Scanner/Lexer with regex patterns and multiple modes

    v0.8.0 440 #lexer-tokenizer #lexer #tokenizer
  35. nlpo3

    Thai natural language processing library, with Python and Node bindings

    v1.4.0 1.0K #tokenize #thai #word-segmentation #tokenizer
  36. rust_tokenizers

    High performance tokenizers for Rust

    v8.1.1 6.0K #tokenize #machine-learning #tokenizer
  37. gemini-tokenizer

    Authoritative Gemini tokenizer for Rust, ported from the official Google Python GenAI SDK

    v0.2.0 #google #sentence-piece #gemini #llm #tokenizer
  38. libsql-sqlite3-parser

    SQL parser (as understood by SQLite) (libsql fork)

    v0.13.0 45K #sql-parser #tokenize #sql #tokenizer
  39. svgrtypes

    SVG types parser

    v0.44.2 #svg-parser #svg #tokenizer
  40. rust-forth-tokenizer

    A Forth tokenizer written in Rust

    v0.2.1 480 #tokenize #forth #tokenizer
  41. tokstream-core

    Core tokenizer streaming engine for tokstream

    v0.1.2 #tokenize #simulation #streaming #tokenizer
  42. logos-cli

    Create ridiculously fast Lexers

    v0.16.1 #lexer-tokenizer #lexer #lexical #no-std #tokenizer
  43. trustformers-tokenizers

    Tokenizers for TrustformeRS

    v0.1.0-alpha.2 #tokenize #word-piece #bpe #tokenizer #nlp-processing
  44. lexers

    Tools for tokenizing and scanning

    v0.1.4 300 #lexer-tokenizer #ebnf #lexer #tokenize #tokenizer
  45. avila-tokenizers

    The most complete tokenizer library in Rust - BPE, WordPiece, Unigram, with native support for GPT, BERT, Llama, Claude

    v0.1.0 #tokenize #bert #llm #nlp #gpt #tokenizer
  46. bundle_repo

    Pack a local or remote Git Repository to XML for LLM Consumption

    v0.6.0 440 #git #llm #tokenizer
  47. mecab-ko

    Korean morphological analyzer - a pure Rust implementation of MeCab-Ko

    v0.1.0 #korean #nlp #morphology #mecab #tokenizer
  48. bpe-openai

    Prebuilt fast byte-pair encoders for OpenAI

    v0.3.0 6.7K #bpe #algorithm #tokenizer
  49. tokenx-rs

    Fast token count estimation for LLMs at 96% accuracy without a full tokenizer

    v0.1.0 #tokenize #llm #claude #gpt #tokenizer
  50. lindera-tokenizer

    A morphological analysis library

    v0.32.3 8.2K #morphological-analysis #tokenize #tokenizer
  51. language-tokenizer

    Text tokenizer for linguistic purposes, such as text matching. Supports more than 40 languages, including English, French, Russian, Japanese, Thai etc.

    v0.1.0 #tokenize #language #text-tokenizer #tokenizer
  52. html5tokenizer

    An HTML5 tokenizer with code span support

    v0.5.2 180 #html-parser #html5 #whatwg #tokenizer
  53. mipl

    Minimal Imperative Parsing Library

    v0.2.1 #tokenize #token-stream #tokenizer #parser
  54. tekken-rs

    Mistral Tekken tokenizer with audio support

    v0.1.1 320 #tokenize #artificial-intelligence #mistral #audio #nlp #tokenizer
  55. mecab-ko-core

    Core engine for Korean morphological analysis - lattice, Viterbi, tokenizer

    v0.1.0 #tokenize #korean #viterbi #nlp #morphology #tokenizer
  56. segtok

    Sentence segmentation and word tokenization tools

    v0.1.5 11K #tokenize #split #tokenizer #word
  57. punkt

    sentence tokenizer

    v1.0.5 #tokenize #sentence #tokenizer
  58. alkale

    LL(1) lexer library for Rust

    v2.0.0 490 #lexer-tokenizer #lexer #tokenizer
  59. fuzzy-pickles

    A low-level parser of Rust source code with high-level visitor implementations

    v0.1.1 #tokenize #rust #tokenizer
  60. kitoken

    Fast and versatile tokenizer for language models, supporting BPE, Unigram and WordPiece tokenization

    v0.10.1 2.5K #tokenize #word-piece #unigram #bpe #tokenizer
  61. indent_tokenizer

    Generate tokens based on indentation

    v0.4.0 #indentation #tokenize #tokenizer
  62. vaporetto_rules

    Rule-based filters for Vaporetto

    v0.6.5 850 #japanese #tokenize #morphological #analyzer #tokenizer
  63. unitoken

    Fast BPE tokenizer/trainer with a Rust core and Python bindings

    v0.1.1 #tokenize #bpe #nlp #tokenizer
  64. sql-script-parser

    iterates over SQL statements in SQL script

    v0.1.2 #sql #mysql #sql-parser #tokenizer
  65. tokenizers-enfer

    Today's most used tokenizers, with a focus on performance and versatility

    v0.21.1 #tokenize #hugging-face #word-piece #bpe #tokenizer
  66. tokenmonster

    Greedy tiktoken-like tokenizer with embedded vocabulary (cl100k-base approximator)

    v0.1.0 #tokenize #tiktoken #nlp #tokenizer
  67. giron

    ECMAScript parser which outputs ESTree JSON

    v0.1.2 #javascript-parser #javascript #tokenizer
  68. code-splitter

    Split code into semantic chunks using tree-sitter

    v0.1.5 260 #split #code #tokenizer
  69. blex

    A lightweight lexing framework

    v0.2.2 #lexer-tokenizer #lexer #tokenizer
  70. tantivy-czech-stemmer

    Czech stemmer as Tantivy tokenizer

    v0.2.1 #stemming #czech #tantivy #tokenizer
  71. tokenizer

    Thai text tokenizer

    v0.1.2 #tokenize #localization #thai #text-tokenizer #tokeniser
  72. punkt_n

    Punkt sentence tokenizer

    v1.0.5 #tokenize #sentence #punkt #tokenizer
  74. javascript_lexer

    Javascript lexer

    v0.1.8 #lexer-tokenizer #lexer #javscript #tokenizer
  75. elizaos-plugin-local-embedding

    Local text embedding and tokenization plugin for elizaOS - Rust implementation

    v2.0.0 #artificial-intelligence #ai-agent #local-ai #tokenizer
  76. tinytoken

    Tokenizes text into words, numbers, symbols, and more, with customizable parsing options

    v0.1.4 130 #tokenize #numbers #text-input #tokenizer
  77. regex-lexer

    A regex-based lexer (tokenizer)

    v0.2.0 410 #lexer-tokenizer #lexer #regex-parser #tokenizer
  78. lexariel

    Lexical analyzer for Asmodeus language

    v0.1.0 #lexer-tokenizer #assembly #asmodeus #machine-w #lexer #tokenizer
  79. smoltoken

    A fast library for Byte Pair Encoding (BPE) tokenization

    v0.2.0 360 #artificial-intelligence #bpe #tokenizer
  80. tele_tokenizer

    A CSS tokenizer

    v0.2.0 #tokenize #css #telecss #tokenizer
  81. tokengeex

    efficient tokenizer for code based on UnigramLM and TokenMonster

    v1.1.0 900 #tokenize #llm #codegeex #tokenizer
  82. uscan

    A universal source code scanner

    v0.1.3 #tokenize #compiler #tokenizer
  83. regex-lexer-lalrpop

    A regex-based lexer (tokenizer)

    v0.3.0 #lexer-tokenizer #regex-lexer #lexer #regex #regex-parser #tokenizer
  84. char-lex

    Create easy enum based lexers

    v1.0.5 #lexer #lexer-tokenizer #char #lexing #tokenizer
  85. aleph-alpha-tokenizer

    A fast implementation of a wordpiece-inspired tokenizer

    v0.3.1 #tokenize #aleph-alpha #nlp #tokenizer
  86. xxcalc

    Embeddable or standalone robust floating-point polynomial calculator

    v0.2.1 #lexer-tokenizer #evaluator #lexer #calculator #math #tokenizer
  87. pgn-lexer

    A lexer for PGN files for chess. Provides an iterator over the tokens from a byte stream.

    v0.2.0-alpha #pgn #lexer #chess #lexer-tokenizer #tokenizer
  88. tokenise

    A flexible tokeniser library for parsing text

    v0.1.0 #lexer-tokenizer #lexer #tokenizer
  89. simple-cursor

    A super simple character cursor implementation geared towards lexers/tokenizers

    v0.1.1 #lexer-tokenizer #lexer #string #iterator #cursor #no-alloc #tokenizer
  90. nipah_tokenizer

    A powerful yet simple text tokenizer for your everyday needs!

    v0.1.0 #tokenize #text-tokenizer #nlp #tokenizer
  91. tokeneer

    tokenizer crate

    v0.1.0 340 #tokenize #bpe #tokenizer
  92. json-parser

    JSON parser

    v1.0.2 #tokenize #json #tokenizer
  93. blingfire

    Wrapper for the BlingFire tokenization library

    v1.0.0 1.3K #tokenize #machine-learning #tokenizer
  94. basic_lexer

    Basic lexical analyzer for parsing and compiling

    v0.2.1 #tokenize #lexical-analysis #white-space #tokenizer
  95. regex-tokenizer

    A regex tokenizer

    v0.1.1 #tokenize #regex #tokenizer
  96. pretok

    A string pre-tokenizer for C-like syntaxes

    v0.1.0 #lexer-tokenizer #lexer #text #tokenize #tokenizer
  97. scanny

    An advanced text scanning library for Rust

    v0.1.0 #tokenize #lexical-token #tokenizer #parser
  98. sana

    Create lexers easily

    v0.1.1 #lexer-tokenizer #lexer-generator #lexer #generator #tokenizer
  99. gpt_tokenizer

    Rust BPE Encoder Decoder (Tokenizer) for GPT-2 / GPT-3

    v0.1.0 #chatgpt #gpt-3 #bpe #openai #tokenizer
  100. rustpotion

    Blazingly fast word embeddings with Tokenlearn

    v0.3.0 #tokenize #embedding #rag #model2vec #tokenizer
  101. token

    string-tokenizer (and sentence splitter) Note: If you find that you would like to use the name for something more appropriate, please just send me a mail at jaln at itu dot dk

    v1.0.0-rc1 #string-tokenizer #splitter #sentence #string #tokenizer
  102. tuker

    A small tokenizer/parser library with an emphasis on usability

    v0.1.0 #lexer-tokenizer #lexer #tokenize #tokenizer
  103. scnr2_macro

    Scanner/Lexer with regex patterns and multiple modes

    v0.5.0 1.3K #lexer-tokenizer #lexer #tokenizer