-
fancy-regex
regexes, supporting a relatively rich set of features, including backreferences and look-around
-
stop-words
Common stop words in many languages
-
whatlang
Fast and lightweight language identification library for Rust
-
apalis-cron
extensible library for scheduling recurring tasks in Rust
-
textsurf
Webservice for efficiently serving multiple plain text documents or excerpts thereof (by Unicode character offset), without loading everything into memory
-
markdown_timesheet
processing markdown files to extract and format timesheet data
-
google-language1
A complete library to interact with Cloud Natural Language (protocol v1)
-
qdrant-rust-stemmers
some popular snowball stemming algorithms
-
google-language1_beta1
A complete library to interact with Cloud Natural Language (protocol v1beta1)
-
deformat
Extract plain text from HTML, PDF, and other document formats
-
ripopt
A memory-safe interior point optimizer in Rust
-
kalosm-sample
A common interface for token sampling and helpers for structured LLM sampling
-
ck-embed
Text embedding providers for ck semantic search
-
english-to-cron
converts natural language into cron expressions
-
computer-says-no
Local embedding service for text classification using ONNX models
-
axonml
A complete ML/AI framework in pure Rust - PyTorch-equivalent functionality
-
astorion
A Duckling-inspired, rule-based entity parsing engine in Rust, designed for extensible time and numeral parsing using a saturation-style pipeline
-
gline-rs
Inference engine for GLiNER models
-
popsam-cli
CLI for AI-assisted selection of semantically representative texts
-
mmd-mpl
MPL is a rule-based Domain-Specific Language for creating MMD poses and animations using natural semantic syntax
-
two_timer
parser for English time expressions
-
you
Translate your natural language into executable command(s)
-
bareun_rs
an unofficial Rust library for Bareun, a Korean morphological analyzer
-
crfsuite-compliant-rs
Pure Rust implementation of CRFsuite (Conditional Random Fields) for labeling sequential data
-
sastrawi-rs
High-performance Indonesian stemmer (Nazief-Adriani + ECS). Zero-regex, FST-powered, Rust 2024.
-
wicket
Wikipedia corpus knowledge extractor
-
wikiext
extracting and processing Wikipedia data, implemented in Rust
-
riptoken
Fast BPE tokenizer for LLMs — a faster, drop-in compatible reimplementation of tiktoken
-
kiwi-rs
Ergonomic Rust bindings for the Kiwi Korean morphological analyzer C API
-
rustling
A blazingly fast library for computational linguistics
-
nattydate
Lightweight, deterministic natural language date/time preprocessor — no ML, no clock fragility
-
tokie
Blazingly fast tokenizer - 50x faster tokenization, 10x smaller model files, 100% accurate drop-in replacement for HuggingFace
-
wordvec
A compact SmallVec<T>-like container with only align_of::<T>() overhead for small stack-only instances
-
bm25_turbo
The fastest BM25 information retrieval engine — 28K QPS on 8.8M docs
-
writing-analysis
Lightweight writing analysis and NLP tools for Rust
-
haqumei-cli
Command-line interface for the Haqumei G2P (Grapheme-to-Phoneme) engine
-
clockwords
Find and resolve natural-language time expressions across multiple languages
-
agentzero-plugin-sdk
Plugin SDK for building AgentZero WASM plugins
-
haqumei
Japanese Grapheme-to-Phoneme (G2P) library implemented in Rust
-
phonetik
Phonetic analysis engine for English. Rhyme detection, stress scanning, meter analysis, and syllable counting with a 126K-word embedded dictionary.
-
cronify
convert natural language time expressions into cron syntax
-
normy
Ultra-fast, zero-copy text normalization for Rust NLP pipelines & tokenizers
-
corpa
The ripgrep of text analysis. Blazing-fast CLI for corpus-level NLP statistics.
-
model2vec-rs
Official Rust Implementation of Model2Vec
-
pretokie
Fast, zero-allocation pretokenizers for BPE tokenizers
-
mongodb-voyageai
A client for generating embeddings and reranking with Voyage AI
-
wikipedia-article-transform
Transform Wikipedia articles in HTML to plaintext and markdown formats
-
opencc-jieba-rs
High-performance Chinese text conversion and segmentation using Jieba and OpenCC-style dictionaries
-
todoist-api-rs
Todoist API client library
-
langextract-rust
extracting structured and grounded information from text using LLMs
-
anno
Named entity recognition, coreference resolution, and zero-shot entity types
-
kiru
Fast text chunking for Rust
-
yaak
Translate natural language to bash commands using an OpenAI-compatible LLM
-
chunk
The fastest semantic text chunking library — up to 1TB/s chunking throughput
-
embellama
High-performance Rust library for generating text embeddings using llama-cpp
-
idoit
AI-powered command line simplifier — do it!
-
jon
Natural language interface for Joy and Jot - CLI, TUI, and desktop app
-
hypembed
Pure-Rust BERT-compatible text embedding inference for local-first applications
-
unimorph
Command-line interface for UniMorph morphological data
-
aprender-rag
Pure-Rust Retrieval-Augmented Generation pipeline built on Trueno
-
duckling
port of Facebook's Duckling library for parsing natural language into structured data
-
gibberish-or-not
Figure out if text is gibberish or not
-
embedd
Embedding interfaces + local backends (Candle/HF)
-
opencc-fmmseg
High-performance Chinese conversion library (Simplified ↔ Traditional) using OpenCC lexicons and FMM segmentation — no runtime I/O, cross-platform, and production-ready
-
kawat
Web content extraction library inspired by trafilatura. Extracts main text, metadata, and comments from HTML.
-
wicket-cli
Wikipedia corpus knowledge extractor
-
sai-cli
('sai') — Tell the shell what you want, not how to do it. Natural-language to safe shell command generator.
-
textalyzer
Analyze key metrics like number of words, readability, and complexity of any kind of text
-
unimorph-cli
Command-line interface for UniMorph morphological data
-
intentdb
Schema-free, natural language storage engine
-
markovify-rs
A fast, extensible Rust implementation of a Markov chain text generator, inspired by markovify
-
rosetta-aisp
Bidirectional prose ↔ AISP symbolic notation conversion based on the Rosetta Stone mappings
-
udpipe-rs
Rust bindings for UDPipe - a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files
-
lingua-latvian-language-model
The Latvian language model for Lingua, an accurate natural language detection library
-
ynab-mcp
Model Context Protocol server for YNAB (You Need A Budget)
-
whichtime-cli
Command-line interface for parsing natural language dates
-
pii-masker
Rust port of the HydroXai PII masker with a library API and CLI
-
lingua-japanese-language-model
The Japanese language model for Lingua, an accurate natural language detection library
-
attuned-infer
Fast, transparent inference of human state axes from natural language
-
nanofts
High-performance full-text search engine in Rust
-
langmail
Email preprocessing for LLMs
-
lingua-swedish-language-model
The Swedish language model for Lingua, an accurate natural language detection library
-
isu
Information State Update theory, applicable in Issue-Based Dialogue Management and Conversational Agent Architecture
-
lingua-czech-language-model
The Czech language model for Lingua, an accurate natural language detection library
-
instant-segment
Fast English word segmentation
-
lingua-hindi-language-model
The Hindi language model for Lingua, an accurate natural language detection library
-
ticktickrs
A CLI Tool for TickTick tasks
-
lingua-irish-language-model
The Irish language model for Lingua, an accurate natural language detection library
-
cjclassifier
Classify ideograph text as Chinese Simplified, Chinese Traditional, or Japanese using a statistical model
-
lingua-bulgarian-language-model
The Bulgarian language model for Lingua, an accurate natural language detection library
-
langidentify-models-lite
Lite embedded model data for the langidentify language detection library
-
lingua-hungarian-language-model
The Hungarian language model for Lingua, an accurate natural language detection library
-
lingua-serbian-language-model
The Serbian language model for Lingua, an accurate natural language detection library
-
lingua-bosnian-language-model
The Bosnian language model for Lingua, an accurate natural language detection library
-
memchunk
The fastest semantic text chunking library — up to 1TB/s chunking throughput
-
lingua-tagalog-language-model
The Tagalog language model for Lingua, an accurate natural language detection library
-
wg-ragsmith
Semantic chunking and RAG utilities for document processing and retrieval-augmented generation
-
lingua-afrikaans-language-model
The Afrikaans language model for Lingua, an accurate natural language detection library
-
pgf2json
Application Programming Interface to load and interpret grammars compiled in Portable Grammar Format (PGF). The PGF format is produced as a final output from the GF compiler. The library…
-
lingua-thai-language-model
The Thai language model for Lingua, an accurate natural language detection library
-
rus-torch
A comprehensive deep learning framework in Rust, merging core, nn, vision, text, and wasm
-
lingua-tamil-language-model
The Tamil language model for Lingua, an accurate natural language detection library
-
lingua-yoruba-language-model
The Yoruba language model for Lingua, an accurate natural language detection library
-
lingua-maori-language-model
The Māori language model for Lingua, an accurate natural language detection library
-
wetext-rs
Text normalization library for TTS, Rust implementation of WeText
-
lingua-ganda-language-model
The Ganda language model for Lingua, an accurate natural language detection library
-
lingua-mongolian-language-model
The Mongolian language model for Lingua, an accurate natural language detection library
-
lingua-albanian-language-model
The Albanian language model for Lingua, an accurate natural language detection library
-
bm-25
BM25 embedder, scorer, and search engine
-
lingua-danish-language-model
The Danish language model for Lingua, an accurate natural language detection library
-
lingua-romanian-language-model
The Romanian language model for Lingua, an accurate natural language detection library
-
lingua-persian-language-model
The Persian language model for Lingua, an accurate natural language detection library
-
lingua-catalan-language-model
The Catalan language model for Lingua, an accurate natural language detection library
-
lingua-welsh-language-model
The Welsh language model for Lingua, an accurate natural language detection library
-
lingua-german-language-model
The German language model for Lingua, an accurate natural language detection library
-
lingua-portuguese-language-model
The Portuguese language model for Lingua, an accurate natural language detection library
-
lingua-icelandic-language-model
The Icelandic language model for Lingua, an accurate natural language detection library
-
lingua-french-language-model
The French language model for Lingua, an accurate natural language detection library
-
mecrab
A high-performance, thread-safe morphological analyzer compatible with MeCab, written in pure Rust
-
lingua-tswana-language-model
The Tswana language model for Lingua, an accurate natural language detection library
-
lingua-marathi-language-model
The Marathi language model for Lingua, an accurate natural language detection library
-
lingua-sotho-language-model
The Sotho language model for Lingua, an accurate natural language detection library
-
newsfresh
CLI and library for querying, filtering, and analyzing GDELT Global Knowledge Graph (GKG) v2.1 data — the world's largest open news event dataset
-
ai-translator
AI-based multilingual text translation tool with support for custom prompts
-
nodedb-document
Shared document engine (text analysis, BM25, inverted index) for NodeDB Origin and Lite
-
umsc
Uyghur multi-script converter for Arabic, Latin, Yengi, Cyrillic, XJUS, and Uzbek Latin scripts
-
mecha10-nodes-llm-command
Natural language command parsing via LLM APIs (OpenAI, Claude, Ollama)
-
wikiext-cli
Wikiext is a tool for extracting and processing Wikipedia data, implemented in Rust
-
trustformers
port of Hugging Face Transformers
-
llm-text
processing text for LLM consumption
-
ctranslate2-server
A high-performance inference server for CTranslate2 models, compatible with OpenAI's API
-
llm_utils
The best possible text chunker and text splitter and other text tools
-
mecab-ko
Korean morphological analyzer - pure Rust implementation of MeCab-Ko
-
lingua-turkish-language-model
The Turkish language model for Lingua, an accurate natural language detection library
-
cro_stem
A lightning-fast, zero-dependency Croatian stemming library written in Rust
-
natural
Pure Rust library for natural language processing
-
pii
PII detection and anonymization with deterministic, capability-aware NLP pipelines
-
langdetect-rs
Language detection in Rust. Port of Mimino666's langdetect.
-
reinfer-client
API client for Re:infer, the conversational data intelligence platform
-
wideword
Fast word-length bucketing for text documents using SIMD
-
ragrep
A fast, natural language code search tool
-
legalis
Command-line interface for Legalis-RS
-
pdfvec
High-performance PDF text extraction library for vectorization pipelines
-
lingua-kazakh-language-model
The Kazakh language model for Lingua, an accurate natural language detection library
-
a3s-cron
Cron scheduling library for A3S with natural language support
-
budouy
Rust port of BudouX with optional HTML processing and CLI
-
gitctx
MCP server for GitHub repository exploration
-
lingua-vietnamese-language-model
The Vietnamese language model for Lingua, an accurate natural language detection library
-
semantic-commands
A lightweight Rust framework for defining and executing semantic commands using text embeddings
-
date_time_parser
Rust NLP library for parsing English natural language into dates and times
-
fibpetokenizer
A blazing fast Byte Pair Encoding (BPE) tokenizer library with Python bindings
-
lingua-slovene-language-model
The Slovene language model for Lingua, an accurate natural language detection library
-
wordcutw
A C-interface wrapper for Wordcut - a Lao/Thai word segmentation/breaking library
-
lingua-slovak-language-model
The Slovak language model for Lingua, an accurate natural language detection library
-
commit_crafter
AI-powered tool for generating Git commit messages
-
ds-r1-rs
A DeepSeek R1-inspired reasoning model prototype in Rust
-
bayesian
A naive Bayesian classifier with optional TF-IDF support
-
mecab-ko-hangul
Hangul processing utilities - jamo decomposition/composition, syllable handling, normalization
-
fast-bpe-rs
Fast Byte Pair Encoding (BPE) tokenizer with Python bindings powered by PyO3
-
flerp
CLI tool that does XYZ
-
edgebert
Fast local text embeddings library for Rust and WASM for BERT inference on native and edge devices with no dependencies
-
hy-mt
A lightweight machine translation inference library for Tencent Hunyuan MT models
-
embedcache
High-performance text embedding service with caching capabilities
-
cali
A terminal calculator with real-time evaluation, unit conversions, and natural language expressions
-
nysiis
A fast NYSIIS (New York State Identification and Intelligence System) phonetic encoding library
-
avila-tokenizers
The most complete tokenizer library in Rust - BPE, WordPiece, Unigram, with native support for GPT, BERT, Llama, Claude
-
thulp-query
Query engine for searching and filtering thulp tools
-
waken_snowball
Snowball stemming algorithms for 33 languages
-
anno-metrics
Shared evaluation/analysis primitives for anno (metrics + cluster encoders)
-
natural-date-rs
A parser to convert natural language date and time specifications into DateTime
-
kizame
(刻め!) - CLI for MeCrab morphological analyzer and data pipeline
-
tessera-embeddings
Multi-paradigm embedding library: ColBERT, dense, sparse, vision-language, and time series models
-
cairn-extract
Rule-based claim extraction from markdown with confidence scoring
-
sdaas-rs
Official Rust SDK for SDaaS — Semantic Delta as a Service
-
treebender
An HDPSG-inspired symbolic NLP library for Rust
-
langid-rs
A fast and lightweight language identification library in Rust, inspired by py3langid
-
slabs
Text chunking for RAG: fixed, sentence, recursive, and semantic strategies
-
ayumu
A small, lightweight, user-oriented query language for search forms
-
jon-cli
Natural language interface for Joy and Jot - CLI for the Joyint ecosystem
-
mathsys
The Natural Language of Math
-
whichtime
High-level Rust API for natural language date parsing
-
xase-sidecar
XASE AI Sidecar: high-performance evidence and data processing sidecar (audio/image/DICOM/NLP) with S3, Redis, JWT auth, and metrics
-
intent-gen
Natural language to IntentLang spec generation via LLM (Layer 0)
-
lingua-belarusian-language-model
The Belarusian language model for Lingua, an accurate natural language detection library
-
popsam-py
Python extension crate for AI-assisted selection of semantically representative texts
-
kalosm-learning
A simplified machine learning library for building off of pretrained models
-
whichtime-sys
Lower-level parsing engine for natural language date parsing
-
popsam-core
Core library for AI-assisted selection of semantically representative texts
-
tekken-rs
Mistral Tekken tokenizer with audio support
-
vn-nlp
Vietnamese NLP library — tokenization, normalization, segmentation
-
almanaculum
Core types and traits for analysis
-
legalis-llm
LLM integration layer for Legalis-RS
-
mecab-ko-dict-builder
Korean morphological dictionary builder - generates binary dictionaries from CSV
-
remindee-parser
Natural language reminder parser for remindee-bot
-
mecab-ko-dict-validator
Korean morphological dictionary validation tool - CSV format validation, part-of-speech tagset checks
-
vader_sentiment
Bindings for Rust from the original Python VaderSentiment analysis tool
-
mecab-ko-dict
Korean morphological dictionary management - binary format, FST lookup, connection costs
-
kalosm-model-types
Shared types for Kalosm models
-
rs-jptxt2tokens
wrapper to convert Japanese text to tokens
-
nlcep
parsing natural language calendar events
-
nlsd
Natural Language Structured Documents
-
langidentify-models-lite-a
Lite embedded model data (part A: European Latin) for the langidentify language detection library
-
langidentify-models-lite-b
Lite embedded model data (part B: other scripts) for the langidentify language detection library
-
reggy
friendly, resumable regular expressions for text analytics
-
unitoken
Fast BPE tokenizer/trainer with a Rust core and Python bindings
-
flash_rerank
Core reranking engine — cross-encoder and ColBERT inference via ONNX Runtime
-
amdm
Rust client for amdm.ru with Russian lyrics stress marking and meter analysis
-
byteforge
A next-generation byte-level transformer with multi-signal patching and SIMD optimization
-
aistack
Functional text-to-function AI utilities
-
rust-chatgpt
OpenAI API Client for Rust
-
ragegun
Performs lexica-based analysis on text (e.g. age, gender, PERMA, OCEAN personality traits)
-
rusty-llm-jury
CLI tool for estimating success rates when using LLM judges for evaluation
-
rust_readability
A package to assess the complexity of texts using a variety of readability formulas
-
vader-sentimental
A faster Rust version of the original Python VaderSentiment analysis tool
-
vn-nlp-tokenize
Vietnamese tokenization algorithms for vn-nlp
-
sisu
working with SISU (Statecharts-based implementation of Information State Update)
-
str-distance
Distance metrics to evaluate distances between strings
-
oxur-lang
Oxur language processing: parser, expander, and Core Forms IR
-
rsnltk
Rust-based Natural Language Toolkit
-
vn-nlp-segment
Vietnamese sentence segmentation for vn-nlp
-
vn-nlp-normalize
Vietnamese text normalization — diacritics, unicode NFC/NFD
-
qtransformers-core
Quantum-inspired attention mechanisms for transformer models
-
llm-shield-nlp
Natural language processing utilities for LLM Shield
-
geocoder_nlp
Rust bindings for geocoder-nlp
-
stylometry-analyzer
Minimal CLI tool that combines one or more .txt files, extracts user-authored text, and enforces a minimum size. Hash-embeds text chunks and queries a local vector DB to classify writing style…
-
mecrab-word2vec
High-performance Word2Vec implementation with Hogwild! parallelization for MeCrab
-
libtqsm
Sentence segmenter that supports ~300 languages
-
edge-transformers
wrapper over ONNXRuntime that implements Huggingface's Optimum pipelines for inference and generates bindings for C# and C
-
sbert
Sentence Bert (SBert)
-
gematria_rs
Gematria, a traditional Hebrew numerology system
-
chrono-english
parses simple English dates, inspired by Linux date command
-
symbol-map
Memory-efficient mapping from values to integer identifiers (AKA a lexicon or symbol table), with options for fast bidirectional lookup
-
wikidump
parsing Mediawiki XML dumps
-
intent-classifier
A flexible few-shot intent classification library for natural language processing
-
mecrab-builder
Semantic dictionary builder for MeCrab - Wikidata/Wikipedia pipeline
-
sagacity
A Rust-based project for conversing with your codebase and handling codebase contextualization
-
qsv_vader_sentiment_analysis
Bindings for Rust from the original Python VaderSentiment analysis tool. Forked for use with qsv.
-
deepfrog
A deep learning NLP suite (PoS, lemmatiser, NER) with FoLiA XML support
-
when
'When' parses natural language date/time and produces computer friendly output structures
-
cp-embeddings
Local embedding generation using GTE-Qwen2-1.5B-instruct via Candle — private, on-device AI inference
-
event_parser
Rust NLP library for parsing English natural language into icalendar events
-
zoea
by and for baby Rustaceans. It contains 'easy' buttons for common things like http get requests, key-value database persistence, and Natural Language Processing.
-
mcprs
Model Context Protocol for Rust - a unified library for communicating with different LLMs and AI APIs
-
timewarp
NLP library for parsing English and German natural language into dates and times
-
temporis
Parse natural date strings into valid dates
-
langram_train
Langram train models