llm-tokenizer
Overview
The llm-tokenizer crate exposes a single Tokenizer facade around multiple backends
(Hugging Face JSON tokenizers, OpenAI/tiktoken models, and an in-memory mock). It packages the
shared behaviours needed by LLM applications—encoding user text, incrementally decoding streamed tokens,
tracking per-request state, and detecting stop conditions—behind trait objects so consuming code
can remain backend-agnostic.
Key capabilities:
- trait-based split between `Encoder`, `Decoder`, and `Tokenizer` for shared APIs across backends
- Hugging Face tokenizer loading (with optional chat templates) and HF Hub downloads
- heuristic selection of OpenAI/tiktoken encodings for GPT model names
- incremental decoding utilities (`DecodeStream`, `Sequence`) that handle UTF-8 boundaries
- stop-sequence handling via `StopSequenceDecoder` with token-level and string-level triggers
- optional Jinja2 chat-template rendering that matches Hugging Face semantics
The implementation deliberately keeps the surface area small—metrics, batching, or SentencePiece
support mentioned in earlier drafts do not exist today. This document reflects the actual code
as of tokenizer/src/*.
Source Map
- `lib.rs` – module exports and the `Tokenizer` wrapper around `Arc<dyn Tokenizer>`
- `traits.rs` – shared traits and the `Encoding`/`SpecialTokens` helper types
- `factory.rs` – backend discovery, file/model heuristics, and tokio-aware creation helpers
- `hub.rs` – Hugging Face Hub downloads via `hf_hub`
- `huggingface.rs` – wrapper over `tokenizers::Tokenizer`, chat template loading, vocab access
- `tiktoken.rs` – wrapper over `tiktoken-rs` encoders for OpenAI model families
- `chat_template.rs` – AST-driven Jinja template inspection and rendering utilities
- `sequence.rs` – stateful incremental decoding helper used by router sequences
- `stream.rs` – stateless streaming decoder that yields textual chunks from token streams
- `stop.rs` – stop-sequence detection with "jail" buffering and a builder API
- `mock.rs` – lightweight tokenizer used by unit tests
- `tests.rs` – smoke tests covering the trait facade and helpers (largely with the mock backend)
- `cache/` – multi-level caching infrastructure (L0 in-memory, L1 prefix-based)
Core Traits and Types (traits.rs)
- The `Encoder`, `Decoder`, and `Tokenizer` traits stay `Send + Sync` so instances can be shared across threads. Concrete backends implement the minimal methods: `encode`, `encode_batch`, `decode`, `vocab_size`, special-token lookup, and optional token↔id conversions.
- `Encoding` wraps backend-specific results: `Hf` holds the Hugging Face encoding object, `Sp` is a plain ID vector reserved for future SentencePiece support, and `Tiktoken` stores u32 IDs from `tiktoken-rs`. `Encoding::token_ids()` is the zero-copy accessor used everywhere (see the sketch below).
- `SpecialTokens` collects optional BOS/EOS/etc. markers so upstream code can make backend-agnostic decisions.
- `Tokenizer` (in `lib.rs`) is a thin `Arc<dyn Tokenizer>` newtype that exposes convenience methods (`encode`, `decode`, `decode_stream`, etc.) while keeping cloning cheap.
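For orientation, here is a minimal sketch of the `Encoding` shape described above. It is simplified: the real `Hf` variant wraps the `tokenizers` crate's encoding object rather than a bare vector, and the definitions in `traits.rs` may differ in detail.

```rust
// Simplified sketch of Encoding; the actual definitions live in traits.rs.
pub enum Encoding {
    Hf(Vec<u32>),       // stand-in for the Hugging Face encoding object
    Sp(Vec<u32>),       // plain ID vector reserved for future SentencePiece support
    Tiktoken(Vec<u32>), // u32 IDs from tiktoken-rs
}

impl Encoding {
    // Zero-copy accessor used throughout the crate.
    pub fn token_ids(&self) -> &[u32] {
        match self {
            Encoding::Hf(ids) | Encoding::Sp(ids) | Encoding::Tiktoken(ids) => ids,
        }
    }
}
```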
Backend Implementations
HuggingFaceTokenizer (huggingface.rs)
- Loads `tokenizer.json` (or similar) using `tokenizers::Tokenizer::from_file`.
- Caches forward and reverse vocab maps for `token_to_id`/`id_to_token` support (see the sketch below).
- Extracts special tokens using common patterns (e.g. `<s>`, `[CLS]`).
- Supports optional chat templates: either auto-discovered next to the tokenizer via `tokenizer_config.json` or overridable with an explicit template path.
- Exposes `apply_chat_template`, which renders a minijinja template given JSON message payloads and template parameters.
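The cached lookup maps are simple to picture. A sketch of the idea, with illustrative names rather than the struct's actual fields:

```rust
use std::collections::HashMap;

// Build forward and reverse vocab maps once at load time so that
// token_to_id / id_to_token become O(1) lookups afterwards.
fn build_vocab_maps(
    vocab: HashMap<String, u32>,
) -> (HashMap<String, u32>, HashMap<u32, String>) {
    let reverse: HashMap<u32, String> =
        vocab.iter().map(|(tok, id)| (*id, tok.clone())).collect();
    (vocab, reverse)
}
```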
TiktokenTokenizer (tiktoken.rs)
- Wraps the `tiktoken-rs` `CoreBPE` builders (`cl100k_base`, `p50k_base`, `p50k_edit`, `r50k_base`).
- `from_model_name` heuristically maps OpenAI model IDs (e.g. `gpt-4`, `text-davinci-003`) to those bases (see the sketch below). Unknown model names return an error rather than silently defaulting.
- Implements encode/decode operations; batch encoding simply iterates sequentially.
- Provides approximate vocab sizes and common GPT special tokens. Direct token↔id lookup is not implemented because the underlying library does not expose that mapping.
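The model-name heuristic amounts to prefix and substring matching. A sketch under the rules above; the real match arms live in `tiktoken.rs` and may cover more model families:

```rust
use anyhow::bail;
use tiktoken_rs::CoreBPE;

// Map an OpenAI model ID onto one of the bundled BPE bases.
// Unknown names are an error rather than a silent default.
fn core_bpe_for_model(model: &str) -> anyhow::Result<CoreBPE> {
    if model.starts_with("gpt-4") || model.starts_with("gpt-3.5") {
        Ok(tiktoken_rs::cl100k_base()?)
    } else if model.starts_with("text-davinci") {
        Ok(tiktoken_rs::p50k_base()?)
    } else if ["davinci", "curie", "babbage", "ada"]
        .iter()
        .any(|m| model.contains(m))
    {
        Ok(tiktoken_rs::r50k_base()?)
    } else {
        bail!("unknown OpenAI model name: {model}")
    }
}
```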
MockTokenizer (mock.rs)
- Purely for tests; hard-codes a tiny vocabulary and simple whitespace tokenization.
- Implements the same trait surface so helpers can be exercised without pulling real tokenizer data.
Factory and Backend Discovery (factory.rs)
- `create_tokenizer{,_async}` accept either a filesystem path or a model identifier; the dispatch is sketched after this list.
- Paths are loaded directly; the file extension (or JSON autodetection) selects the backend.
- Strings that look like OpenAI model names (`gpt-*`, `davinci`, `curie`, `babbage`, `ada`) use `TiktokenTokenizer`.
- Everything else attempts a Hugging Face Hub download via `download_tokenizer_from_hf`.
- Chat templates can be injected with `create_tokenizer_with_chat_template`.
- Async creation uses `tokio` for network access. The blocking variant reuses or spins up a runtime when called from synchronous contexts.
- SentencePiece (`.model`) and GGUF files are detected but currently return a clear "not supported" error.
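Put together, discovery is a small dispatch. A hypothetical sketch of the rules above, returning a label rather than a constructed tokenizer for brevity:

```rust
use std::path::Path;

// Illustrative dispatcher mirroring the factory's discovery rules.
fn pick_backend(path_or_model: &str) -> &'static str {
    let p = Path::new(path_or_model);
    if p.exists() {
        match p.extension().and_then(|e| e.to_str()) {
            Some("json") => "huggingface (file)",
            Some("model") => "error: SentencePiece not supported",
            Some("gguf") => "error: GGUF not supported",
            _ => "huggingface (JSON autodetection)",
        }
    } else if path_or_model.starts_with("gpt-")
        || ["davinci", "curie", "babbage", "ada"]
            .iter()
            .any(|m| path_or_model.contains(m))
    {
        "tiktoken"
    } else {
        "huggingface (hub download)"
    }
}
```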
Hugging Face Hub Integration (hub.rs)
- Uses the async `hf_hub` API to list and download tokenizer-related files (`tokenizer.json`, `merges.txt`, `.model`, etc.), filtering out weights and docs.
- The helper returns the HF cache directory containing the fetched files; the factory then loads from disk using standard file paths.
- Honours the `HF_TOKEN` environment variable for private or rate-limited models. Without it, downloads may fail with an authorization error.
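For reference, fetching a single file with the public async `hf_hub` API looks roughly like this. This is a sketch, not the crate's `download_tokenizer_from_hf` helper, which lists and filters several files:

```rust
use hf_hub::api::tokio::ApiBuilder;

// Fetch tokenizer.json for a model, honouring HF_TOKEN if it is set.
async fn fetch_tokenizer_json(repo_id: &str) -> anyhow::Result<std::path::PathBuf> {
    let api = ApiBuilder::new()
        .with_token(std::env::var("HF_TOKEN").ok())
        .build()?;
    let path = api.model(repo_id.to_string()).get("tokenizer.json").await?;
    Ok(path)
}
```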
Chat Template Support (chat_template.rs)
- Detects whether a template expects raw string content or the structured OpenAI-style `content` list by walking the minijinja AST. This matches the Python-side detection logic used elsewhere in SGLang.
- `ChatTemplateProcessor` (constructed per call) renders templates against JSON `messages` and `ChatTemplateParams` (system prompt, tools, EOS token handling, etc.), as sketched below. Errors surface as `anyhow::Error`, keeping parity with Hugging Face error messages.
- The tokenizer wrapper stores both the template string and its detected content format so callers can pre-transform message content correctly.
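Rendering itself reduces to a minijinja environment call. A minimal sketch; the real `ChatTemplateProcessor` also wires in tools, documents, and EOS handling:

```rust
use minijinja::{context, Environment};

// Render a chat template against OpenAI-style message payloads.
fn render_chat(template: &str, messages: &[serde_json::Value]) -> anyhow::Result<String> {
    let mut env = Environment::new();
    env.add_template("chat", template)?;
    let rendered = env.get_template("chat")?.render(context! {
        messages => messages,
        add_generation_prompt => true,
    })?;
    Ok(rendered)
}
```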
Streaming and Stateful Helpers
DecodeStream (stream.rs)
- Maintains a sliding window (`prefix_offset`, `read_offset`) over accumulated token IDs.
- Each `step` decodes the known prefix and the new slice; when the new slice produces additional UTF-8 text (and does not end in the replacement character `�`), it returns the incremental chunk and updates the offsets. Otherwise it returns `None` and waits for more tokens. A simplified sketch follows this list.
- `step_batch` and `flush` offer convenience for batching and for draining remaining text.
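A simplified sketch of one step, with the backend decode abstracted as a closure; the crate's actual offset handling and error paths may differ:

```rust
// One incremental step: decode the old window and the extended window, and
// emit only the new suffix once it forms complete UTF-8 text.
fn step_sketch(
    decode: impl Fn(&[u32]) -> anyhow::Result<String>,
    ids: &[u32],
    prefix_offset: &mut usize,
    read_offset: &mut usize,
) -> anyhow::Result<Option<String>> {
    let prefix_text = decode(&ids[*prefix_offset..*read_offset])?;
    let new_text = decode(&ids[*prefix_offset..])?;
    // A trailing U+FFFD means the last token ends mid-codepoint: wait for more.
    if new_text.len() > prefix_text.len() && !new_text.ends_with('\u{FFFD}') {
        // Assumes new_text extends prefix_text, which holds for these backends.
        let chunk = new_text[prefix_text.len()..].to_string();
        *prefix_offset = *read_offset;
        *read_offset = ids.len();
        Ok(Some(chunk))
    } else {
        Ok(None)
    }
}
```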
Sequence (sequence.rs)
- Holds per-request decoding state: accumulated IDs plus offsets mirroring `DecodeStream`.
- `append_text` encodes extra prompt text; `append_token` decodes incremental output while respecting UTF-8 boundaries and replacing stray `�` characters (hypothetical usage sketched below).
- Designed for integration with router sequence management, where decoded text must be replayed.
StopSequenceDecoder (stop.rs)
- Extends the incremental decoding approach with a "jail" buffer that holds potential partial matches against configured stop sequences.
- Supports both token-level stops (visible or hidden) and arbitrary string sequences. When a string stop is configured, the decoder emits only the safe prefix and keeps a suffix jailed until it can decide whether it completes a stop sequence.
- Provides `StopSequenceDecoderBuilder` for ergonomic configuration and exposes `process_token`, `process_tokens`, `flush`, `reset`, and `is_stopped` helpers. A simplified sketch of the string-level matching follows.
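The jail can be pictured as splitting decoded text into a safe prefix and a held suffix. A simplified sketch of the string-level matching only; the real decoder also handles token-level stops and hidden versus visible output:

```rust
// Emit the longest prefix that cannot begin a stop sequence; jail the rest.
// Returns (emitted, jailed); an empty jail after a match means we stopped.
fn split_safe(jailed: &str, new_text: &str, stops: &[&str]) -> (String, String) {
    let combined = format!("{jailed}{new_text}");
    // If a stop sequence already occurs, everything before it is final output.
    for stop in stops {
        if let Some(pos) = combined.find(stop) {
            return (combined[..pos].to_string(), String::new());
        }
    }
    // Otherwise jail the longest tail that is a prefix of some stop sequence.
    let mut keep = combined.len();
    for start in (0..combined.len()).rev() {
        if !combined.is_char_boundary(start) {
            continue;
        }
        if stops.iter().any(|s| s.starts_with(&combined[start..])) {
            keep = start;
        }
    }
    (combined[..keep].to_string(), combined[keep..].to_string())
}
```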
Caching (cache/)
The caching subsystem provides multi-level caching for tokenizer results:
- `L0Cache`: in-memory LRU cache for exact-match token ID lookups
- `L1Cache`: prefix-based cache that can reuse partial encoding results
- `CachedTokenizer`: wrapper that adds caching to any tokenizer implementation (see the sketch below)
- `TokenizerFingerprint`: content-based fingerprinting for cache key generation
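Conceptually, the wrapper consults L0 before touching the backend. An illustrative sketch; a real `L0Cache` would also bound its size with LRU eviction:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Illustrative exact-match cache sitting in front of an encode function.
struct L0Sketch {
    map: Mutex<HashMap<String, Vec<u32>>>,
}

impl L0Sketch {
    fn encode_cached(
        &self,
        input: &str,
        encode: impl Fn(&str) -> anyhow::Result<Vec<u32>>,
    ) -> anyhow::Result<Vec<u32>> {
        if let Some(hit) = self.map.lock().unwrap().get(input) {
            return Ok(hit.clone()); // cache hit: skip the backend entirely
        }
        let ids = encode(input)?;
        self.map.lock().unwrap().insert(input.to_string(), ids.clone());
        Ok(ids)
    }
}
```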
Testing
- Unit tests cover the mock tokenizer, the `Tokenizer` wrapper, the incremental decoding helpers, and stop-sequence behaviour (`tests.rs`, `sequence.rs`, `stop.rs`, `tiktoken.rs`, `factory.rs`, `hub.rs`). Network-dependent Hugging Face downloads are exercised behind a best-effort async test that skips in CI when credentials are absent.
- Run the crate's test suite with `cargo test -p tokenizer`.
Known Limitations & Future Work
- SentencePiece (`.model`) and GGUF tokenizers are detected but deliberately unimplemented. `Encoding::Sp` exists for future SentencePiece support but currently behaves as a simple `Vec<u32>`.
- `TiktokenTokenizer` cannot map individual tokens/IDs; the underlying library would need to expose its vocabulary to implement `token_to_id`/`id_to_token`.
- There is no metrics or batching layer inside this module; the router records metrics elsewhere.
- Dynamic batching and sequence-pooling code that earlier READMEs mentioned never landed in Rust.
Usage Examples
```rust
use llm_tokenizer::{
    create_tokenizer, SequenceDecoderOutput, StopSequenceDecoderBuilder, Tokenizer,
};

// Load a tokenizer from disk (Hugging Face JSON)
let tokenizer = Tokenizer::from_file("/path/to/tokenizer.json")?;
let encoding = tokenizer.encode("Hello, world!", false)?;
assert!(!encoding.token_ids().is_empty());

// Auto-detect OpenAI GPT tokenizer
let openai = create_tokenizer("gpt-4")?;
let text = openai.decode(&[1, 2, 3], true)?;

// Incremental decoding with stop sequences; cloning the Tokenizer newtype is cheap.
let mut stream = tokenizer.decode_stream(&[], true);
let mut stop = StopSequenceDecoderBuilder::new(tokenizer.clone())
    .stop_sequence("\nHuman:")
    .build();
for &token in encoding.token_ids() {
    // The stream yields raw chunks; the stop decoder controls what is emitted.
    if let Some(_chunk) = stream.step(token)? {
        match stop.process_token(token)? {
            SequenceDecoderOutput::Text(t) => println!("{}", t),
            SequenceDecoderOutput::StoppedWithText(t) => {
                println!("{}", t);
                break;
            }
            SequenceDecoderOutput::Held | SequenceDecoderOutput::Stopped => {}
        }
    }
}
```

```rust
// Apply a chat template when one is bundled with the tokenizer
use llm_tokenizer::{chat_template::ChatTemplateParams, HuggingFaceTokenizer};

let mut hf = HuggingFaceTokenizer::from_file_with_chat_template(
    "./tokenizer.json",
    Some("./chat_template.jinja"),
)?;
let messages = vec![
    serde_json::json!({"role": "system", "content": "You are concise."}),
    serde_json::json!({"role": "user", "content": "Summarise Rust traits."}),
];
let prompt = hf.apply_chat_template(
    &messages,
    ChatTemplateParams {
        add_generation_prompt: true,
        continue_final_message: false,
        tools: None,
        documents: None,
        template_kwargs: None,
    },
)?;
```
Set `HF_TOKEN` in the environment if you need to download private models from the Hugging Face Hub.