53 releases (30 stable)
Uses new Rust 2024
| new 4.3.4 | Feb 16, 2026 |
|---|---|
| 4.3.3 | Feb 14, 2026 |
| 4.2.6 | Jan 31, 2026 |
| 4.0.0-rc.23 | Dec 30, 2025 |
#481 in Text processing
31 downloads per month
3.5MB
71K
SLoC
kreuzberg-cli
Command-line interface for the Kreuzberg document intelligence library.
Overview
This crate provides a production-ready CLI tool for document extraction, MIME type detection, batch processing, and cache management. It exposes the core extraction capabilities of the Kreuzberg Rust library through an easy-to-use command-line interface.
The CLI supports 75 file formats including PDF, DOCX, PPTX, XLSX, images, HTML, and more, with optional OCR support for scanned documents.
Architecture
Binary Structure
Kreuzberg Core Library (crates/kreuzberg)
↓
Kreuzberg CLI (crates/kreuzberg-cli) ← This crate
↓
Command-line interface with configuration and caching
Features
- Extract Command: Extract text, tables, and metadata from single documents
- Batch Command: Process multiple documents in parallel with optimized concurrency
- Detect Command: Identify MIME type of any file with magic byte analysis
- Cache Commands: Manage extraction result cache (stats, clear)
- Serve Command (requires
apifeature): Start REST API server for remote document processing - MCP Command (requires
mcpfeature): Start Model Context Protocol server for AI integration - Version Command: Display version information in text or JSON format
- Configuration: TOML, YAML, or JSON config files with auto-discovery
Platform Support
The CLI is tested and officially supported on:
- ✅ Linux x86_64 (glibc and musl static)
- ✅ Linux aarch64 / ARM64 (glibc and musl static)
- ✅ macOS aarch64 (Apple Silicon)
- ✅ Windows x86_64
All platforms receive precompiled binaries through GitHub releases. Linux musl binaries are fully statically linked with zero runtime dependencies.
Installation
Install Script (Linux / macOS)
curl -fsSL https://raw.githubusercontent.com/kreuzberg-dev/kreuzberg/main/scripts/install.sh | bash
Homebrew
brew install kreuzberg-dev/tap/kreuzberg
Cargo
cargo install kreuzberg-cli
Docker
docker pull ghcr.io/kreuzberg-dev/kreuzberg-cli:latest
docker run -v $(pwd):/data ghcr.io/kreuzberg-dev/kreuzberg-cli:latest extract /data/document.pdf
From Source
cargo install --path crates/kreuzberg-cli
Or via the workspace:
cargo build --release -p kreuzberg-cli
Platform-Specific Requirements
ONNX Runtime (for embeddings)
If using embeddings functionality, ONNX Runtime must be installed:
# macOS
brew install onnxruntime
# Ubuntu/Debian
sudo apt install libonnxruntime libonnxruntime-dev
# Windows (MSVC)
scoop install onnxruntime
# OR download from https://github.com/microsoft/onnxruntime/releases
Without ONNX Runtime, embeddings will raise MissingDependencyError with installation instructions.
OCR Support (Optional)
To enable optical character recognition for scanned documents:
- macOS:
brew install tesseract - Ubuntu/Debian:
sudo apt-get install tesseract-ocr - Windows: Download from tesseract-ocr/tesseract
Quick Start
The CLI is available for Linux (x86_64/aarch64), macOS (Apple Silicon), and Windows with consistent behavior across all platforms.
Basic Text Extraction
# Extract text from a PDF
kreuzberg extract document.pdf
# Extract with JSON output
kreuzberg extract document.pdf --format json
Extract with OCR
# Enable OCR for scanned documents
kreuzberg extract scanned.pdf --ocr true
# Force OCR even if text extraction succeeds
kreuzberg extract mixed.pdf --force-ocr true
Batch Processing
# Process multiple documents in parallel
kreuzberg batch *.pdf --format json
# Process with custom configuration
kreuzberg batch documents/*.docx --config config.toml --format json
MIME Type Detection
# Detect file type
kreuzberg detect unknown-file
# JSON output
kreuzberg detect unknown-file --format json
Cache Management
# View cache statistics
kreuzberg cache stats
# Clear the cache
kreuzberg cache clear --cache-dir /path/to/cache
# Custom cache directory
kreuzberg cache stats --cache-dir ~/.kreuzberg-cache
API Server (with api feature)
# Start API server on localhost:8000
kreuzberg serve
# Custom host and port
kreuzberg serve --host 0.0.0.0 --port 3000
# With configuration file
kreuzberg serve --config kreuzberg.toml --host 127.0.0.1 --port 8080
MCP Server (with mcp feature)
# Start Model Context Protocol server
kreuzberg mcp
# With configuration file
kreuzberg mcp --config kreuzberg.toml
Configuration
The CLI supports configuration files in TOML, YAML, or JSON formats. Configuration can be:
- Explicit: Passed via
--config /path/to/config.{toml,yaml,json} - Auto-discovered: Searches for
kreuzberg.{toml,yaml,json}in current and parent directories - Default: Uses built-in defaults if no config found
Example Configuration (TOML)
# Basic extraction settings
use_cache = true
enable_quality_processing = true
force_ocr = false
# OCR configuration
[ocr]
backend = "tesseract"
language = "eng"
[ocr.tesseract_config]
enable_table_detection = true
psm = 6
min_confidence = 50.0
# Text chunking (useful for LLM processing)
[chunking]
max_chars = 1000
max_overlap = 200
# PDF-specific options
[pdf_options]
extract_images = true
extract_metadata = true
passwords = []
# Language detection
[language_detection]
enabled = true
min_confidence = 0.8
detect_multiple = false
# Image extraction
[images]
extract_images = true
target_dpi = 300
max_image_dimension = 4096
auto_adjust_dpi = true
Configuration Overrides
Command-line flags override configuration file settings:
# Override OCR setting from config
kreuzberg extract document.pdf --config config.toml --ocr false
# Override chunking settings
kreuzberg extract long.pdf --chunk true --chunk-size 2000 --chunk-overlap 400
# Disable cache despite config file
kreuzberg extract document.pdf --no-cache true
# Enable language detection
kreuzberg extract multilingual.pdf --detect-language true
Command Reference
extract
Extract text, tables, and metadata from a document.
kreuzberg extract <PATH> [OPTIONS]
Options:
--config <PATH>: Configuration file (TOML, YAML, or JSON)--mime-type <TYPE>: MIME type hint (auto-detected if not provided)--format <FORMAT>: Output format (textorjson), default:text--ocr <true|false>: Enable/disable OCR--force-ocr <true|false>: Force OCR even if text extraction succeeds--no-cache <true|false>: Disable result caching--chunk <true|false>: Enable text chunking--chunk-size <SIZE>: Chunk size in characters (default: 1000)--chunk-overlap <SIZE>: Overlap between chunks (default: 200)--quality <true|false>: Enable quality processing--detect-language <true|false>: Enable language detection
Examples:
# Simple extraction
kreuzberg extract invoice.pdf
# With configuration and JSON output
kreuzberg extract document.pdf --config config.toml --format json
# With chunking for LLM processing
kreuzberg extract report.pdf --chunk true --chunk-size 2000
# With OCR for scanned document
kreuzberg extract scanned.pdf --ocr true --format json
batch
Process multiple documents in parallel.
kreuzberg batch <PATHS>... [OPTIONS]
Options:
--config <PATH>: Configuration file (TOML, YAML, or JSON)--format <FORMAT>: Output format (textorjson), default:json--ocr <true|false>: Enable/disable OCR--force-ocr <true|false>: Force OCR even if text extraction succeeds--no-cache <true|false>: Disable result caching--quality <true|false>: Enable quality processing
Examples:
# Batch process multiple files
kreuzberg batch doc1.pdf doc2.docx doc3.xlsx
# With glob patterns
kreuzberg batch *.pdf *.docx
# With custom configuration
kreuzberg batch documents/* --config batch-config.toml --format json
# With OCR
kreuzberg batch scanned/*.pdf --ocr true --format json
detect
Identify the MIME type of a file.
kreuzberg detect <PATH> [OPTIONS]
Options:
--format <FORMAT>: Output format (textorjson), default:text
Examples:
# Simple detection
kreuzberg detect unknown-file
# JSON output
kreuzberg detect mystery.bin --format json
cache
Manage extraction result cache.
kreuzberg cache <COMMAND> [OPTIONS]
Subcommands:
stats
Show cache statistics.
kreuzberg cache stats [--cache-dir <DIR>] [--format <FORMAT>]
Options:
--cache-dir <DIR>: Cache directory (default:.kreuzbergin current directory)--format <FORMAT>: Output format (textorjson), default:text
clear
Clear the cache.
kreuzberg cache clear [--cache-dir <DIR>] [--format <FORMAT>]
Options:
--cache-dir <DIR>: Cache directory (default:.kreuzbergin current directory)--format <FORMAT>: Output format (textorjson), default:text
Examples:
# View cache statistics
kreuzberg cache stats
# Clear cache with custom directory
kreuzberg cache clear --cache-dir ~/.kreuzberg-cache
# JSON output
kreuzberg cache stats --format json
serve (requires api feature)
Start the REST API server.
kreuzberg serve [OPTIONS]
Options:
--host <HOST>: Host to bind to (default:127.0.0.1)--port <PORT>: Port to bind to (default:8000)--config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Default: localhost:8000
kreuzberg serve
# Public access on port 3000
kreuzberg serve --host 0.0.0.0 --port 3000
# With custom configuration
kreuzberg serve --config server-config.toml --port 8080
mcp (requires mcp feature)
Start the Model Context Protocol server.
kreuzberg mcp [OPTIONS]
Options:
--config <PATH>: Configuration file (TOML, YAML, or JSON)
Examples:
# Start MCP server
kreuzberg mcp
# With custom configuration
kreuzberg mcp --config mcp-config.toml
version
Show version information.
kreuzberg version [--format <FORMAT>]
Options:
--format <FORMAT>: Output format (textorjson), default:text
Examples:
# Display version
kreuzberg version
# JSON output
kreuzberg version --format json
Output Formats
Text Format
The default human-readable format:
kreuzberg extract document.pdf
# Output:
# Document content here...
JSON Format
For programmatic integration:
kreuzberg extract document.pdf --format json
# Output:
# {
# "content": "Document content...",
# "mime_type": "application/pdf",
# "metadata": { "title": "...", "author": "..." },
# "tables": [{ "markdown": "...", "cells": [...], "page_number": 0 }]
# }
Supported File Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODP, ODS, RTF |
| Images | PNG, JPEG, JPG, WEBP, BMP, TIFF, GIF |
| Web | HTML, XHTML, XML |
| Text | TXT, MD, CSV, TSV, JSON, YAML, TOML |
| EML, MSG | |
| Archives | ZIP, TAR, 7Z |
| Other | 30+ additional formats |
Exit Codes
0: Successful executionNon-zero: Error occurred (check stderr for details)
Logging
Control logging verbosity with the RUST_LOG environment variable:
# Show info-level logs (default)
RUST_LOG=info kreuzberg extract document.pdf
# Show detailed debug logs
RUST_LOG=debug kreuzberg extract document.pdf
# Show only warnings and errors
RUST_LOG=warn kreuzberg extract document.pdf
# Suppress all logs
RUST_LOG=error kreuzberg extract document.pdf
# Show logs from specific modules
RUST_LOG=kreuzberg=debug kreuzberg extract document.pdf
Performance Tips
-
Use batch processing for multiple files instead of sequential extraction:
kreuzberg batch *.pdf # Parallel processing -
Enable caching to avoid reprocessing the same documents:
# Cache is enabled by default kreuzberg extract document.pdf -
Use appropriate chunk sizes for LLM processing:
kreuzberg extract long.pdf --chunk true --chunk-size 2000 -
Tune OCR settings for better performance:
kreuzberg extract scanned.pdf --ocr true # Adjust tesseract_config in configuration file for optimization -
Monitor cache size and clear when needed:
kreuzberg cache stats kreuzberg cache clear
Features
Default Features
None by default. The binary includes core extraction.
Optional Features
api: Enable the REST API server (kreuzberg servecommand)mcp: Enable Model Context Protocol server (kreuzberg mcpcommand)all: Enable all features (api+mcp)
Building with Features
# Build with all features
cargo build --release -p kreuzberg-cli --features all
# Build with specific features
cargo build --release -p kreuzberg-cli --features api,mcp
Troubleshooting
File Not Found Error
Ensure the file path is correct and the file is readable:
# Check if file exists
ls -l /path/to/document.pdf
# Try with absolute path
kreuzberg extract /absolute/path/to/document.pdf
OCR Not Working
Verify Tesseract is installed:
tesseract --version
# If not found:
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: Download from https://github.com/tesseract-ocr/tesseract
Configuration File Not Found
Check that the configuration file has the correct format and location:
# Use explicit path
kreuzberg extract document.pdf --config /absolute/path/to/config.toml
# Or place kreuzberg.toml in current directory
ls -l kreuzberg.toml
Out of Memory with Large Files
Use chunking to reduce memory usage:
kreuzberg extract large-document.pdf --chunk true --chunk-size 1000
Cache Directory Permissions
Ensure write access to the cache directory:
# Check permissions
ls -ld .kreuzberg
# Or use a custom directory with appropriate permissions
kreuzberg extract document.pdf --config config.toml
# In config.toml: cache_dir = "/tmp/kreuzberg-cache"
Key Files
src/main.rs: CLI implementation with command definitions and argument parsingCargo.toml: Package metadata and dependencies
Building
Development Build
cargo build -p kreuzberg-cli
Release Build
cargo build --release -p kreuzberg-cli
With All Features
cargo build --release -p kreuzberg-cli --features all
Testing
# Run CLI tests
cargo test -p kreuzberg-cli
# With logging
RUST_LOG=debug cargo test -p kreuzberg-cli -- --nocapture
Performance Characteristics
- Single file extraction: Typically 10-100ms depending on file size and format
- Batch processing: Near-linear scaling with 8 concurrent extractions by default
- OCR processing: 100-500ms per page depending on image quality and language
- Caching: Sub-millisecond retrieval for cached results
References
- Kreuzberg Core:
../kreuzberg/ - Main Documentation: https://docs.kreuzberg.dev
- GitHub Repository: https://github.com/kreuzberg-dev/kreuzberg
- Configuration Guide: See example configuration sections above
Contributing
We welcome contributions! Please see the main Kreuzberg repository for contribution guidelines.
License
MIT
Dependencies
~20–54MB
~839K SLoC