Production-ready document processing CLI for RAG applications
Process documents, extract text with advanced OCR, chunk intelligently, and prepare data for RAG systems - all from the command line with ragctl.
ragctl is a command-line tool for processing documents into chunks ready for Retrieval-Augmented Generation (RAG) systems. It handles the dirty work of document ingestion, OCR, and intelligent chunking so you can focus on building your RAG application.
Key capabilities:
- Universal document loading (PDF, DOCX, images, HTML, Markdown, etc.)
- Advanced OCR with automatic fallback (EasyOCR → PaddleOCR → pytesseract)
- Intelligent semantic chunking using LangChain
- Production-ready batch processing with auto-retry
- Multiple export formats (JSON, JSONL, CSV)
- Direct ingestion into Qdrant vector store
- Supported formats: PDF, DOCX, ODT, TXT, HTML, Markdown, Images (JPEG, PNG)
- Smart OCR cascade:
  - EasyOCR (best quality, multi-language)
  - PaddleOCR (fast, good for complex layouts)
  - pytesseract (fallback, most tolerant)
- Quality detection: Automatically rejects unreadable documents
- Multi-language: French, English, German, Spanish, Italian, Portuguese, and more
- Semantic chunking: Context-aware text splitting using LangChain RecursiveCharacterTextSplitter
- Multiple strategies:
  - `semantic` - Smart splitting by meaning (default)
  - `sentence` - Split by sentences
  - `token` - Fixed token-based splitting
- Configurable: Token limits (50-2000), overlap (0-500), model selection
- Rich metadata: Source file, chunk index, token count, strategy, timestamps
- Automatic retry: Up to 3 attempts with exponential backoff (1s, 2s, 4s...)
- Interactive error handling:
  - `interactive` - Prompt user on each error (default)
  - `auto-continue` - Continue on errors (CI/CD mode)
  - `auto-stop` - Stop on first error (validation mode)
  - `auto-skip` - Skip failed files automatically
- Complete history: Every run saved to `~/.ragctl/history/`
- Retry capability: `ragctl retry` to rerun failed files only
- Per-file output: One chunk file per document for better traceability
- Export formats: JSON, JSONL (streaming), CSV (Excel-compatible)
- Vector store integration: Direct ingestion into Qdrant
- No database required: Pure file-based export for easy sharing
- Hierarchical config: CLI flags > Environment variables > YAML file > Defaults
- Example config: `config.example.yml` with detailed documentation
- Easy customization: Override any setting via the command line
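The interaction between the token limit and overlap settings above can be pictured with a short sketch. This is illustrative only, not ragctl's implementation: it splits on whitespace words rather than real tokenizer tokens, and `token_chunks` is a hypothetical helper name.

```python
def token_chunks(words, max_tokens=400, overlap=50):
    """Split a word list into windows of max_tokens items,
    where each window shares `overlap` items with its neighbour."""
    step = max_tokens - overlap
    return [words[i:i + max_tokens] for i in range(0, len(words), step)]

words = [f"w{i}" for i in range(1000)]
chunks = token_chunks(words, max_tokens=400, overlap=50)
# 1000 words with a step of 350 yields 3 chunks; consecutive
# chunks share their last/first 50 words.
```

The overlap keeps context that straddles a chunk boundary visible to both chunks, which is why a non-zero overlap usually improves retrieval quality.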
```bash
# Install from PyPI
pip install ragctl

# Verify installation
ragctl --version
```

```bash
# Clone repository
git clone [email protected]:datallmhub/ragctl.git
cd ragctl

# Install with pip
pip install -e .

# Verify installation
ragctl --version
```

```bash
# Process a single document
ragctl chunk document.pdf --show

# Process with advanced OCR for scanned documents
ragctl chunk scanned.pdf --advanced-ocr -o chunks.json

# Batch process a folder
ragctl batch ./documents --output ./chunks/

# Batch with auto-retry for CI/CD
ragctl batch ./documents --output ./chunks/ --auto-continue
```

```bash
# Simple text file
ragctl chunk document.txt --show

# PDF with semantic chunking (default)
ragctl chunk report.pdf -o report_chunks.json

# Scanned image with OCR
ragctl chunk contract.jpeg --advanced-ocr --show

# Custom chunking parameters
ragctl chunk document.pdf \
  --strategy semantic \
  --max-tokens 500 \
  --overlap 100 \
  -o output.jsonl
```
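The JSONL output written by `-o` is one JSON object per line, so downstream tooling can stream it without ragctl installed. A minimal reader sketch — the `text` and `chunk_index` field names are assumptions about the chunk schema, not a documented contract:

```python
import json
import os
import tempfile

def read_chunks(path):
    """Stream chunk records from a JSONL file (one JSON object per line)."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)

# Demo with a tiny, hypothetical chunk file.
sample = [
    {"chunk_index": 0, "text": "First chunk."},
    {"chunk_index": 1, "text": "Second chunk."},
]
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as fh:
    for record in sample:
        fh.write(json.dumps(record) + "\n")
    path = fh.name

chunks = list(read_chunks(path))
os.unlink(path)
```

Streaming line by line is the point of JSONL here: a batch run over thousands of documents never has to fit in memory at once.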
```bash
# Process all files in a directory
ragctl batch ./documents --output ./chunks/

# Process only PDFs recursively
ragctl batch ./documents \
  --pattern "*.pdf" \
  --recursive \
  --output ./chunks/

# CI/CD mode - continue on errors
ragctl batch ./documents \
  --output ./chunks/ \
  --auto-continue \
  --save-history
```

```bash
# Per-file output (default):
# chunks/
# ├── doc1_chunks.jsonl (25 chunks)
# ├── doc2_chunks.jsonl (42 chunks)
# └── doc3_chunks.jsonl (18 chunks)

# Single-file output (all chunks combined):
ragctl batch ./documents \
  --output ./all_chunks.jsonl \
  --single-file
```
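The auto-retry behaviour described earlier (up to 3 attempts with 1s, 2s, 4s delays) is the classic exponential-backoff pattern. A sketch of the idea — not ragctl's actual code, and `with_retry` is an illustrative name:

```python
import time

def with_retry(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure wait base_delay * 2**n seconds, then retry.
    Re-raises the last error after `attempts` failed tries."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** n)  # 1s, 2s, 4s, ...

# Demo: a flaky operation that fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_retry(flaky, attempts=3, base_delay=0.01)
```

Doubling the delay gives transient failures (rate limits, briefly locked files) time to clear without hammering the failing resource.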
```bash
# Show last failed run
ragctl retry --show

# Retry all failed files from last run
ragctl retry

# Retry specific run by ID
ragctl retry run_20251028_133403
```

```bash
# Ingest chunks into Qdrant
ragctl ingest chunks.jsonl \
  --collection my-docs \
  --url http://localhost:6333

# Get system info
ragctl info
```
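`ragctl ingest` handles the Qdrant upload itself. If you push chunks with your own tooling instead, Qdrant stores points carrying an `id`, a `vector`, and an optional `payload`. A sketch of shaping chunk records into that structure — the `embed` function here is a placeholder assumption, since this README does not describe ragctl's embedding model:

```python
def to_qdrant_points(chunks, embed):
    """Shape chunk dicts into Qdrant-style point dicts: {id, vector, payload}."""
    points = []
    for i, chunk in enumerate(chunks):
        points.append({
            "id": i,
            "vector": embed(chunk["text"]),
            "payload": {"text": chunk["text"], "source": chunk.get("source")},
        })
    return points

# Placeholder embedder for illustration; a real setup would use a
# sentence-embedding model producing fixed-size float vectors.
def fake_embed(text):
    return [float(len(text)), 0.0, 0.0]

points = to_qdrant_points([{"text": "hello", "source": "a.pdf"}], fake_embed)
```

Keeping the chunk text and its source file in the payload lets retrieval results be traced back to the original document.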
```bash
# Evaluate chunking strategy
ragctl eval document.pdf \
  --strategies semantic sentence token \
  --metrics coverage overlap coherence

# Compare strategies with visualization
ragctl eval document.pdf --compare --output eval_results.json
```

| Document | Description |
|---|---|
| Getting Started | Installation and first steps |
| CLI Guide | Complete command reference |
| Security | Security features and best practices |
| Full Documentation | Complete documentation index |
Create `~/.ragctl/config.yml` or use CLI flags:
```yaml
# OCR settings
ocr:
  use_advanced_ocr: false
  enable_fallback: true

# Chunking settings
chunking:
  strategy: semantic
  max_tokens: 400
  overlap: 50

# Output settings
output:
  format: jsonl
  include_metadata: true
  pretty_print: true
```

Configuration hierarchy: CLI flags > Environment variables > YAML config > Defaults
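The precedence order can be expressed as a chained dictionary merge, where later sources override earlier ones. This illustrates the rule, not ragctl's actual loader:

```python
defaults = {"strategy": "semantic", "max_tokens": 400, "overlap": 50}
yaml_config = {"max_tokens": 300}   # e.g. from ~/.ragctl/config.yml
env_config = {"overlap": 80}        # e.g. from environment variables
cli_flags = {"strategy": "token"}   # e.g. from the command line

# Lowest precedence first; each later dict overrides the one before it.
effective = {**defaults, **yaml_config, **env_config, **cli_flags}
```

Every setting keeps its default unless a higher-precedence source explicitly sets it, so a CLI flag always wins over the YAML file.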
```bash
# Run all tests
make test

# Run CLI tests
make test-cli

# Quick validation
ragctl --version
ragctl chunk tests/data/sample.txt --show
```

Test Coverage: 496 tests, 41% coverage
- Text documents: ~100-200 docs/minute
- PDFs with OCR: ~5-10 docs/minute (depends on page count)
- Batch processing: Parallel-ready with retry mechanism
- OCR accuracy: 95%+ with EasyOCR on clear scans
- Chunk quality: 90% readability threshold enforced
- Semantic coherence: LangChain's RecursiveCharacterTextSplitter optimized for context
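The eval metrics (coverage, overlap, coherence) are not formally defined in this README. As an intuition for coverage, a naive version measures how much of the source text the chunks reproduce — a hypothetical definition for illustration, assuming chunks are verbatim substrings of the source:

```python
def naive_coverage(source, chunks):
    """Fraction of source characters that fall inside at least one chunk."""
    covered = [False] * len(source)
    for chunk in chunks:
        start = source.find(chunk)
        while start != -1:
            for i in range(start, start + len(chunk)):
                covered[i] = True
            start = source.find(chunk, start + 1)
    return sum(covered) / len(source) if source else 0.0

text = "abcdefghij"
score = naive_coverage(text, ["abcde", "fgh"])
# 8 of 10 characters covered -> 0.8
```

A real evaluator would normalize whitespace and tokenize first, but the principle is the same: low coverage means the chunker silently dropped content.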
| Command | Description |
|---|---|
| `ragctl chunk` | Process a single document |
| `ragctl batch` | Batch process multiple files |
| `ragctl retry` | Retry failed files from history |
| `ragctl ingest` | Ingest chunks into Qdrant |
| `ragctl eval` | Evaluate chunking quality |
| `ragctl info` | System information |
Run `ragctl COMMAND --help` for detailed options.
**NumPy incompatibility**

```bash
# For OCR support, use NumPy 1.x
pip install "numpy<2.0"
```

**Missing system dependencies**

```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler
```

**"Document unreadable" errors**

- Try lowering the quality threshold: `--ocr-threshold 0.2`
- Use advanced OCR: `--advanced-ocr`
- Check that the document is not corrupted

**Import errors**

```bash
# Reinstall dependencies
pip install -e .
```

More help: Getting Started Guide
```bash
# Install dev dependencies
make install-dev

# Format code
make format

# Run linters
make lint

# Install pre-commit hooks
make pre-commit-install

# Run all CI checks
make ci-all
```

This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Built with:
- LangChain - Text splitting and document loading
- EasyOCR - OCR engine
- PaddleOCR - Alternative OCR engine
- Unstructured - Document parsing
- Typer - CLI framework
- Rich - Terminal formatting
Version: 0.1.4 | Status: Beta | License: MIT