1 stable release: new 1.0.0 (Mar 10, 2026)

#2399 in Command line utilities · 340KB · 6K SLoC
# 🕷️ Rust Scraper
Production-ready web scraper with Clean Architecture, TUI selector, and sitemap support.
## ✨ Features
### Core

- Async Web Scraping: Multi-threaded with the Tokio runtime
- Sitemap Support: Zero-allocation streaming parser
  - Gzip decompression (`.xml.gz`)
  - Sitemap index recursion (max depth 3)
  - Auto-discovery from `robots.txt`
- Interactive TUI: Select URLs before downloading
  - Checkbox selection (`[✅]`/`[⬜]`)
  - Keyboard navigation (↑↓, Space, Enter)
  - Confirmation mode (Y/N)
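The bounded-memory sitemap streaming described above can be sketched with Python's stdlib (the crate itself uses `quick-xml` in Rust; this is an illustration of the same idea, not the crate's code):

```python
# Sketch: stream a gzipped sitemap one <loc> at a time, keeping memory bounded.
import gzip
import io
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def iter_sitemap_urls(raw_gzip_bytes):
    """Yield <loc> values without loading the whole XML tree."""
    with gzip.open(io.BytesIO(raw_gzip_bytes), "rb") as fh:
        # iterparse emits events as it reads, so memory stays roughly constant
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == f"{SITEMAP_NS}loc":
                yield elem.text.strip()
            elem.clear()  # free the element once it has been processed

# Build a tiny .xml.gz sitemap in memory to demonstrate
xml = (b'<?xml version="1.0" encoding="UTF-8"?>'
       b'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
       b'<url><loc>https://example.com/a</loc></url>'
       b'<url><loc>https://example.com/b</loc></url>'
       b'</urlset>')
urls = list(iter_sitemap_urls(gzip.compress(xml)))
```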
### 🏗️ Architecture

- Clean Architecture: Domain → Application → Infrastructure → Adapters
- Error Handling: `thiserror` for libraries, `anyhow` for applications
- Dependency Injection: HTTP client, user agents, concurrency config
### ⚡ Performance

- True Streaming: Constant ~8KB RAM, no OOM
- Zero-Allocation Parsing: `quick-xml` for sitemaps
- LazyLock Cache: Syntax highlighting (2–10ms → ~0.01ms)
- Bounded Concurrency: Configurable parallel downloads
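Rust's `LazyLock` has no direct Python equivalent, but the compute-once pattern behind the highlighting cache can be sketched with `functools.cache` (the setup cost here is simulated, not measured from the crate):

```python
# Sketch: pay an expensive initialization once, then serve it from a cache.
import functools
import time

@functools.cache
def syntax_highlighter_config():
    """Stand-in for the expensive setup (2-10ms in the real crate)."""
    time.sleep(0.002)  # simulate costly initialization
    return {"theme": "base16", "languages": ["rust", "python"]}

t0 = time.perf_counter()
first = syntax_highlighter_config()   # pays the setup cost
t1 = time.perf_counter()
second = syntax_highlighter_config()  # served from the cache
t2 = time.perf_counter()

assert first is second          # same cached object is returned
assert (t2 - t1) < (t1 - t0)    # cached call is faster than the first
```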
### Security

- SSRF Prevention: URL host comparison (not string `contains`)
- Windows Safe: Reserved names blocked (`CON` → `CON_safe`)
- WAF Bypass Prevention: Chrome 131+ UAs with TTL caching
- RFC 3986 URLs: `url::Url::parse()` validation
## 📦 Installation
### From Source

```bash
git clone https://github.com/XaviCode1000/rust-scraper.git
cd rust-scraper
cargo build --release
```
The binary will be available at `target/release/rust_scraper`.
### From Cargo (coming soon)

```bash
cargo install rust_scraper
```
## Usage
### Basic (Headless Mode)

```bash
# Scrape all URLs from a website
./target/release/rust_scraper --url https://example.com

# With sitemap (auto-discovers from robots.txt)
./target/release/rust_scraper --url https://example.com --use-sitemap

# Explicit sitemap URL
./target/release/rust_scraper --url https://example.com \
  --use-sitemap \
  --sitemap-url https://example.com/sitemap.xml.gz
```
### Interactive Mode (TUI)

```bash
# Select URLs interactively before downloading
./target/release/rust_scraper --url https://example.com --interactive

# With sitemap
./target/release/rust_scraper --url https://example.com \
  --interactive \
  --use-sitemap
```
### TUI Controls

| Key | Action |
|---|---|
| ↑↓ | Navigate URLs |
| Space | Toggle selection |
| A | Select all |
| D | Deselect all |
| Enter | Confirm download |
| Y/N | Final confirmation |
| q | Quit |
### Advanced Options

```bash
# Full example with all options
./target/release/rust_scraper \
  --url https://example.com \
  --output ./output \
  --format markdown \
  --download-images \
  --download-documents \
  --use-sitemap \
  --concurrency 5 \
  --delay-ms 1000 \
  --max-pages 100 \
  --verbose
```
### RAG Export Pipeline (JSONL Format)

Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.

```bash
# Export to JSONL (one JSON object per line)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data

# Resume interrupted scraping (skip already processed URLs)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --resume

# Custom state directory (isolate state per project)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --state-dir ./state --resume
```
#### JSONL Schema

Each line is a valid JSON object with the following structure:

```json
{
  "id": "uuid-v4",
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "metadata": {
    "domain": "example.com",
    "excerpt": "Meta description or excerpt"
  },
  "timestamp": "2026-03-09T10:00:00Z"
}
```
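A consumer can read this schema line by line with nothing but the stdlib; a minimal sketch (the sample record is fabricated to match the schema above):

```python
# Sketch: parse and sanity-check JSONL export lines against the documented schema.
import json

sample_jsonl = json.dumps({
    "id": "00000000-0000-4000-8000-000000000001",
    "url": "https://example.com/page",
    "title": "Page Title",
    "content": "Extracted content...",
    "metadata": {"domain": "example.com", "excerpt": "Meta description"},
    "timestamp": "2026-03-09T10:00:00Z",
})

records = []
for line in sample_jsonl.splitlines():
    if not line.strip():
        continue  # JSONL readers should tolerate blank lines
    rec = json.loads(line)
    # minimal check that the documented top-level fields are present
    assert {"id", "url", "title", "content", "metadata", "timestamp"} <= rec.keys()
    records.append(rec)
```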
#### State Management

- Location: `~/.cache/rust-scraper/state/<domain>.json`
- Tracks: Processed URLs, timestamps, status
- Atomic saves: Write to tmp + rename (crash-safe)
- Resume mode: `--resume` flag enables state tracking
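The write-to-tmp-then-rename pattern can be sketched in Python (a stdlib stand-in to illustrate the crash-safety idea, not the crate's Rust implementation):

```python
# Sketch: atomic JSON state save -- readers see the old file or the new one,
# never a torn, half-written file.
import json
import os
import tempfile

def atomic_save(path: str, state: dict) -> None:
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(state, fh)
            fh.flush()
            os.fsync(fh.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "example.com.json")
    atomic_save(target, {"processed": ["https://example.com/a"], "status": "ok"})
    with open(target) as fh:
        saved = json.load(fh)
```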
#### RAG Integration

JSONL format is compatible with:

- Qdrant: Load via Python SDK
- Weaviate: Batch import
- Pinecone: Upsert from JSONL
- LangChain: `JSONLoader` for document loading
```python
# Example: Load JSONL with LangChain (requires the `jq` Python package)
from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./rag_data/export.jsonl',
    jq_schema='.content',
    text_content=False,
    json_lines=True  # treat the file as JSON Lines, one object per line
)
documents = loader.load()
```
### Get Help

```bash
./target/release/rust_scraper --help
```
## Documentation
- Usage Guide - Detailed examples and troubleshooting
- Architecture - Clean Architecture details
- API Docs - Rust documentation
## 🧪 Testing
```bash
# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url
```
Tests: 216 passing ✅
## 🏗️ Architecture

```
Domain (entities, errors)
    ↑
Application (services, use cases)
    ↑
Infrastructure (HTTP, parsers, converters)
    ↑
Adapters (TUI, CLI, detectors)
```

Dependency Rule: Dependencies point inward. Domain never imports frameworks.
See `docs/ARCHITECTURE.md` for detailed architecture documentation.
## 🔧 Development
### Requirements
- Rust 1.80+
- Cargo
### Build

```bash
# Debug build
cargo build

# Release build (optimized)
cargo build --release
```
### Lint

```bash
# Run Clippy (deny warnings)
cargo clippy -- -D warnings

# Check formatting
cargo fmt --all -- --check
```
### Run

```bash
# Run in debug mode
cargo run -- --url https://example.com

# Run in release mode
cargo run --release -- --url https://example.com
```
## License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.
## Acknowledgments
- Built with Clean Architecture principles
- Inspired by ripgrep performance patterns
- Uses rust-skills (179 rules)
## Stats
- Lines of Code: ~4,000
- Tests: 216 passing
- Coverage: High (state-based testing)
- MSRV: 1.80.0
## 🗺️ Roadmap
- v1.1.0: Multi-domain crawling
- v1.2.0: JavaScript rendering (headless browser)
- v2.0.0: Distributed scraping
Made with ❤️ using Rust and Clean Architecture
Dependencies: ~37–59MB, ~877K SLoC