# 🕷️ Rust Scraper

Production-ready web scraper with Clean Architecture, a TUI selector, and sitemap support.
## ✨ Features
### 🚀 Core

- **Async web scraping**: multi-threaded with the Tokio runtime
- **Sitemap support**: zero-allocation streaming parser
  - Gzip decompression (`.xml.gz`)
  - Sitemap index recursion (max depth 3)
  - Auto-discovery from `robots.txt`
- **Interactive TUI**: select URLs before downloading
  - Checkbox selection (`[✅]` / `[⬜]`)
  - Keyboard navigation (↑/↓, Space, Enter)
  - Confirmation mode (Y/N)
### 🏗️ Architecture

- **Clean Architecture**: Domain ← Application ← Infrastructure ← Adapters
- **Error handling**: `thiserror` for libraries, `anyhow` for applications
- **Dependency injection**: HTTP client, user agents, concurrency config
- **True streaming**: constant ~8 KB RAM, no OOM
- **Zero-allocation parsing**: `quick-xml` for sitemaps
- **LazyLock cache**: syntax highlighting (2-10 ms → ~0.01 ms)
- **Bounded concurrency**: configurable parallel downloads
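The `LazyLock` caching idea can be sketched in a few lines of std-only Rust (the names here are illustrative, not the crate's actual types): the expensive setup runs exactly once, on first access, and every later call only pays for a lookup.

```rust
use std::sync::LazyLock;

// Hypothetical stand-in for the syntax-highlighting setup: an expensive,
// one-time initialization whose result is shared by every thread.
// `LazyLock` (stable since Rust 1.80, the project's MSRV) runs the
// closure exactly once, on first access.
static THEMES: LazyLock<Vec<String>> = LazyLock::new(|| {
    (0..4).map(|i| format!("theme-{i}")).collect()
});

/// Later calls skip the setup entirely and only pay for the index lookup.
fn theme(i: usize) -> &'static str {
    &THEMES[i]
}
```

This is how a 2-10 ms setup cost amortizes down to near-zero per call: only the first caller pays it.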
### 🔒 Security

- **SSRF prevention**: URL host comparison (not string `contains`)
- **Windows safe**: reserved names blocked (`CON` → `CON_safe`)
- **WAF bypass prevention**: Chrome 131+ user agents with TTL caching
- **RFC 3986 URLs**: validated via `url::Url::parse()`
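The SSRF point above is easy to get wrong, so here is a minimal illustration (the function name is ours, not the crate's; the real crate parses the URL with `url::Url::parse()` first and compares the extracted host):

```rust
/// Illustrative SSRF guard: compare the host component parsed out of the
/// URL for equality, case-insensitively, instead of running a substring
/// check on the raw URL string.
fn same_host(allowed: &str, candidate: &str) -> bool {
    allowed.eq_ignore_ascii_case(candidate)
}
```

A substring check would accept `https://example.com.evil.io/` because the string contains `example.com`; exact host comparison rejects it while still accepting case variants of the real host.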
## 📦 Installation

### From source

```bash
git clone https://github.com/XaviCode1000/rust-scraper.git
cd rust-scraper
cargo build --release
```

The binary will be available at `target/release/rust_scraper`.

### From Cargo (coming soon)

```bash
cargo install rust_scraper
```
## 🚀 Usage

### Basic (headless mode)

```bash
# Scrape all URLs from a website
./target/release/rust_scraper --url https://example.com

# With sitemap (auto-discovers from robots.txt)
./target/release/rust_scraper --url https://example.com --use-sitemap

# Explicit sitemap URL
./target/release/rust_scraper --url https://example.com \
  --use-sitemap \
  --sitemap-url https://example.com/sitemap.xml.gz
```

### Interactive mode (TUI)

```bash
# Select URLs interactively before downloading
./target/release/rust_scraper --url https://example.com --interactive

# With sitemap
./target/release/rust_scraper --url https://example.com \
  --interactive \
  --use-sitemap
```
#### TUI controls

| Key | Action |
| --- | --- |
| ↑/↓ | Navigate URLs |
| Space | Toggle selection |
| A | Select all |
| D | Deselect all |
| Enter | Confirm download |
| Y/N | Final confirmation |
| q | Quit |
### Advanced options

```bash
# Full example with all options
./target/release/rust_scraper \
  --url https://example.com \
  --output ./output \
  --format markdown \
  --download-images \
  --download-documents \
  --use-sitemap \
  --concurrency 5 \
  --delay-ms 1000 \
  --max-pages 100 \
  --verbose
```

### JSONL export for RAG

Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.

```bash
# Export to JSONL (one JSON object per line)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data

# Resume interrupted scraping (skip already processed URLs)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --resume

# Custom state directory (isolate state per project)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --state-dir ./state --resume
```
#### JSONL schema

Each line is a valid JSON object with the following structure:

```json
{
  "id": "uuid-v4",
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "metadata": {
    "domain": "example.com",
    "excerpt": "Meta description or excerpt"
  },
  "timestamp": "2026-03-09T10:00:00Z"
}
```
#### State management

- **Location**: `~/.cache/rust-scraper/state/<domain>.json`
- **Tracks**: processed URLs, timestamps, status
- **Atomic saves**: write to a temp file, then rename (crash-safe)
- **Resume mode**: the `--resume` flag enables state tracking
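The atomic-save pattern above can be sketched in std-only Rust (the helper name is illustrative): write the complete new state to a temporary file, flush it, then rename it over the real file, so a crash never leaves a half-written state file behind.

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

/// Crash-safe save: write everything to a sibling temp file, sync it,
/// then atomically rename it over the destination. On the same filesystem
/// `rename` replaces the target in one step, so readers see either the
/// old state or the new one, never a partial write.
fn save_state_atomic(path: &Path, contents: &str) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(contents.as_bytes())?;
    file.sync_all()?; // flush to disk before the rename
    fs::rename(&tmp, path)
}
```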
#### RAG integration

The JSONL format is compatible with:

- **Qdrant**: load via the Python SDK
- **Weaviate**: batch import
- **Pinecone**: upsert from JSONL
- **LangChain**: `JSONLoader` for document loading

```python
# Example: load the exported JSONL with LangChain
from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./rag_data/export.jsonl',
    jq_schema='.content',
    text_content=False,
    json_lines=True,  # treat the file as JSON Lines, one object per line
)
documents = loader.load()
```
### Get help

```bash
./target/release/rust_scraper --help
```
## 📚 Documentation
## 🧪 Testing

```bash
# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run a specific test
cargo test test_validate_and_parse_url
```

Tests: 216 passing ✅
## 🏗️ Architecture

```text
Domain          (entities, errors)
   ↑
Application     (services, use cases)
   ↑
Infrastructure  (HTTP, parsers, converters)
   ↑
Adapters        (TUI, CLI, detectors)
```

**Dependency rule**: dependencies point inward. Domain never imports frameworks.

See `docs/ARCHITECTURE.md` for detailed architecture documentation.
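The dependency rule can be seen in a tiny sketch (module and trait names are illustrative, not the crate's actual ones): the inner Domain layer defines a trait, and the outer Infrastructure layer depends inward by implementing it, so Domain never names an HTTP library.

```rust
mod domain {
    // The inner layer owns the abstraction; it knows nothing about HTTP.
    pub trait PageFetcher {
        fn fetch(&self, url: &str) -> Result<String, String>;
    }
}

mod infrastructure {
    use super::domain::PageFetcher;

    // The outer layer depends inward by implementing the domain trait.
    // A real implementation would wrap an HTTP client here.
    pub struct StubHttpClient;

    impl PageFetcher for StubHttpClient {
        fn fetch(&self, url: &str) -> Result<String, String> {
            Ok(format!("<html>fetched {url}</html>"))
        }
    }
}
```

Swapping the HTTP client (or injecting a stub in tests) then never touches the domain code, which is the point of the inward-pointing arrows above.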
## 🔧 Development

### Requirements

- Rust 1.80.0 or later (MSRV)

### Build

```bash
# Debug build
cargo build

# Release build (optimized)
cargo build --release
```

### Lint

```bash
# Run Clippy (deny warnings)
cargo clippy -- -D warnings

# Check formatting
cargo fmt --all -- --check
```

### Run

```bash
# Run in debug mode
cargo run -- --url https://example.com

# Run in release mode
cargo run --release -- --url https://example.com
```
## 📝 License

Licensed under either of the licenses included in the repository, at your option.
## 🙏 Acknowledgments
## 📊 Stats

- **Lines of code**: ~4,000+
- **Tests**: 216 passing
- **Coverage**: high (state-based testing)
- **MSRV**: 1.80.0
## 🗺️ Roadmap

---

Made with ❤️ using Rust and Clean Architecture