#web-scraping #web-crawler #rag #tui #web

bin+lib rust_scraper

Production-ready web scraper with Clean Architecture, TUI selector, and sitemap support

1 stable release

new 1.0.0 Mar 10, 2026

#2399 in Command line utilities

MIT/Apache

340KB
6K SLoC

πŸ•·οΈ Rust Scraper

Production-ready web scraper with Clean Architecture, TUI selector, and sitemap support.


✨ Features

πŸš€ Core

  • Async Web Scraping: Multi-threaded with Tokio runtime
  • Sitemap Support: Zero-allocation streaming parser
    • Gzip decompression (.xml.gz)
    • Sitemap index recursion (max depth 3)
    • Auto-discovery from robots.txt
  • Interactive TUI: Select URLs before downloading
    • Checkbox selection ([βœ…] / [⬜])
    • Keyboard navigation (↑↓, Space, Enter)
    • Confirmation mode (Y/N)
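Sitemap auto-discovery works because robots.txt can declare sitemaps via `Sitemap:` directives (Sitemaps protocol). A minimal Python sketch of that lookup, purely illustrative and not the project's Rust implementation:

```python
def discover_sitemaps(robots_txt: str) -> list[str]:
    """Collect every URL declared in a 'Sitemap:' directive (case-insensitive)."""
    sitemaps = []
    for line in robots_txt.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("sitemap:"):
            # Split only on the first ':' so the URL's own colons survive.
            sitemaps.append(stripped.split(":", 1)[1].strip())
    return sitemaps

robots = "User-agent: *\nDisallow: /admin\nSitemap: https://example.com/sitemap.xml.gz"
assert discover_sitemaps(robots) == ["https://example.com/sitemap.xml.gz"]
```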

πŸ—οΈ Architecture

  • Clean Architecture: Domain β†’ Application β†’ Infrastructure β†’ Adapters
  • Error Handling: thiserror for libraries, anyhow for applications
  • Dependency Injection: HTTP client, user agents, concurrency config

⚑ Performance

  • True Streaming: Constant ~8KB RAM, no OOM
  • Zero-Allocation Parsing: quick-xml for sitemaps
  • LazyLock Cache: Syntax highlighting (2-10ms β†’ ~0.01ms)
  • Bounded Concurrency: Configurable parallel downloads
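The streaming idea can be illustrated in Python: decompress a gzipped sitemap in fixed-size chunks so memory stays bounded regardless of file size. This is a conceptual sketch, not the project's quick-xml pipeline:

```python
import gzip
import io

def stream_decompress(source, chunk_size=8192):
    """Yield decompressed chunks; memory use stays bounded by chunk_size."""
    with gzip.GzipFile(fileobj=source) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:
                break
            yield chunk

# Demo: a gzipped payload held in memory stands in for sitemap.xml.gz.
payload = b"<url><loc>https://example.com/</loc></url>" * 1000
compressed = io.BytesIO(gzip.compress(payload))
assert b"".join(stream_decompress(compressed)) == payload
```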

πŸ”’ Security

  • SSRF Prevention: URL host comparison (not string contains)
  • Windows Safe: Reserved names blocked (CON β†’ CON_safe)
  • WAF Bypass Prevention: Chrome 131+ UAs with TTL caching
  • RFC 3986 URLs: url::Url::parse() validation

πŸ“¦ Installation

From Source

git clone https://github.com/XaviCode1000/rust-scraper.git
cd rust-scraper
cargo build --release

The binary will be available at target/release/rust_scraper.

From Cargo (coming soon)

cargo install rust_scraper

πŸš€ Usage

Basic (Headless Mode)

# Scrape all URLs from a website
./target/release/rust_scraper --url https://example.com

# With sitemap (auto-discovers from robots.txt)
./target/release/rust_scraper --url https://example.com --use-sitemap

# Explicit sitemap URL
./target/release/rust_scraper --url https://example.com \
  --use-sitemap \
  --sitemap-url https://example.com/sitemap.xml.gz

Interactive Mode (TUI)

# Select URLs interactively before downloading
./target/release/rust_scraper --url https://example.com --interactive

# With sitemap
./target/release/rust_scraper --url https://example.com \
  --interactive \
  --use-sitemap

TUI Controls

Key    Action
↑/↓    Navigate URLs
Space  Toggle selection
A      Select all
D      Deselect all
Enter  Confirm download
Y/N    Final confirmation
q      Quit

Advanced Options

# Full example with all options
./target/release/rust_scraper \
  --url https://example.com \
  --output ./output \
  --format markdown \
  --download-images \
  --download-documents \
  --use-sitemap \
  --concurrency 5 \
  --delay-ms 1000 \
  --max-pages 100 \
  --verbose

RAG Export Pipeline (JSONL Format)

Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.

# Export to JSONL (one JSON object per line)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data

# Resume interrupted scraping (skip already processed URLs)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --resume

# Custom state directory (isolate state per project)
./target/release/rust_scraper --url https://example.com --export-format jsonl --output ./rag_data --state-dir ./state --resume

JSONL Schema

Each line is a valid JSON object with the following structure:

{
  "id": "uuid-v4",
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "metadata": {
    "domain": "example.com",
    "excerpt": "Meta description or excerpt"
  },
  "timestamp": "2026-03-09T10:00:00Z"
}
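Because every line is an independent JSON object, the export can be consumed with a plain line-by-line loop and no special parser. A minimal sketch (`load_jsonl` is illustrative, not part of this crate):

```python
import json

def load_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# One export line shaped like the schema above.
line = json.dumps({
    "id": "demo", "url": "https://example.com/page",
    "title": "Page Title", "content": "Extracted content...",
    "metadata": {"domain": "example.com", "excerpt": "Meta description"},
    "timestamp": "2026-03-09T10:00:00Z",
})
records = load_jsonl(line)
assert records[0]["metadata"]["domain"] == "example.com"
```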

State Management

  • Location: ~/.cache/rust-scraper/state/<domain>.json
  • Tracks: Processed URLs, timestamps, status
  • Atomic saves: Write to tmp + rename (crash-safe)
  • Resume mode: --resume flag enables state tracking
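The tmp + rename pattern behind the atomic saves can be sketched in a few lines of Python (`atomic_save` is a hypothetical helper mirroring the idea, not the crate's code):

```python
import json
import os
import tempfile

def atomic_save(path: str, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target.

    os.replace is atomic on both POSIX and Windows, so a crash mid-write
    never leaves a truncated state file behind."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

state_path = os.path.join(tempfile.mkdtemp(), "example.com.json")
atomic_save(state_path, {"processed": ["https://example.com/"]})
```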

RAG Integration

JSONL format is compatible with:

  • Qdrant: Load via Python SDK
  • Weaviate: Batch import
  • Pinecone: Upsert from JSONL
  • LangChain: JSONLoader for document loading

# Example: Load JSONL with LangChain
from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./rag_data/export.jsonl',
    jq_schema='.content',
    text_content=False,
    json_lines=True,  # treat the file as JSON Lines, one object per line
)
documents = loader.load()

Get Help

./target/release/rust_scraper --help

πŸ§ͺ Testing

# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url

Tests: 216 passing βœ…

πŸ—οΈ Architecture

Domain (entities, errors)
    ↓
Application (services, use cases)
    ↓
Infrastructure (HTTP, parsers, converters)
    ↓
Adapters (TUI, CLI, detectors)

Dependency Rule: Dependencies point inward. Domain never imports frameworks.
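The rule can be illustrated in a few lines (Python here for brevity; the names are hypothetical, not this crate's types): the domain declares the interface it needs, the application depends only on that interface, and infrastructure supplies the concrete implementation from outside.

```python
from typing import Protocol

# Domain layer: declares the port it needs; imports no frameworks.
class Fetcher(Protocol):
    def fetch(self, url: str) -> str: ...

# Application layer: depends only on the domain-level interface.
def scrape_title(fetcher: Fetcher, url: str) -> str:
    html = fetcher.fetch(url)
    start = html.find("<title>") + len("<title>")
    return html[start:html.find("</title>")]

# Infrastructure layer: a concrete implementation, injected from outside.
# (A real one would wrap an HTTP client; a stub keeps the sketch runnable.)
class StubHttpClient:
    def fetch(self, url: str) -> str:
        return "<html><title>Example Domain</title></html>"

title = scrape_title(StubHttpClient(), "https://example.com")  # "Example Domain"
```

Because the application sees only the `Fetcher` interface, the HTTP client can be swapped for a stub in tests without touching domain or application code.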

See docs/ARCHITECTURE.md for detailed architecture documentation.

πŸ”§ Development

Requirements

  • Rust 1.80+
  • Cargo

Build

# Debug build
cargo build

# Release build (optimized)
cargo build --release

Lint

# Run Clippy (deny warnings)
cargo clippy -- -D warnings

# Check formatting
cargo fmt --all -- --check

Run

# Run in debug mode
cargo run -- --url https://example.com

# Run in release mode
cargo run --release -- --url https://example.com

πŸ“„ License

Licensed under either of:

  • Apache License, Version 2.0
  • MIT License

at your option.

πŸ™ Acknowledgments

πŸ“Š Stats

  • Lines of Code: ~4,000
  • Tests: 216 passing
  • Coverage: High (state-based testing)
  • MSRV: 1.80.0

πŸ—ΊοΈ Roadmap

  • v1.1.0: Multi-domain crawling
  • v1.2.0: JavaScript rendering (headless browser)
  • v2.0.0: Distributed scraping

Made with ❀️ using Rust and Clean Architecture

Dependencies

~37–59MB
~877K SLoC