# rust-scraper

Production-ready web scraper with Clean Architecture, TUI selector, and AI-powered semantic cleaning.
## Table of Contents

- Features
- Installation
- Usage
- Testing
- Architecture
- Documentation
- Development
- Bug Fixes
- Contributing
- License
## Features

- **Async Web Scraping** – multi-threaded with the Tokio runtime, bounded concurrency
- **Sitemap Support** – zero-allocation streaming parser (`quick-xml`)
  - Gzip decompression (`.xml.gz`) via `async-compression`
  - Sitemap index recursion (max depth 3)
  - Auto-discovery from `robots.txt`
- **TUI Interactive Selector** – Ratatui + crossterm URL picker
  - Checkbox selection (`[x]` / `[ ]`)
  - Keyboard navigation (↑/↓, Space, Enter)
  - Confirmation mode (Y/N) before download
- **RAG Export Pipeline** – JSONL format optimized for Retrieval-Augmented Generation
  - State management with resume capability
  - Atomic saves (write to tmp + rename)
  - Compatible with Qdrant, Weaviate, Pinecone, LangChain
- **Semantic Cleaning** – local SLM inference (100% privacy, no API calls)
  - 87% accuracy vs 13% with fixed-size chunking
  - AVX2 SIMD acceleration (4-8x speedup on CachyOS)
  - ✅ Embeddings preservation bug fixed – see Bug Fixes
  - See `docs/AI-SEMANTIC-CLEANING.md`

### Architecture & Code Quality

- **Clean Architecture** – 4 layers: Domain → Application → Infrastructure → Adapters
- **Error Handling** – `thiserror` for libraries, `anyhow` for applications
- **Dependency Injection** – HTTP client, user agents, concurrency config
- **Type-Safe APIs** – newtypes for IDs, validated types at boundaries
- **True Streaming** – constant ~8 KB RAM usage, no OOM risks
- **Zero-Allocation Parsing** – `quick-xml` for sitemaps
- **LazyLock Cache** – syntax highlighting (2-10 ms → ~0.01 ms)
- **Bounded Concurrency** – configurable parallel downloads (HDD-aware defaults)
- **Hardware-Aware** – auto-detects CPU cores, adjusts concurrency accordingly
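The bounded-concurrency model above is easy to picture with a semaphore: tasks are spawned for all URLs, but only a fixed number may be in flight at once. A minimal, illustrative Python sketch (the scraper itself uses Tokio primitives in Rust; `fetch` here is a stand-in, not a real request):

```python
# Sketch: bounded concurrency with a semaphore (illustrative only).
import asyncio

async def fetch(url: str) -> str:
    # Stand-in for a real HTTP request.
    await asyncio.sleep(0)
    return f"body of {url}"

async def crawl(urls: list[str], concurrency: int = 5) -> list[str]:
    # The semaphore caps how many downloads run at the same time.
    sem = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> str:
        async with sem:
            return await fetch(url)

    # Results come back in input order.
    return await asyncio.gather(*(bounded(u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(crawl([f"https://example.com/p{i}" for i in range(10)]))
    print(len(pages))  # 10
```

The same shape applies whether the cap comes from `--concurrency 5` or from the hardware-aware `auto` mode.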
### Security

- **SSRF Prevention** – URL host comparison (not string `contains`)
- **Windows Safe** – reserved names blocked (`CON` → `CON_safe`)
- **WAF Bypass Prevention** – Chrome 131+ UAs with TTL caching
- **RFC 3986 URLs** – `url::Url::parse()` validation
- **Input Validation** – all user input validated at boundaries
## Installation

```bash
git clone https://github.com/XaviCode1000/rust-scraper.git
cd rust-scraper
cargo build --release
```

Binary location: `target/release/rust_scraper`

### Requirements

- Rust: 1.80+ (MSRV)
- Cargo: 1.80+
- Optional (AI features): CMake and a C++17 toolchain for `tract-onnx`

### Feature Flags

| Feature | Description | Dependencies |
|---|---|---|
| `images` | Enable image downloading | `mimetype-detector` |
| `documents` | Enable document downloading | `mimetype-detector` |
| `full` | All features except AI | `images`, `documents`, `zvec` |
| `ai` | AI-powered semantic cleaning | `tract-onnx`, `tokenizers`, `hf-hub`, `ort` |

Build with AI features:

```bash
cargo build --release --features ai
```

## Usage

### Basic Scraping

```bash
# Scrape all URLs from a website
./target/release/rust_scraper --url https://example.com

# With sitemap (auto-discovers from robots.txt)
./target/release/rust_scraper --url https://example.com --use-sitemap

# Explicit sitemap URL
./target/release/rust_scraper --url https://example.com \
  --use-sitemap \
  --sitemap-url https://example.com/sitemap.xml.gz
```

### Interactive Mode

```bash
# Select URLs interactively before downloading
./target/release/rust_scraper --url https://example.com --interactive

# With sitemap
./target/release/rust_scraper --url https://example.com \
  --interactive \
  --use-sitemap
```

| Key | Action |
|---|---|
| `↑` / `↓` | Navigate URLs |
| `Space` | Toggle selection |
| `A` | Select all |
| `D` | Deselect all |
| `Enter` | Confirm download |
| `Y` / `N` | Final confirmation |
| `q` | Quit |
### Advanced Options

```bash
# Full example with all options
./target/release/rust_scraper \
  --url https://example.com \
  --output ./output \
  --format markdown \
  --download-images \
  --download-documents \
  --use-sitemap \
  --concurrency 5 \
  --delay-ms 1000 \
  --max-pages 100 \
  --verbose

# Hardware-aware concurrency (auto-detects CPU)
./target/release/rust_scraper \
  --url https://example.com \
  --concurrency auto
```

### AI Semantic Cleaning

```bash
# Enable AI semantic cleaning
./target/release/rust_scraper \
  --url https://example.com \
  --clean-ai \
  --ai-threshold 0.3 \
  --export-format jsonl

# Custom AI model (advanced)
./target/release/rust_scraper \
  --url https://example.com \
  --clean-ai \
  --ai-model sentence-transformers/all-MiniLM-L6-v2
```

Requirements: compile with `--features ai`.
### RAG Export

Export content in JSON Lines format, optimized for RAG (Retrieval-Augmented Generation) pipelines.
```bash
# Export to JSONL (one JSON object per line)
./target/release/rust_scraper \
  --url https://example.com \
  --export-format jsonl \
  --output ./rag_data

# Resume interrupted scraping (skip already processed URLs)
./target/release/rust_scraper \
  --url https://example.com \
  --export-format jsonl \
  --output ./rag_data \
  --resume

# Custom state directory (isolate state per project)
./target/release/rust_scraper \
  --url https://example.com \
  --export-format jsonl \
  --output ./rag_data \
  --state-dir ./state \
  --resume
```

Each line is a valid JSON object:

```json
{
  "id": "uuid-v4",
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "Extracted content...",
  "metadata": {
    "domain": "example.com",
    "excerpt": "Meta description or excerpt"
  },
  "timestamp": "2026-03-11T10:00:00Z"
}
```

State management:

- Location: `~/.cache/rust-scraper/state/<domain>.json`
- Tracks: processed URLs, timestamps, status
- Atomic saves: write to tmp + rename (crash-safe)
- Resume mode: the `--resume` flag enables state tracking
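The atomic-save pattern described above (write the full state to a temporary file, then rename it into place) guarantees a crash never leaves a half-written state file, which is what makes `--resume` safe. A minimal Python sketch of the idea; paths and field names are illustrative, not the scraper's exact schema:

```python
# Sketch: crash-safe state saves via tmp-file + atomic rename.
import json
import os
import tempfile

def save_state_atomic(path: str, state: dict) -> None:
    # Write the whole document to a temp file in the same directory...
    dirname = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # ...then atomically swap it into place. Readers see either the
        # old file or the new one, never a partial write.
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise

def should_skip(state: dict, url: str) -> bool:
    # Resume mode: skip URLs already recorded as processed.
    return url in state.get("processed", [])
```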
The JSONL format is compatible with Qdrant, Weaviate, Pinecone, and LangChain:

```python
# Example: Load JSONL with LangChain
from langchain.document_loaders import JSONLoader

loader = JSONLoader(
    file_path='./rag_data/export.jsonl',
    jq_schema='.content',
    text_content=False
)
documents = loader.load()
```

### CLI Reference

```bash
./target/release/rust_scraper --help
```

## Testing

```bash
# Run all tests (281 total)
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url

# Run AI integration tests (64 tests)
cargo test --features ai --test ai_integration -- --test-threads=2

# Run library tests only (217 tests)
cargo test --lib
```

| Test Suite | Count | Status |
|---|---|---|
| Library Tests | 217 | ✅ Passing |
| AI Integration | 64 | ✅ Passing |
| **Total** | **281** | ✅ All Passing |

Note: AI tests require `--features ai` and run with `--test-threads=2` for stability.
## Architecture

```text
┌──────────────────────────────────────────┐
│ Adapters (TUI, CLI, Detectors)           │ ← External interfaces
├──────────────────────────────────────────┤
│ Infrastructure (HTTP, Parsers, AI)       │ ← Technical implementations
├──────────────────────────────────────────┤
│ Application (Services, Use Cases)        │ ← Business orchestration
├──────────────────────────────────────────┤
│ Domain (Entities, Value Objects)         │ ← Pure business logic
└──────────────────────────────────────────┘
```

**Dependency Rule:** Dependencies point inward. Domain never imports frameworks.
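The dependency rule can be sketched in a few lines: the domain layer defines an abstract port, the infrastructure layer implements it, and the outer layer injects the implementation. An illustrative Python sketch (names are hypothetical, not the crate's actual types):

```python
# Sketch: dependencies point inward — the domain defines the port,
# infrastructure implements it, and the wiring happens at the edge.
import json
from abc import ABC, abstractmethod

# --- Domain: pure logic, no framework imports ---
class Exporter(ABC):
    @abstractmethod
    def export(self, record: dict) -> str: ...

# --- Application: orchestrates through the abstract port only ---
def run_export(records: list[dict], exporter: Exporter) -> list[str]:
    return [exporter.export(r) for r in records]

# --- Infrastructure: a concrete implementation, injected from outside ---
class JsonlExporter(Exporter):
    def export(self, record: dict) -> str:
        return json.dumps(record)

lines = run_export([{"url": "https://example.com"}], JsonlExporter())
print(lines[0])  # {"url": "https://example.com"}
```

Swapping `JsonlExporter` for another `Exporter` implementation never touches the domain or application code, which is the point of the rule.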
| Layer | Purpose | Dependencies |
|---|---|---|
| Domain | Core entities, value objects, business rules | None (pure Rust) |
| Application | Use cases, services, orchestration | Domain |
| Infrastructure | HTTP, parsers, AI, exporters | Domain, Application |
| Adapters | TUI, CLI, external integrations | All layers |
### Design Patterns

- **Builder Pattern** – `CrawlerConfig::builder()`, `ScraperConfig::default()`
- **Repository Pattern** – `Exporter` trait for different output formats
- **Strategy Pattern** – pluggable semantic cleaning strategies
- **Typestate Pattern** – compile-time state validation

See `docs/ARCHITECTURE.md` for detailed architecture documentation.
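For readers unfamiliar with the builder pattern named above, here is a hypothetical Python analogue of what `CrawlerConfig::builder()` provides: fluent, chainable setters with defaults, and validation deferred to `build()`. All names and default values below are made up for the sketch:

```python
# Sketch: the builder pattern — fluent configuration with deferred validation.
class CrawlerConfigBuilder:
    def __init__(self):
        # Defaults are illustrative, not the crate's actual values.
        self._concurrency = 5
        self._delay_ms = 1000
        self._max_pages = None

    def concurrency(self, n: int) -> "CrawlerConfigBuilder":
        self._concurrency = n
        return self  # returning self enables chaining

    def delay_ms(self, ms: int) -> "CrawlerConfigBuilder":
        self._delay_ms = ms
        return self

    def max_pages(self, n: int) -> "CrawlerConfigBuilder":
        self._max_pages = n
        return self

    def build(self) -> dict:
        # Validation happens once, at the end of the chain.
        if self._concurrency < 1:
            raise ValueError("concurrency must be >= 1")
        return {
            "concurrency": self._concurrency,
            "delay_ms": self._delay_ms,
            "max_pages": self._max_pages,
        }

config = CrawlerConfigBuilder().concurrency(8).max_pages(100).build()
```

In Rust, the typestate variant of this pattern can additionally make invalid chains fail to compile rather than fail at `build()` time.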
## Documentation

| Document | Description |
|---|---|
| `docs/USAGE.md` | Detailed usage examples and troubleshooting |
| `docs/ARCHITECTURE.md` | Clean Architecture design decisions |
| `docs/AI-SEMANTIC-CLEANING.md` | AI-powered content extraction (v1.0.5+) |
| `docs/RAG-EXPORT.md` | RAG export pipeline and JSONL format |
| `docs/CLI.md` | Complete CLI reference |
| `docs/CONTRIBUTING.md` | Contribution guidelines |
| `docs/CHANGES.md` | Changelog and version history |
API documentation:

```bash
cargo doc --open
```

Online docs: https://docs.rs/rust_scraper
## Development

### Prerequisites

- Rust: 1.80+ (MSRV)
- Cargo: 1.80+
- Optional: CMake, C++17, liblz4-dev (for the `zvec` feature)
### Building

```bash
# Debug build
cargo build

# Release build (optimized)
cargo build --release

# With AI features
cargo build --release --features ai

# Full features
cargo build --release --features full,ai
```

### Linting

```bash
# Run Clippy (deny warnings)
cargo clippy -- -D warnings

# Check formatting
cargo fmt --all -- --check

# Run all checks
cargo clippy --all-targets --all-features -- -D warnings
```

### Running

```bash
# Run in debug mode
cargo run -- --url https://example.com

# Run in release mode
cargo run --release -- --url https://example.com

# With AI features
cargo run --release --features ai -- --url https://example.com --clean-ai
```

### Performance Tips

```bash
# Limit parallel test threads (4C/4T CPU); note the extra `--`,
# since --test-threads is a test-binary flag, not a cargo flag
cargo test -- --test-threads=2

# I/O-heavy operations (HDD optimization)
ionice -c 3 cargo build

# Rebuild the standard library from source (nightly; requires an explicit target)
cargo +nightly build --release -Z build-std --target x86_64-unknown-linux-gnu
```

Recommended release profile:

```toml
[profile.release]
opt-level = 3
lto = "fat"
codegen-units = 1
panic = "abort"
strip = true
```

## Bug Fixes

### Embeddings Preservation

**Problem:** The AI semantic cleaner was discarding embedding vectors during relevance filtering.
**Symptoms:**

- Log: "Generated 0 chunks with embeddings"
- JSONL output: `embeddings: null` for all chunks
- Data loss: 49,536 dimensions of embedding vectors lost

**Root Cause:**

- `filter_by_relevance()` was not preserving embeddings after filtering
- Ownership-transfer issues caused unnecessary cloning

**Solution:**

- Modified `filter_by_relevance()` to use `filter_with_embeddings()`
- Restored embeddings after filtering, before returning output
- Added an integration test to validate that embeddings are present
- Optimized ownership transfer using the `with_embeddings()` builder pattern
- Eliminated unnecessary chunk cloning (50-100% performance improvement)

**Impact:**

- ✅ 149 chunks with embeddings: now preserved
- ✅ 49,536 dimensions: no longer lost
- ✅ Memory usage: reduced by ~50% in the hot path
- ✅ Performance: 2x faster chunk processing

**Technical Details:**

- File: `src/infrastructure/ai/semantic_cleaner_impl.rs`
- Function: `filter_by_relevance()`
- PR: #11
- Commits: c7ca7b4, c966529

**Code Review Compliance:**

- ✅ `anti-unwrap-abuse` – no `.unwrap()` in production
- ✅ `own-borrow-over-clone` – minimized cloning
- ✅ `mem-reuse-collections` – pre-allocated vectors
- ✅ `async-join-parallel` – concurrent embeddings
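The failure mode is easy to reproduce in miniature: relevance filtering must keep each chunk's embedding paired with its text, rather than rebuilding chunks from text alone. A schematic Python sketch of the bug and the fix (not the actual Rust implementation; function names here are illustrative):

```python
# Sketch: why filtering must preserve the text/embedding pairing.

def filter_buggy(chunks, scores, threshold):
    # Bug: rebuilds surviving chunks from text only,
    # silently dropping their embeddings.
    return [{"text": c["text"], "embedding": None}
            for c, s in zip(chunks, scores) if s >= threshold]

def filter_fixed(chunks, scores, threshold):
    # Fix: filter whole records, so embeddings survive
    # alongside the text they belong to.
    return [c for c, s in zip(chunks, scores) if s >= threshold]

chunks = [
    {"text": "relevant content", "embedding": [0.1, 0.2]},
    {"text": "boilerplate", "embedding": [0.3, 0.4]},
]
scores = [0.9, 0.1]

assert filter_buggy(chunks, scores, 0.3)[0]["embedding"] is None        # data loss
assert filter_fixed(chunks, scores, 0.3)[0]["embedding"] == [0.1, 0.2]  # preserved
```

In Rust the fixed version also avoids cloning: the surviving chunks are moved through the filter rather than rebuilt, which is where the reported performance gain comes from.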
## Contributing

1. Fork the repository
2. Clone your fork:

   ```bash
   git clone https://github.com/YOUR_USERNAME/rust-scraper.git
   cd rust-scraper
   ```

3. Create a branch: `git checkout -b feat/your-feature-name`
4. Make changes and test: `cargo test --all-features`
5. Commit and push:

   ```bash
   git commit -m "feat: add your feature"
   git push origin feat/your-feature-name
   ```

6. Open a Pull Request

### Commit Convention

We follow Conventional Commits:

```text
<type>(<scope>): <subject>

<body>

<footer>
```

Types:

- `feat` – new feature
- `fix` – bug fix
- `docs` – documentation changes
- `style` – formatting (no logic change)
- `refactor` – code restructuring
- `test` – adding tests
- `chore` – maintenance tasks

Example:

```text
feat(ai): add semantic cleaning with embeddings

- Implement SemanticCleaner trait
- Add ONNX runtime integration
- Preserve embeddings during filtering
- Add integration tests

Closes #9
```
### Code Standards

- Rust: follow the Rust API Guidelines
- Formatting: `cargo fmt`
- Linting: `cargo clippy -- -D warnings`
- Testing: all PRs must pass existing tests and add new tests for new features

This project follows the rust-skills repository (179 rules):

- CRITICAL: `own-*`, `err-*`, `mem-*` (ownership, errors, memory)
- HIGH: `api-*`, `async-*`, `opt-*` (API design, async, optimization)
- MEDIUM: `name-*`, `type-*`, `test-*`, `doc-*` (naming, types, testing, docs)
- LOW: `proj-*`, `lint-*` (project structure, linting)

Never:

- ❌ `.unwrap()` in production code
- ❌ Locks held across `.await`
- ❌ `&Vec<T>` when `&[T]` works
- ❌ `format!()` in hot paths

See rust-skills/INDEX.md for the full catalog.
### Workflow

```bash
# 1. Create branch
git checkout -b feat/your-feature

# 2. Make changes
# Edit files...

# 3. Run tests
cargo test --all-features

# 4. Lint
cargo clippy -- -D warnings

# 5. Format
cargo fmt

# 6. Commit
git add .
git commit -m "feat: your feature description"

# 7. Push
git push -u origin feat/your-feature
```

See `docs/CONTRIBUTING.md` for detailed contribution guidelines.
## License

Licensed under either of:

- Apache License, Version 2.0 (`LICENSE-APACHE`)
- MIT License (`LICENSE-MIT`)

at your option.

By contributing to this project, you agree that your contributions will be licensed under the same dual-license terms.
## Project Metrics

| Metric | Value |
|---|---|
| Lines of Code | 3,754 (src/) |
| Total Tests | 281 passing |
| Public Functions | 64 |
| MSRV | 1.80.0 |
| Dependencies | 45+ (core), 60+ (with AI) |
| Latest Version | 1.0.5 |
| Latest Commit | 39779d6 |
## Roadmap

Released:

- v1.0.0 – core scraping, TUI, sitemap support
- v1.0.5 – AI-powered semantic cleaning (Issue #9)
- v1.0.5 – embeddings preservation bug fix (PR #11)
- v1.0.5 – performance optimization (eliminated unnecessary cloning)

Planned:

- v1.1.0 – multi-domain crawling
- v1.2.0 – JavaScript rendering (headless browser)
- v2.0.0 – distributed scraping
## Acknowledgments

- Built with Clean Architecture principles
- Inspired by ripgrep performance patterns
- Uses rust-skills (179 rules)
- AI features powered by tract-onnx and HuggingFace tokenizers

Made with ❤️ using Rust and Clean Architecture

Current Status: ✅ All tests passing (281/281) | ✅ CI/CD enabled | ✅ Production-ready