
Crate rust_scraper


Rust Scraper — Production-ready web scraper with Clean Architecture

Rust Scraper is a high-performance, async web scraper designed for building RAG (Retrieval-Augmented Generation) datasets. Built with Clean Architecture principles for production use.

§Features

  • Async Web Scraping: Multi-threaded with Tokio runtime
  • Sitemap Support: Zero-allocation streaming parser (quick-xml)
    • Gzip decompression (async-compression)
    • Sitemap index recursion (max depth 3)
    • Auto-discovery from robots.txt
  • Interactive TUI: Ratatui + crossterm URL selector
    • Interactive checkbox selection
    • Confirmation mode before download
    • Terminal restore on panic/exit
  • Clean Architecture: Domain → Application → Infrastructure → Adapters
  • Error Handling: thiserror for libraries, anyhow for applications
  • Performance: True streaming (~8KB RAM), LazyLock cache, bounded concurrency
  • Security: SSRF prevention, Windows-safe filenames, WAF bypass prevention

§Architecture

Following Clean Architecture with four layers:

Domain (entities, errors)
    ↓
Application (services, use cases)
    ↓
Infrastructure (HTTP, parsers, converters)
    ↓
Adapters (TUI, CLI, detectors)

Dependency Rule: Dependencies point inward. Domain never imports frameworks.
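The dependency rule can be sketched as a module skeleton. All names below are illustrative stand-ins, not the crate's actual items; the point is only the direction of the `use` statements:

```rust
// Minimal sketch of the layered layout; inner layers never reference outer ones.
mod domain {
    // Pure entities and errors: no framework imports.
    pub struct Page {
        pub url: String,
        pub body: String,
    }
}

mod application {
    // Use cases depend only on domain.
    use crate::domain::Page;

    pub fn word_count(page: &Page) -> usize {
        page.body.split_whitespace().count()
    }
}

mod infrastructure {
    // External implementations depend inward on domain/application.
    use crate::domain::Page;

    pub fn fetch_stub(url: &str) -> Page {
        Page {
            url: url.to_string(),
            body: "hello clean architecture".to_string(),
        }
    }
}

fn main() {
    // Adapters (the outermost layer) wire the pieces together.
    let page = infrastructure::fetch_stub("https://example.com");
    let n = application::word_count(&page);
    assert_eq!(n, 3);
    println!("{} words from {}", n, page.url);
}
```

Because `domain` has no outward imports, it can be unit-tested and reused without pulling in Tokio, Ratatui, or any HTTP stack.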

§Examples

§Basic Usage

use rust_scraper::{create_http_client, scrape_with_readability};

let client = create_http_client()?;
let url = url::Url::parse("https://example.com")?;
let results = scrape_with_readability(&client, &url).await?;

§URL Discovery with Sitemap

use rust_scraper::{discover_urls_for_tui, CrawlerConfig};
use url::Url;

let seed = Url::parse("https://example.com")?;
let config = CrawlerConfig::builder(seed)
    .concurrency(5)
    .use_sitemap(true)
    .build();

let urls = discover_urls_for_tui("https://example.com", &config).await?;
println!("Found {} URLs", urls.len());

§Custom Configuration

use rust_scraper::ScraperConfig;

let config = ScraperConfig::default()
    .with_images()
    .with_documents()
    .with_output_dir("./output".into())
    .with_scraper_concurrency(5);

assert!(config.download_images);
assert!(config.download_documents);
assert_eq!(config.scraper_concurrency, 5);

§Error Handling

This library uses thiserror for type-safe error handling. All fallible functions return Result<T, ScraperError>.

use rust_scraper::{validate_and_parse_url, ScraperError};

match validate_and_parse_url("https://example.com") {
    Ok(url) => println!("Valid URL: {}", url),
    Err(ScraperError::InvalidUrl(msg)) => eprintln!("Invalid URL: {}", msg),
    Err(e) => eprintln!("Error: {}", e),
}
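As an illustration of the pattern (not the crate's actual definition), a thiserror-derived error roughly expands to hand-written `Display` and `std::error::Error` impls like this sketch; the variant names are hypothetical:

```rust
use std::fmt;

// Hand-written analogue of what `#[derive(thiserror::Error)]` generates.
// Variants here are illustrative, not the crate's real ScraperError.
#[derive(Debug)]
enum ScraperError {
    InvalidUrl(String),
    Io(std::io::Error),
}

impl fmt::Display for ScraperError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ScraperError::InvalidUrl(msg) => write!(f, "invalid URL: {msg}"),
            ScraperError::Io(e) => write!(f, "I/O error: {e}"),
        }
    }
}

impl std::error::Error for ScraperError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            ScraperError::Io(e) => Some(e),
            _ => None,
        }
    }
}

fn main() {
    let err = ScraperError::InvalidUrl("missing scheme".into());
    assert_eq!(err.to_string(), "invalid URL: missing scheme");
    println!("{err}");
}
```

thiserror generates exactly this boilerplate from `#[error("...")]` attributes, which is why it suits libraries: callers get typed variants to match on, while anyhow-style erasure stays in the binary.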

§Performance

  • Streaming: Constant ~8KB RAM usage, no OOM risks
  • Zero-Allocation Parsing: quick-xml for sitemaps
  • LazyLock Cache: Syntax highlighting (2-10ms → ~0.01ms)
  • Bounded Concurrency: Configurable parallel downloads
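The LazyLock caching pattern above can be sketched with std alone. The real cache memoizes syntax-highlighting state; the computation below is a stand-in, and note that `std::sync::LazyLock` requires Rust 1.80+:

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// Global cache initialized lazily on first access; the expensive step
// runs once per key. Stand-in computation, not the crate's actual cache.
static CACHE: LazyLock<Mutex<HashMap<String, usize>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn highlight_len(source: &str) -> usize {
    let mut cache = CACHE.lock().unwrap();
    *cache
        .entry(source.to_string())
        // Pretend this closure is the 2-10ms highlighting pass.
        .or_insert_with(|| source.chars().count())
}

fn main() {
    assert_eq!(highlight_len("fn main() {}"), 12);
    // Second call hits the cache instead of recomputing.
    assert_eq!(highlight_len("fn main() {}"), 12);
    println!("cached entries: {}", CACHE.lock().unwrap().len());
}
```

This is how a one-time cost (2-10ms per highlight) amortizes to a near-free hash-map lookup (~0.01ms) on repeated inputs.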

§Security

  • SSRF Prevention: URL host comparison (not string contains)
  • Windows Safe: Reserved names blocked (CON → CON_safe)
  • WAF Bypass Prevention: Chrome 131+ UAs with TTL caching
  • Input Validation: url::Url::parse() (RFC 3986 compliant)

§Testing

# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url

Tests: 19 passing ✅

§MSRV

Minimum Supported Rust Version: 1.75.0

Re-exports§

pub use domain::ContentType;
pub use domain::CrawlError;
pub use domain::CrawlResult;
pub use domain::CrawlerConfig;
pub use domain::CrawlerConfigBuilder;
pub use domain::DiscoveredUrl;
pub use domain::DownloadedAsset;
pub use domain::ExportFormat;
pub use domain::ScrapedContent;
pub use domain::ValidUrl;
pub use application::crawl_site;
pub use application::crawl_with_sitemap;
pub use application::create_http_client;
pub use application::discover_urls_for_tui;
pub use application::extract_domain;
pub use application::is_allowed;
pub use application::is_excluded;
pub use application::matches_pattern;
pub use application::scrape_multiple_with_limit;
pub use application::scrape_urls_for_tui;
pub use application::scrape_with_config;
pub use application::scrape_with_readability;
pub use infrastructure::converter;
pub use infrastructure::crawler;
pub use infrastructure::export::jsonl_exporter;
pub use infrastructure::export::state_store;
pub use infrastructure::export::zvec_exporter;
pub use infrastructure::http;
pub use infrastructure::output::file_saver;
pub use infrastructure::scraper::readability;
pub use url_path::Domain;
pub use url_path::OutputPath;
pub use url_path::UrlPath;
pub use user_agent::get_random_user_agent_from_pool;
pub use user_agent::UserAgentCache;
pub use export_factory::create_exporter;
pub use export_factory::domain_from_url;
pub use export_factory::process_results;
pub use error::Result;
pub use error::ScraperError;
pub use infrastructure::output::file_saver::save_results;

Modules§

adapters
Adapters — External integrations (feature-gated)
application
Application layer — Use cases and orchestration
config
Configuration Module
domain
Domain layer — Core business entities (pure, no frameworks)
error
Error handling module for rust_scraper.
export_factory
Export factory for creating exporters based on format
extractor
Asset Extraction Module
infrastructure
Infrastructure layer — External implementations (HTTP, FS, converters)
url_path
URL Path Types Module
user_agent
User-Agent module with TTL-based caching

Structs§

Args
CLI Arguments for the rust-scraper binary.
ConcurrencyConfig
Concurrency configuration with smart auto-detection
ScraperConfig
Configuration for web scraping and asset downloading.

Enums§

OutputFormat
Output format for scraped content.

Traits§

Parser
Parse command-line arguments into Self.
ValueEnum
Parse arguments into enums.

Functions§

validate_and_parse_url
Validate and parse a URL string using the url crate (RFC 3986 compliant).

Derive Macros§

Parser
Generates the Parser implementation.
ValueEnum
Generates the ValueEnum impl.