
Crate rust_scraper


Rust Scraper — Production-ready web scraper with Clean Architecture

Rust Scraper is a high-performance, async web scraper designed for building RAG (Retrieval-Augmented Generation) datasets. Built with Clean Architecture principles for production use.

§Features

  • Async Web Scraping: Multi-threaded with Tokio runtime
  • Sitemap Support: Zero-allocation streaming parser (quick-xml)
    • Gzip decompression (async-compression)
    • Sitemap index recursion (max depth 3)
    • Auto-discovery from robots.txt
  • Interactive TUI: Ratatui + crossterm URL selector
    • Interactive checkbox selection
    • Confirmation mode before download
    • Terminal restore on panic/exit
  • Clean Architecture: Domain → Application → Infrastructure → Adapters
  • Error Handling: thiserror for libraries, anyhow for applications
  • Performance: True streaming (~8KB RAM), LazyLock cache, bounded concurrency
  • Security: SSRF prevention, Windows-safe filenames, WAF bypass prevention

§Architecture

Following Clean Architecture with four layers:

Domain (entities, errors)
    ↓
Application (services, use cases)
    ↓
Infrastructure (HTTP, parsers, converters)
    ↓
Adapters (TUI, CLI, detectors)

Dependency Rule: Dependencies point inward. Domain never imports frameworks.
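The dependency rule can be sketched as a module skeleton. All names below are illustrative stand-ins, not the crate's actual items; the point is only the direction of the `use` statements:

```rust
// Minimal sketch of the layered layout; inner layers never reference outer ones.
mod domain {
    // Pure entities and errors: no framework imports.
    pub struct Page {
        pub url: String,
        pub body: String,
    }
}

mod application {
    // Use cases depend only on domain.
    use crate::domain::Page;

    pub fn word_count(page: &Page) -> usize {
        page.body.split_whitespace().count()
    }
}

mod infrastructure {
    // External implementations depend inward on domain/application.
    use crate::domain::Page;

    pub fn fetch_stub(url: &str) -> Page {
        Page {
            url: url.to_string(),
            body: "hello clean architecture".to_string(),
        }
    }
}

fn main() {
    // Adapters (the outermost layer) wire the pieces together.
    let page = infrastructure::fetch_stub("https://example.com");
    let n = application::word_count(&page);
    assert_eq!(n, 3);
    println!("{} words from {}", n, page.url);
}
```

Because `domain` has no outward imports, it can be unit-tested and reused without pulling in Tokio, Ratatui, or any HTTP stack.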

§Examples

§Basic Usage

use rust_scraper::{create_http_client, scrape_with_readability};

let client = create_http_client()?;
let url = url::Url::parse("https://example.com")?;
let results = scrape_with_readability(&client, &url).await?;

§URL Discovery with Sitemap

use rust_scraper::{discover_urls_for_tui, CrawlerConfig};
use url::Url;

let seed = Url::parse("https://example.com")?;
let config = CrawlerConfig::builder(seed)
    .concurrency(5)
    .use_sitemap(true)
    .build();

let urls = discover_urls_for_tui("https://example.com", &config).await?;
println!("Found {} URLs", urls.len());

§Custom Configuration

use rust_scraper::ScraperConfig;

let config = ScraperConfig::default()
    .with_images()
    .with_documents()
    .with_output_dir("./output".into())
    .with_scraper_concurrency(5);

assert!(config.download_images);
assert!(config.download_documents);
assert_eq!(config.scraper_concurrency, 5);

§Error Handling

This library uses thiserror for type-safe error handling. All fallible functions return Result<T, ScraperError>.

use rust_scraper::{validate_and_parse_url, ScraperError};

match validate_and_parse_url("https://example.com") {
    Ok(url) => println!("Valid URL: {}", url),
    Err(ScraperError::InvalidUrl(msg)) => eprintln!("Invalid URL: {}", msg),
    Err(e) => eprintln!("Error: {}", e),
}
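As an illustration of the pattern (not the crate's actual definition), a thiserror-derived error roughly expands to hand-written `Display` and `std::error::Error` impls like this sketch; the variant names are hypothetical:

```rust
use std::fmt;

// Hand-written analogue of what `#[derive(thiserror::Error)]` generates.
// Variants here are illustrative, not the crate's real ScraperError.
#[derive(Debug)]
enum ScraperError {
    InvalidUrl(String),
    Io(std::io::Error),
}

impl fmt::Display for ScraperError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ScraperError::InvalidUrl(msg) => write!(f, "invalid URL: {msg}"),
            ScraperError::Io(e) => write!(f, "I/O error: {e}"),
        }
    }
}

impl std::error::Error for ScraperError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            ScraperError::Io(e) => Some(e),
            _ => None,
        }
    }
}

fn main() {
    let err = ScraperError::InvalidUrl("missing scheme".into());
    assert_eq!(err.to_string(), "invalid URL: missing scheme");
    println!("{err}");
}
```

thiserror generates exactly this boilerplate from `#[error("...")]` attributes, which is why it suits libraries: callers get typed variants to match on, while anyhow-style erasure stays in the binary.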

§Performance

  • Streaming: Constant ~8KB RAM usage, no OOM risks
  • Zero-Allocation Parsing: quick-xml for sitemaps
  • LazyLock Cache: Syntax highlighting (2-10ms → ~0.01ms)
  • Bounded Concurrency: Configurable parallel downloads
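The LazyLock caching pattern above can be sketched with std alone. The real cache memoizes syntax-highlighting state; the computation below is a stand-in, and note that `std::sync::LazyLock` requires Rust 1.80+:

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// Global cache initialized lazily on first access; the expensive step
// runs once per key. Stand-in computation, not the crate's actual cache.
static CACHE: LazyLock<Mutex<HashMap<String, usize>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn highlight_len(source: &str) -> usize {
    let mut cache = CACHE.lock().unwrap();
    *cache
        .entry(source.to_string())
        // Pretend this closure is the 2-10ms highlighting pass.
        .or_insert_with(|| source.chars().count())
}

fn main() {
    assert_eq!(highlight_len("fn main() {}"), 12);
    // Second call hits the cache instead of recomputing.
    assert_eq!(highlight_len("fn main() {}"), 12);
    println!("cached entries: {}", CACHE.lock().unwrap().len());
}
```

This is how a one-time cost (2-10ms per highlight) amortizes to a near-free hash-map lookup (~0.01ms) on repeated inputs.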

§Security

  • SSRF Prevention: URL host comparison (not string contains)
  • Windows Safe: Reserved names blocked (CON → CON_safe)
  • WAF Bypass Prevention: Chrome 131+ UAs with TTL caching
  • Input Validation: url::Url::parse() (RFC 3986 compliant)

§Testing

# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url

Tests: 19 passing ✅

§MSRV

Minimum Supported Rust Version: 1.75.0

Re-exports§

pub use domain::ContentType;
pub use domain::CrawlError;
pub use domain::CrawlResult;
pub use domain::CrawlerConfig;
pub use domain::CrawlerConfigBuilder;
pub use domain::DiscoveredUrl;
pub use domain::DownloadedAsset;
pub use domain::ExportFormat;
pub use domain::ScrapedContent;
pub use domain::ValidUrl;
pub use application::crawl_site;
pub use application::crawl_with_sitemap;
pub use application::create_http_client;
pub use application::discover_urls_for_tui;
pub use application::extract_domain;
pub use application::is_allowed;
pub use application::is_excluded;
pub use application::matches_pattern;
pub use application::scrape_multiple_with_limit;
pub use application::scrape_urls_for_tui;
pub use application::scrape_with_config;
pub use application::scrape_with_readability;
pub use infrastructure::converter;
pub use infrastructure::crawler;
pub use infrastructure::export::jsonl_exporter;
pub use infrastructure::export::state_store;
pub use infrastructure::export::zvec_exporter;
pub use infrastructure::http;
pub use infrastructure::output::file_saver;
pub use infrastructure::scraper::readability;
pub use url_path::Domain;
pub use url_path::OutputPath;
pub use url_path::UrlPath;
pub use user_agent::get_random_user_agent_from_pool;
pub use user_agent::UserAgentCache;
pub use export_factory::create_exporter;
pub use export_factory::domain_from_url;
pub use export_factory::process_results;
pub use error::Result;
pub use error::ScraperError;
pub use infrastructure::output::file_saver::save_results;

Modules§

adapters
Adapters — External integrations (feature-gated)
application
Application layer — Use cases and orchestration
config
Configuration Module
domain
Domain layer — Core business entities (pure, no frameworks)
error
Error handling module for rust_scraper.
export_factory
Export factory for creating exporters based on format
extractor
Asset Extraction Module
infrastructure
Infrastructure layer — External implementations (HTTP, FS, converters)
url_path
URL Path Types Module
user_agent
User-Agent module with TTL-based caching

Structs§

Args
CLI Arguments for the rust-scraper binary.
ConcurrencyConfig
Concurrency configuration with smart auto-detection
ScraperConfig
Configuration for web scraping and asset downloading.

Enums§

OutputFormat
Output format for scraped content.

Traits§

Parser
Parse command-line arguments into Self.
ValueEnum
Parse arguments into enums.

Functions§

validate_and_parse_url
Validate and parse a URL string using the url crate (RFC 3986 compliant).

Derive Macros§

Parser
Generates the Parser implementation.
ValueEnum
Generates the ValueEnum impl.