Rust Scraper — Production-ready web scraper with Clean Architecture
Rust Scraper is a high-performance, async web scraper designed for building RAG (Retrieval-Augmented Generation) datasets. Built with Clean Architecture principles for production use.
§Features
- Async Web Scraping: Multi-threaded with Tokio runtime
- Sitemap Support: Zero-allocation streaming parser (quick-xml)
- Gzip decompression (async-compression)
- Sitemap index recursion (max depth 3)
- Auto-discovery from robots.txt
- Interactive TUI: Ratatui + crossterm URL selector
- Interactive checkbox selection
- Confirmation mode before download
- Terminal restore on panic/exit
- Clean Architecture: Domain → Application → Infrastructure → Adapters
- Error Handling: thiserror for libraries, anyhow for applications
- Performance: True streaming (~8KB RAM), LazyLock cache, bounded concurrency
- Security: SSRF prevention, Windows-safe filenames, WAF bypass prevention
§Architecture
Following Clean Architecture with four layers:
Domain (entities, errors)
↓
Application (services, use cases)
↓
Infrastructure (HTTP, parsers, converters)
↓
Adapters (TUI, CLI, detectors)
Dependency Rule: Dependencies point inward. Domain never imports frameworks.
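The dependency rule can be shown with a minimal module skeleton (names illustrative, not the crate's exact tree): outer layers import inner ones, never the reverse.

```rust
mod domain {
    // Inner layer: plain data types only, no framework imports.
    pub struct DiscoveredUrl(pub String);
}

mod application {
    use crate::domain::DiscoveredUrl; // application -> domain: allowed
    use std::collections::HashSet;

    // Example use case: drop duplicate URLs while preserving order.
    pub fn dedupe(urls: Vec<DiscoveredUrl>) -> Vec<DiscoveredUrl> {
        let mut seen = HashSet::new();
        urls.into_iter().filter(|u| seen.insert(u.0.clone())).collect()
    }
}

fn main() {
    let urls = vec![
        crate::domain::DiscoveredUrl("https://example.com/a".into()),
        crate::domain::DiscoveredUrl("https://example.com/a".into()),
    ];
    println!("{}", crate::application::dedupe(urls).len()); // prints 1
}
```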
§Examples
§Basic Usage
```rust
use rust_scraper::{create_http_client, scrape_with_readability, ScraperConfig};

let client = create_http_client()?;
let url = url::Url::parse("https://example.com")?;
let config = ScraperConfig::default(); // used by config-aware entry points such as scrape_with_config
let results = scrape_with_readability(&client, &url).await?;
```
§URL Discovery with Sitemap
```rust
use rust_scraper::{discover_urls_for_tui, CrawlerConfig};
use url::Url;

let seed = Url::parse("https://example.com")?;
let config = CrawlerConfig::builder(seed)
    .concurrency(5)
    .use_sitemap(true)
    .build();
let urls = discover_urls_for_tui("https://example.com", &config).await?;
println!("Found {} URLs", urls.len());
```
§Custom Configuration
```rust
use rust_scraper::ScraperConfig;

let config = ScraperConfig::default()
    .with_images()
    .with_documents()
    .with_output_dir("./output".into())
    .with_scraper_concurrency(5);

assert!(config.download_images);
assert!(config.download_documents);
assert_eq!(config.scraper_concurrency, 5);
```
§Error Handling
This library uses thiserror for type-safe error handling.
All fallible functions return Result<T, ScraperError>.
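For illustration, a variant like InvalidUrl expands (via thiserror's derive) to roughly this hand-written, std-only equivalent; the two variants here are a sketch, and the real ScraperError has more:

```rust
use std::fmt;

// Illustrative error enum; thiserror generates the Display and
// Error impls below from #[error("...")] attributes.
#[derive(Debug)]
enum ScraperError {
    InvalidUrl(String),
    Io(std::io::Error),
}

impl fmt::Display for ScraperError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ScraperError::InvalidUrl(msg) => write!(f, "invalid URL: {msg}"),
            ScraperError::Io(e) => write!(f, "I/O error: {e}"),
        }
    }
}

impl std::error::Error for ScraperError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            ScraperError::Io(e) => Some(e), // preserve the underlying cause
            _ => None,
        }
    }
}

fn main() {
    let err = ScraperError::InvalidUrl("not a scheme".into());
    println!("{err}"); // prints: invalid URL: not a scheme
}
```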
```rust
use rust_scraper::{validate_and_parse_url, ScraperError};

match validate_and_parse_url("https://example.com") {
    Ok(url) => println!("Valid URL: {}", url),
    Err(ScraperError::InvalidUrl(msg)) => eprintln!("Invalid URL: {}", msg),
    Err(e) => eprintln!("Error: {}", e),
}
```
§Performance
- Streaming: Constant ~8KB RAM usage, no OOM risks
- Zero-Allocation Parsing: quick-xml for sitemaps
- LazyLock Cache: Syntax highlighting (2-10ms → ~0.01ms)
- Bounded Concurrency: Configurable parallel downloads
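The LazyLock cache pattern can be sketched with the standard library alone. The workload here (uppercasing) is a stand-in for the real 2-10 ms highlighting setup, and the helper name is hypothetical:

```rust
use std::collections::HashMap;
use std::sync::{LazyLock, Mutex};

// Global cache, initialized lazily on first access.
static CACHE: LazyLock<Mutex<HashMap<String, String>>> =
    LazyLock::new(|| Mutex::new(HashMap::new()));

fn highlight(src: &str) -> String {
    let mut cache = CACHE.lock().unwrap();
    cache
        .entry(src.to_string())
        .or_insert_with(|| src.to_uppercase()) // expensive work runs only on a miss
        .clone()
}

fn main() {
    println!("{}", highlight("fn main() {}")); // first call computes
    println!("{}", highlight("fn main() {}")); // second call hits the cache
}
```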
§Security
- SSRF Prevention: URL host comparison (not string contains)
- Windows Safe: Reserved names blocked (CON → CON_safe)
- WAF Bypass Prevention: Chrome 131+ UAs with TTL caching
- Input Validation: url::Url::parse() (RFC 3986 compliant)
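The Windows-safe filename rule (CON → CON_safe) might look like the following sketch; the helper is hypothetical and the reserved-name list here is only a subset of what Windows actually blocks:

```rust
// Subset of Windows reserved device names, for illustration.
const RESERVED: [&str; 8] = ["CON", "PRN", "AUX", "NUL", "COM1", "COM2", "LPT1", "LPT2"];

// Hypothetical sanitizer: appends "_safe" to the stem when the
// filename (ignoring extension and case) is a reserved device name.
fn windows_safe(name: &str) -> String {
    let (stem, ext) = match name.split_once('.') {
        Some((s, e)) => (s, Some(e)),
        None => (name, None),
    };
    if RESERVED.iter().any(|r| r.eq_ignore_ascii_case(stem)) {
        match ext {
            Some(e) => format!("{stem}_safe.{e}"),
            None => format!("{stem}_safe"),
        }
    } else {
        name.to_string()
    }
}

fn main() {
    println!("{}", windows_safe("CON"));        // prints CON_safe
    println!("{}", windows_safe("report.txt")); // prints report.txt
}
```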
§Testing
```shell
# Run all tests
cargo test

# Run with output
cargo test -- --nocapture

# Run specific test
cargo test test_validate_and_parse_url
```
Tests: 19 passing ✅
§MSRV
Minimum Supported Rust Version: 1.75.0
Re-exports§
pub use domain::ContentType;
pub use domain::CrawlError;
pub use domain::CrawlResult;
pub use domain::CrawlerConfig;
pub use domain::CrawlerConfigBuilder;
pub use domain::DiscoveredUrl;
pub use domain::DownloadedAsset;
pub use domain::ExportFormat;
pub use domain::ScrapedContent;
pub use domain::ValidUrl;
pub use application::crawl_site;
pub use application::crawl_with_sitemap;
pub use application::create_http_client;
pub use application::discover_urls_for_tui;
pub use application::extract_domain;
pub use application::is_allowed;
pub use application::is_excluded;
pub use application::is_internal_link;
pub use application::matches_pattern;
pub use application::scrape_multiple_with_limit;
pub use application::scrape_urls_for_tui;
pub use application::scrape_with_config;
pub use application::scrape_with_readability;
pub use infrastructure::converter;
pub use infrastructure::crawler;
pub use infrastructure::export::jsonl_exporter;
pub use infrastructure::export::state_store;
pub use infrastructure::export::zvec_exporter;
pub use infrastructure::http;
pub use infrastructure::output::file_saver;
pub use infrastructure::scraper::readability;
pub use url_path::Domain;
pub use url_path::OutputPath;
pub use url_path::UrlPath;
pub use user_agent::get_random_user_agent_from_pool;
pub use user_agent::UserAgentCache;
pub use export_factory::create_exporter;
pub use export_factory::domain_from_url;
pub use export_factory::process_results;
pub use error::Result;
pub use error::ScraperError;
pub use infrastructure::output::file_saver::save_results;
Modules§
- adapters
- Adapters — External integrations (feature-gated)
- application
- Application layer — Use cases and orchestration
- config
- Configuration Module
- domain
- Domain layer — Core business entities (pure, no frameworks)
- error
- Error handling module for rust_scraper.
- export_factory
- Export factory for creating exporters based on format
- extractor
- Asset Extraction Module
- infrastructure
- Infrastructure layer — External implementations (HTTP, FS, converters)
- url_path
- URL Path Types Module
- user_agent
- User-Agent module with TTL-based caching
Structs§
- Args
- CLI Arguments for the rust-scraper binary.
- ConcurrencyConfig
- Concurrency configuration with smart auto-detection
- ScraperConfig
- Configuration for web scraping and asset downloading.
Enums§
- OutputFormat
- Output format for scraped content.
Functions§
- validate_and_parse_url
- Validate and parse a URL string using the url crate (RFC 3986 compliant).