crawn


A utility for web crawling and scraping


Features

  • Blazing fast – Built with Rust & tokio for async I/O and concurrency
  • Smart filtering – URL-based keyword matching (no content fetching required)
  • NDJSON output – One JSON object per line for easy streaming
  • BFS crawling – Breadth-first traversal with configurable depth limits (sketched after this list)
  • Rate limiting – Configurable request rate (default: ~2 req/sec)
  • Error recovery – Gracefully handles network errors and broken links
  • Rich logging – Colored, timestamped logs with context chains
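
The crawl order can be pictured with a short sketch: a breadth-first walk over a link graph that stops at a depth limit, with a fixed delay standing in for the ~2 req/sec default. This is only an illustration over an in-memory graph with assumed names, not crawn's actual internals:

use std::collections::{HashMap, HashSet, VecDeque};
use std::thread::sleep;
use std::time::Duration;

// Breadth-first walk of a link graph with a depth limit and a fixed delay.
fn crawl_bfs(links: &HashMap<&str, Vec<&str>>, start: &str, max_depth: usize) {
    let mut seen: HashSet<&str> = HashSet::from([start]);
    let mut queue: VecDeque<(&str, usize)> = VecDeque::from([(start, 0)]);

    while let Some((url, depth)) = queue.pop_front() {
        sleep(Duration::from_millis(500)); // ~2 "requests" per second
        println!("visited {url} at depth {depth}");

        if depth == max_depth {
            continue; // don't enqueue links past the depth limit
        }
        if let Some(children) = links.get(url) {
            for &next in children {
                if seen.insert(next) {
                    queue.push_back((next, depth + 1));
                }
            }
        }
    }
}

fn main() {
    // Stand-in for links discovered during a real crawl.
    let links = HashMap::from([
        ("https://example.com", vec!["https://example.com/about"]),
        ("https://example.com/about", vec!["https://example.com/team"]),
    ]);
    crawl_bfs(&links, "https://example.com", 2);
}

The real crawler fetches pages asynchronously with tokio; the fixed delay above only mimics the pacing.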

Installation

  • Install from crates.io (requires cargo):
cargo install crawn
  • Or build from source:
git clone https://github.com/Tahaa-Dev/crawn.git
cd crawn
cargo build --release

Usage

  • Basic Crawling:
crawn -o output.ndjson https://example.com
  • With Logging:
crawn -o output.ndjson -l crawler.log https://example.com
  • Verbose Mode (Log All Requests):
crawn -o output.ndjson -v https://example.com
  • Custom Depth Limit:
crawn -o output.ndjson -m 3 https://example.com
  • Full HTML:
crawn -o output.ndjson --include-content https://example.com
  • Extracted Text Only:
crawn -o output.ndjson --include-text https://example.com
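
If you want to drive crawn from another program rather than a shell, one option is to spawn the CLI and wait for it to finish. This is a minimal sketch using std::process::Command with the flags shown above; the output path and URL are placeholders:

use std::process::Command;

fn main() -> std::io::Result<()> {
    // Spawn the crawn CLI with an output file, a depth limit, and a start URL.
    let status = Command::new("crawn")
        .args(["-o", "output.ndjson", "-m", "3", "https://example.com"])
        .status()?;
    if !status.success() {
        eprintln!("crawn exited with {status}");
    }
    Ok(())
}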

Output Format

Results are written as NDJSON (newline-delimited JSON):

{"URL": "https://example.com", "Title": "Example Domain", "Links": 12}
{"URL": "https://example.com/about", "Title": "About Us", "Links": 9}
{"URL": "https://example.com/contact", "Title": "Contact", "Links": 48}
  • With --include-text:
{"URL": "https://example.com", "Title": "Example Domain", "Links": 27, "Text": "Example Domain\nThis domain is..."}
  • With --include-content:
{"URL": "https://example.com", "Title": "Example Domain", "Links": 30, "Content": "<!DOCTYPE html>\n<html>..."}

Logging

Log Levels:

  • INFO (verbose mode only): Request logs
  • WARN (always): Recoverable errors (404, network timeouts)
  • FATAL (always): Unrecoverable errors (invalid URL, disk full)

Log Format:

2026-01-24 02:37:40.351 [INFO]:
Sent request to URL: https://example.com

2026-01-24 02:37:41.123 [WARN]:
Failed to fetch URL: https://example.com/broken-link
Cause: HTTP 404 Not Found
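
Since each entry starts with a timestamp-and-level line followed by its message lines, the log can be filtered with a simple line scan. A rough sketch that keeps only WARN and FATAL entries; the file name is just the -l example above:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("crawler.log")?);
    let mut keep = false;
    for line in reader.lines() {
        let line = line?;
        // A level tag marks the start of a new entry; remember whether to keep it.
        if line.contains("[WARN]") || line.contains("[FATAL]") {
            keep = true;
        } else if line.contains("[INFO]") {
            keep = false;
        }
        if keep && !line.trim().is_empty() {
            println!("{line}");
        }
    }
    Ok(())
}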

Examples

  • Crawl Documentation Site:
crawn -o rust-docs.ndjson https://doc.rust-lang.org/book/
  • Crawl with Logging:
crawn -o output.ndjson -l crawler.log -v https://example.com
  • Limit to 2 Levels Deep:
crawn -o shallow.ndjson -m 2 https://example.com

Limitations

  • Same-domain only (no external links by design)
  • No JavaScript rendering (static HTML only)
  • No authentication (public pages only)
