# Rust Web Crawler
A concurrent web crawler implemented in Rust that demonstrates three approaches to web crawling, each with a different level of concurrency and synchronization.
## Features
- Multiple Crawling Strategies: Three different implementations showcasing various concurrency patterns
- Thread-Safe Design: Proper synchronization using mutexes and channels
- Extensible Architecture: Trait-based fetcher interface for easy testing and extension
- Comprehensive Testing: Full test suite covering edge cases and error scenarios
- Performance Optimized: Concurrent implementations for improved crawling speed
## Project Structure
```
rust_web_crawler/
├── src/
│   └── main.rs     # Main implementation with three crawler variants
├── Cargo.toml      # Project dependencies and metadata
├── Cargo.lock      # Locked dependency versions
└── README.md       # This file
```
## Implementation Approaches
### 1. Serial Crawler

- Function: `serial_crawler()` in `src/main.rs`
- Approach: Sequential, depth-first traversal (sketched below)
- Use Case: Simple crawling, debugging, or when traversal order matters
- Performance: Slower, but predictable and easy to understand
### 2. Concurrent Mutex Crawler

- Function: `concurrent_mutex_crawler()` in `src/main.rs`
- Approach: Multi-threaded, with shared state protected by a mutex (sketched below)
- Use Case: When you need fine-grained control over shared state
- Performance: Good concurrency with thread safety
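In the same illustrative spirit, the mutex variant shares the visited set behind an `Arc<Mutex<..>>` and spawns a thread per discovered link:

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative only: the mutex guards the visited set; the guard is
// released before the (slow) fetch, so threads rarely contend.
fn mutex_crawl(url: String, fetcher: Arc<dyn Fetcher>, visited: Arc<Mutex<HashSet<String>>>) {
    if !visited.lock().unwrap().insert(url.clone()) {
        return; // another thread already claimed this URL
    }
    if let Ok(links) = fetcher.fetch(&url) {
        let handles: Vec<_> = links
            .into_iter()
            .map(|link| {
                let (f, v) = (Arc::clone(&fetcher), Arc::clone(&visited));
                thread::spawn(move || mutex_crawl(link, f, v))
            })
            .collect();
        for handle in handles {
            handle.join().unwrap(); // wait for the whole subtree before returning
        }
    }
}
```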
### 3. Concurrent Channel Crawler

- Function: `concurrent_channel_crawler()` in `src/main.rs`
- Approach: Worker-pool pattern using channels for communication (sketched below)
- Use Case: High-performance crawling with better resource management
- Performance: Strong concurrency with reduced lock contention
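One hypothetical shape for the worker-pool variant: a fixed number of workers pull URLs from a job channel and report discovered links back over a results channel, so only the coordinator touches the visited set:

```rust
use std::collections::HashSet;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

const NUM_WORKERS: usize = 4; // illustrative pool size

fn channel_crawl(start: String, fetcher: Arc<dyn Fetcher>) -> HashSet<String> {
    let (job_tx, job_rx) = mpsc::channel::<String>();
    let (res_tx, res_rx) = mpsc::channel::<Vec<String>>();
    let job_rx = Arc::new(Mutex::new(job_rx)); // one receiver shared by all workers

    for _ in 0..NUM_WORKERS {
        let (job_rx, res_tx, fetcher) =
            (Arc::clone(&job_rx), res_tx.clone(), Arc::clone(&fetcher));
        thread::spawn(move || loop {
            // Take the next job; exit once the job channel is closed.
            let url = match job_rx.lock().unwrap().recv() {
                Ok(url) => url,
                Err(_) => break,
            };
            // Treat a fetch error as "no links found" for simplicity.
            let links = fetcher.fetch(&url).unwrap_or_default();
            res_tx.send(links).unwrap();
        });
    }

    let mut visited = HashSet::new();
    let mut pending = 1usize;
    visited.insert(start.clone());
    job_tx.send(start).unwrap();

    // Drain results until no fetches are outstanding.
    while pending > 0 {
        pending -= 1;
        for link in res_rx.recv().unwrap() {
            if visited.insert(link.clone()) {
                pending += 1;
                job_tx.send(link).unwrap();
            }
        }
    }
    drop(job_tx); // closing the job channel lets the workers exit
    visited
}
```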
## Architecture

### Core Components

#### Fetcher Trait

```rust
trait Fetcher: Send + Sync + 'static {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String>;
}
```
- Defines the interface for fetching URLs
- Returns a list of discovered URLs or an error
- Thread-safe and can be shared across threads
#### FakeFetcher

- Test implementation that simulates web crawling (see the sketch below)
- Pre-defined responses for consistent testing
- Simulates network latency for realistic performance testing
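A shape like the following would satisfy that description; it is an illustration, not the crate's exact code:

```rust
use std::collections::HashMap;
use std::thread;
use std::time::Duration;

// Illustrative fake fetcher: a map from URL to the links that page
// "contains", plus a short sleep to mimic network latency.
struct FakeFetcher {
    pages: HashMap<String, Vec<String>>,
}

impl Fetcher for FakeFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        thread::sleep(Duration::from_millis(10)); // simulated latency
        self.pages
            .get(url)
            .cloned()
            .ok_or_else(|| format!("not found: {url}"))
    }
}
```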
## Getting Started
### Prerequisites

- Rust 1.85+ (the 2024 edition requires at least Rust 1.85)
- Cargo package manager
Installation
-
Clone the repository
git clone https://github.com/nabil-Tounarti/rust-web-crawler.git cd rust-web-crawler -
Build the project
cargo build -
Run the crawler
cargo run
### Running Tests

```bash
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run a specific test
cargo test test_serial_crawler_happy_path
```
## Performance Comparison
The project demonstrates three different crawling approaches:
| Approach | Concurrency | Thread Safety | Performance | Complexity |
|---|---|---|---|---|
| Serial | None | N/A | Slow | Low |
| Mutex | High | Mutex-protected | Medium | Medium |
| Channel | High | Channel-based | Fast | High |
## Testing
The project includes comprehensive tests covering:
- Happy Path: Basic crawling functionality
- Edge Cases: Non-existent URLs, fetch errors
- Concurrency: Thread safety and race condition prevention
- Error Handling: Proper error propagation and handling
### Test Categories

1. Basic Functionality Tests
   - `test_serial_crawler_happy_path`
   - `test_concurrent_mutex_crawler_happy_path`
   - `test_concurrent_channel_crawler_happy_path`

2. Error Handling Tests
   - `test_start_with_nonexistent_url`
   - `test_crawler_with_fetch_error`

3. Edge Case Tests
   - `test_single_url_no_links`
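As an illustration only, reusing the hypothetical `FakeFetcher` and `serial_crawl` sketches from earlier sections, a happy-path test might take this shape:

```rust
use std::collections::{HashMap, HashSet};

#[test]
fn serial_crawler_happy_path_sketch() {
    // Seed the fake fetcher with a tiny link graph containing a cycle.
    let mut pages = HashMap::new();
    pages.insert("a".to_string(), vec!["b".to_string(), "c".to_string()]);
    pages.insert("b".to_string(), vec![]);
    pages.insert("c".to_string(), vec!["a".to_string()]); // cycle back to the root
    let fetcher = FakeFetcher { pages };

    let mut visited = HashSet::new();
    serial_crawl("a", &fetcher, &mut visited);

    // Every reachable page is visited exactly once, despite the cycle.
    assert_eq!(visited.len(), 3);
}
```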
## Extending the Project

### Adding a Real HTTP Fetcher

To turn this into a real web crawler, implement `Fetcher` over HTTP. A minimal sketch using the blocking `reqwest` client and the `scraper` crate (neither is currently a dependency, so both would need to be added to Cargo.toml):

```rust
use reqwest::blocking::Client;
use scraper::{Html, Selector};

struct HttpFetcher {
    client: Client,
}

impl Fetcher for HttpFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        // Fetch the page body, mapping any transport error to a String.
        let body = self.client.get(url).send()
            .and_then(|resp| resp.text())
            .map_err(|e| e.to_string())?;
        // Parse the HTML and collect every href attribute as a discovered URL.
        let document = Html::parse_document(&body);
        let selector = Selector::parse("a[href]").unwrap(); // static selector, cannot fail
        Ok(document.select(&selector)
            .filter_map(|a| a.value().attr("href"))
            .map(String::from)
            .collect())
    }
}
```
### Adding Configuration

Extend the project with configuration options:

```rust
use std::time::Duration;

struct CrawlerConfig {
    max_depth: usize,
    max_concurrent_requests: usize,
    request_timeout: Duration,
    user_agent: String,
}
```
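A `Default` implementation keeps call sites simple; the values below are purely illustrative:

```rust
impl Default for CrawlerConfig {
    fn default() -> Self {
        CrawlerConfig {
            max_depth: 3, // illustrative defaults, tune for your workload
            max_concurrent_requests: 8,
            request_timeout: Duration::from_secs(10),
            user_agent: "rust-web-crawler/0.1".to_string(),
        }
    }
}
```

Callers can then override only what they need, e.g. `CrawlerConfig { max_depth: 5, ..Default::default() }`.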
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Learning Objectives
This project serves as an excellent learning resource for:
- Rust Concurrency: Understanding threads, mutexes, and channels
- Trait-based Design: Building extensible and testable code
- Error Handling: Proper error propagation in Rust
- Testing: Comprehensive test coverage and mocking
- Performance Optimization: Comparing different concurrency approaches
## Dependencies
- tokio: async runtime for high-performance networking (currently unused, but in place for a future HTTP implementation)
- Standard Library: Threading, collections, and synchronization primitives
## Future Enhancements
- Real HTTP fetcher implementation
- Rate limiting and politeness controls
- URL filtering and domain restrictions
- Persistent storage for crawled data
- Configuration file support
- Metrics and monitoring
- Distributed crawling capabilities
Note: This is currently a demonstration project using a fake fetcher. For production use, implement a real HTTP fetcher with proper error handling and rate limiting.