# Rust Web Crawler

A high-performance, concurrent web crawler implemented in Rust that demonstrates different approaches to web crawling with varying levels of concurrency and synchronization.
## Features

- **Multiple Crawling Strategies**: Three implementations showcasing different concurrency patterns
- **Thread-Safe Design**: Proper synchronization using mutexes and channels
- **Extensible Architecture**: Trait-based fetcher interface for easy testing and extension
- **Comprehensive Testing**: Full test suite covering edge cases and error scenarios
- **Performance Optimized**: Concurrent implementations for improved crawling speed
## Project Structure

```
rust_web_crawler/
├── src/
│   └── main.rs     # Main implementation with three crawler variants
├── Cargo.toml      # Project dependencies and metadata
├── Cargo.lock      # Locked dependency versions
└── README.md       # This file
```
## Implementation Approaches

### 1. Serial Crawler

- **Function**: `serial_crawler()`
- **Approach**: Sequential, depth-first traversal
- **Use Case**: Simple crawling, debugging, or when order matters
- **Performance**: Slower, but predictable and easy to understand
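The serial approach can be sketched as follows. The `Fetcher` trait matches the one described in the Architecture section, but the function body and the in-memory `MapFetcher` are illustrative assumptions, not the project's actual code:

```rust
use std::collections::{HashMap, HashSet};

// The Fetcher interface described in the Architecture section.
trait Fetcher: Send + Sync + 'static {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String>;
}

// Illustrative in-memory fetcher standing in for the project's FakeFetcher.
struct MapFetcher {
    pages: HashMap<String, Vec<String>>,
}

impl Fetcher for MapFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        self.pages
            .get(url)
            .cloned()
            .ok_or_else(|| format!("not found: {url}"))
    }
}

// Sequential depth-first traversal: mark each URL as visited,
// then recurse into every link it yields.
fn serial_crawler(url: &str, fetcher: &dyn Fetcher, visited: &mut HashSet<String>) {
    if !visited.insert(url.to_string()) {
        return; // already crawled
    }
    if let Ok(links) = fetcher.fetch(url) {
        for link in links {
            serial_crawler(&link, fetcher, visited);
        }
    }
}

fn main() {
    let fetcher = MapFetcher {
        pages: HashMap::from([
            ("a".into(), vec!["b".into(), "c".into()]),
            ("b".into(), vec!["a".into()]),
        ]),
    };
    let mut visited = HashSet::new();
    serial_crawler("a", &fetcher, &mut visited);
    // "c" fails to fetch but still counts as visited.
    assert_eq!(visited.len(), 3);
    println!("visited {} urls", visited.len());
}
```

Because a single call stack drives the traversal, the visit order is deterministic, which is what makes this variant useful for debugging.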
### 2. Concurrent Mutex Crawler

- **Function**: `concurrent_mutex_crawler()`
- **Approach**: Multi-threaded with shared state protected by a mutex
- **Use Case**: When you need fine-grained control over shared state
- **Performance**: Good concurrency with thread safety
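A minimal sketch of the mutex-based variant, assuming a thread-per-link model with the visited set shared behind `Arc<Mutex<...>>`; the `MapFetcher` type and the function body are illustrative, not the project's code:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex};
use std::thread;

trait Fetcher: Send + Sync + 'static {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String>;
}

// Illustrative in-memory fetcher.
struct MapFetcher(HashMap<String, Vec<String>>);

impl Fetcher for MapFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        self.0.get(url).cloned().ok_or_else(|| format!("not found: {url}"))
    }
}

// Spawn a thread per discovered URL; the visited set is shared behind a Mutex.
fn crawl(url: String, fetcher: Arc<dyn Fetcher>, visited: Arc<Mutex<HashSet<String>>>) {
    // Insert while holding the lock; bail out if another thread got here first.
    if !visited.lock().unwrap().insert(url.clone()) {
        return;
    }
    let links = match fetcher.fetch(&url) {
        Ok(links) => links,
        Err(_) => return,
    };
    let handles: Vec<_> = links
        .into_iter()
        .map(|link| {
            let f = Arc::clone(&fetcher);
            let v = Arc::clone(&visited);
            thread::spawn(move || crawl(link, f, v))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}

fn main() {
    let fetcher: Arc<dyn Fetcher> = Arc::new(MapFetcher(HashMap::from([
        ("a".into(), vec!["b".into(), "c".into()]),
        ("b".into(), vec!["a".into()]),
    ])));
    let visited = Arc::new(Mutex::new(HashSet::new()));
    crawl("a".into(), Arc::clone(&fetcher), Arc::clone(&visited));
    assert_eq!(visited.lock().unwrap().len(), 3);
    println!("visited {} urls", visited.lock().unwrap().len());
}
```

The key design point is that the lock is held only for the `insert` check, never across a fetch, so threads contend briefly on the set rather than serializing the network work.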
### 3. Concurrent Channel Crawler

- **Function**: `concurrent_channel_crawler()`
- **Approach**: Worker pool pattern using channels for communication
- **Use Case**: High-performance crawling with better resource management
- **Performance**: Optimal concurrency with reduced contention
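The channel pattern can be sketched as a coordinator loop that owns the visited set (so no lock is needed) while workers report results back over an `mpsc` channel. The `MapFetcher` helper and the function bodies are illustrative assumptions, not the project's code:

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{mpsc, Arc};
use std::thread;

trait Fetcher: Send + Sync + 'static {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String>;
}

// Illustrative in-memory fetcher.
struct MapFetcher(HashMap<String, Vec<String>>);

impl Fetcher for MapFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        self.0.get(url).cloned().ok_or_else(|| format!("not found: {url}"))
    }
}

// Worker: fetch one URL and send the discovered links back.
fn spawn_fetch(url: String, fetcher: &Arc<dyn Fetcher>, tx: &mpsc::Sender<Vec<String>>) {
    let fetcher = Arc::clone(fetcher);
    let tx = tx.clone();
    thread::spawn(move || {
        let links = fetcher.fetch(&url).unwrap_or_default();
        let _ = tx.send(links);
    });
}

// Coordinator: the only owner of `visited`, so no Mutex is required.
fn channel_crawler(start: &str, fetcher: Arc<dyn Fetcher>) -> HashSet<String> {
    let (tx, rx) = mpsc::channel::<Vec<String>>();
    let mut visited = HashSet::new();
    let mut in_flight = 0usize;

    visited.insert(start.to_string());
    spawn_fetch(start.to_string(), &fetcher, &tx);
    in_flight += 1;

    // Each worker sends exactly one message, so recv() never blocks forever.
    while in_flight > 0 {
        let links = rx.recv().expect("a worker holds a sender");
        in_flight -= 1;
        for link in links {
            if visited.insert(link.clone()) {
                spawn_fetch(link, &fetcher, &tx);
                in_flight += 1;
            }
        }
    }
    visited
}

fn main() {
    let fetcher: Arc<dyn Fetcher> = Arc::new(MapFetcher(HashMap::from([
        ("a".into(), vec!["b".into(), "c".into()]),
        ("b".into(), vec!["a".into()]),
    ])));
    let visited = channel_crawler("a", fetcher);
    assert_eq!(visited.len(), 3);
    println!("visited {} urls", visited.len());
}
```

Contention is reduced because the shared state never needs a lock; only the channel endpoints are shared, and `mpsc` handles that internally.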
## Architecture

### Core Components

#### Fetcher Trait

```rust
trait Fetcher: Send + Sync + 'static {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String>;
}
```

- Defines the interface for fetching URLs
- Returns a list of discovered URLs or an error
- Thread-safe and can be shared across threads
#### FakeFetcher

- Test implementation that simulates web crawling
- Pre-defined responses for consistent testing
- Simulates network latency for realistic performance testing
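A fetcher like this can be sketched as canned responses plus an artificial delay; the field names and latency mechanism here are assumptions for illustration, not the project's actual implementation:

```rust
use std::collections::HashMap;
use std::thread;
use std::time::Duration;

trait Fetcher: Send + Sync + 'static {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String>;
}

// Hypothetical test fetcher: pre-defined responses plus simulated latency.
struct FakeFetcher {
    responses: HashMap<String, Vec<String>>,
    latency: Duration,
}

impl Fetcher for FakeFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        thread::sleep(self.latency); // simulate a network round-trip
        self.responses
            .get(url)
            .cloned()
            .ok_or_else(|| format!("no such url: {url}"))
    }
}

fn main() {
    let fetcher = FakeFetcher {
        responses: HashMap::from([("root".to_string(), vec!["child".to_string()])]),
        latency: Duration::from_millis(1),
    };
    assert_eq!(fetcher.fetch("root"), Ok(vec!["child".to_string()]));
    assert!(fetcher.fetch("missing").is_err());
    println!("ok");
}
```

The simulated latency matters for the performance comparison: without it, the serial crawler would finish so quickly that the concurrent variants would show no advantage.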
## Getting Started

### Prerequisites

- Rust 1.85+ (edition 2024)
- Cargo package manager

### Installation

1. Clone the repository:

   ```shell
   git clone https://github.com/nabil-Tounarti/rust-web-crawler.git
   cd rust-web-crawler
   ```

2. Build the project:

   ```shell
   cargo build
   ```

3. Run the crawler:

   ```shell
   cargo run
   ```
### Running Tests

```shell
# Run all tests
cargo test

# Run tests with output
cargo test -- --nocapture

# Run a specific test
cargo test test_serial_crawler_happy_path
```
The project demonstrates three different crawling approaches:

| Approach | Concurrency | Thread Safety   | Performance | Complexity |
|----------|-------------|-----------------|-------------|------------|
| Serial   | None        | N/A             | Slow        | Low        |
| Mutex    | High        | Mutex-protected | Medium      | Medium     |
| Channel  | High        | Channel-based   | Fast        | High       |
## Testing

The project includes comprehensive tests covering:

- **Happy Path**: Basic crawling functionality
- **Edge Cases**: Non-existent URLs, fetch errors
- **Concurrency**: Thread safety and race condition prevention
- **Error Handling**: Proper error propagation and handling

### Test Categories

**Basic Functionality Tests**

- `test_serial_crawler_happy_path`
- `test_concurrent_mutex_crawler_happy_path`
- `test_concurrent_channel_crawler_happy_path`

**Error Handling Tests**

- `test_start_with_nonexistent_url`
- `test_crawler_with_fetch_error`

**Edge Case Tests**
## Extending the Project

### Adding a Real HTTP Fetcher

To implement a real web crawler, create a new fetcher. Because `fetch` is a synchronous method, a blocking client fits the trait as-is (reqwest's blocking API requires the `blocking` feature):

```rust
use reqwest;

struct HttpFetcher {
    client: reqwest::blocking::Client,
}

impl Fetcher for HttpFetcher {
    fn fetch(&self, url: &str) -> Result<Vec<String>, String> {
        // Fetch the page body.
        let body = self
            .client
            .get(url)
            .send()
            .and_then(|resp| resp.text())
            .map_err(|e| e.to_string())?;
        // Parse the HTML and extract links (e.g. with the `scraper` crate),
        // then return the discovered URLs.
        todo!("extract links from `body`")
    }
}
```
### Adding Configuration

Extend the project with configuration options:

```rust
use std::time::Duration;

struct CrawlerConfig {
    max_depth: usize,
    max_concurrent_requests: usize,
    request_timeout: Duration,
    user_agent: String,
}
```
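One way to make such a config ergonomic is a `Default` impl combined with struct-update syntax, so callers only override the fields they care about. The struct is repeated here so the snippet compiles on its own, and the default values are hypothetical:

```rust
use std::time::Duration;

struct CrawlerConfig {
    max_depth: usize,
    max_concurrent_requests: usize,
    request_timeout: Duration,
    user_agent: String,
}

impl Default for CrawlerConfig {
    fn default() -> Self {
        // Hypothetical defaults; tune for your workload.
        CrawlerConfig {
            max_depth: 3,
            max_concurrent_requests: 8,
            request_timeout: Duration::from_secs(10),
            user_agent: "rust-web-crawler/0.1".to_string(),
        }
    }
}

fn main() {
    // Override only max_depth; keep every other default.
    let cfg = CrawlerConfig { max_depth: 5, ..CrawlerConfig::default() };
    assert_eq!(cfg.max_depth, 5);
    assert_eq!(cfg.max_concurrent_requests, 8);
    println!("{} (depth {})", cfg.user_agent, cfg.max_depth);
}
```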
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## Learning Objectives

This project serves as an excellent learning resource for:

- **Rust Concurrency**: Understanding threads, mutexes, and channels
- **Trait-based Design**: Building extensible and testable code
- **Error Handling**: Proper error propagation in Rust
- **Testing**: Comprehensive test coverage and mocking
- **Performance Optimization**: Comparing different concurrency approaches
## Dependencies

- **tokio**: Async runtime for high-performance networking (currently unused, but ready for an HTTP implementation)
- **Standard Library**: Threading, collections, and synchronization primitives
## Future Enhancements

> **Note**: This is currently a demonstration project using a fake fetcher. For production use, implement a real HTTP fetcher with proper error handling and rate limiting.