4 releases · Uses Rust 2024 edition

| Version | Date |
|---|---|
| 0.1.3 | Nov 29, 2025 |
| 0.1.2 | Aug 9, 2025 |
| 0.1.1 | Jun 22, 2025 |
| 0.1.0 | Jun 11, 2025 |

#2028 in Development tools · 20MB · 782 lines

Contains (DOS exe, 20MB): bin/chromedriver/chromedriver.exe
Web Parser + Search
This web page parser library provides asynchronous fetching and extraction of data from web pages in multiple formats.
- Asynchronous web search using the Google, Bing, Duck, Ecosia, Yahoo, or Wiki engines, with domain blacklisting (feature `search`).
- Custom search engines via the `SearchEngine` trait (feature `search`).
- Reading an HTML document from a URL with a randomized user agent (`User::random()`).
- Selecting elements by CSS selectors and retrieving their attributes and content.
- Fetching the full page as plain text.
- Fetching and parsing page content as JSON with serde_json support.
This tool is well-suited for web scraping and data extraction tasks, offering flexible parsing of HTML, plain text, and JSON to enable comprehensive data gathering from various web sources.
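To try the library, add it to a project's manifest. This is a minimal sketch: the crate name is assumed from the `use web_parser::prelude::*` line in the examples below, and the version is assumed from the release table above (check crates.io for the exact name and latest release).

```toml
[dependencies]
# crate name and version assumed from this README
web_parser = { version = "0.1", features = ["search"] }  # "search" enables the web-search API
tokio = { version = "1", features = ["full"] }           # async runtime used by the examples
serde_json = "1"                                         # needed for Document::json
```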
Examples:
Web Search (feature: 'search'):
Requires the chromedriver tool to be installed!
```rust
use web_parser::prelude::*;
use macron::path;

#[tokio::main]
async fn main() -> Result<()> {
    // WEB SEARCH:
    let chrome_path = path!("bin/chromedriver/chromedriver.exe");
    let session_path = path!("%/ChromeDriver/WebSearch");

    // start search engine:
    let mut engine = SearchEngine::<Duck>::new(
        chrome_path,
        Some(session_path),
        false,
    ).await?;

    println!("Searching results..");

    // send search query:
    let results = engine.search(
        "Rust (programming language)",          // query
        &["support.google.com", "youtube.com"], // domain black list
        1000,                                   // sleep in millis
    ).await;

    // handle search results:
    match results {
        Ok(cites) => {
            println!("Result cites list: {:#?}", cites.get_urls());
            /*
            println!("Reading result pages..");
            let contents = cites.read(
                5,   // cites count to read
                &[   // tag name black list
                    "header", "footer", "style", "script", "noscript",
                    "iframe", "button", "img", "svg"
                ]
            ).await?;
            println!("Results: {contents:#?}");
            */
        }
        Err(e) => eprintln!("Search error: {e}"),
    }

    // stop search engine:
    engine.stop().await?;
    Ok(())
}
```
Web Parsing:
```rust
use web_parser::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    // READ PAGE AS HTML DOCUMENT:
    // read website page:
    let mut doc = Document::read("https://example.com/", User::random()).await?;

    // select title:
    let title = doc.select("h1")?.expect("No elements found");
    println!("Title: '{}'", title.text());

    // select descriptions:
    let mut descrs = doc.select_all("p")?.expect("No elements found");
    while let Some(descr) = descrs.next() {
        println!("Description: '{}'", descr.text())
    }

    // READ PAGE AS PLAIN TEXT:
    let text: String = Document::text("https://example.com/", User::random()).await?;
    println!("Text: {text}");

    // READ PAGE AS JSON:
    let json: serde_json::Value = Document::json("https://example.com/", User::random())
        .await?
        .expect("Failed to parse JSON");
    println!("Json: {json}");

    Ok(())
}
```
Licensing:
Distributed under the MIT license.
Feedback:
You can find me here; also see my channel. I welcome your suggestions and feedback!
Copyright (c) 2025 Bulat Sh. (fuderis)
Dependencies: ~8–16MB, ~263K SLoC