Paragraph-level boilerplate removal for HTML.
justext classifies each HTML paragraph as content or boilerplate using
stopword density, link density, and text length, then refines those
classifications using the labels of neighboring paragraphs.
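The two passes described above can be sketched in a few lines of Rust. This is a minimal, self-contained illustration and not the crate's implementation: the `Label` and `Para` types, the `classify` and `refine` functions, the threshold constants, and the neighbor rule are all hypothetical names and values chosen for the example.

```rust
use std::collections::HashSet;

// Illustrative thresholds (assumptions for this sketch, not the crate's defaults).
const LENGTH_LOW: usize = 70;
const LENGTH_HIGH: usize = 200;
const STOP_LOW: f64 = 0.30;
const STOP_HIGH: f64 = 0.32;
const MAX_LINK_DENSITY: f64 = 0.2;

/// Hypothetical labels for this sketch; not the crate's own types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Label {
    Good,
    NearGood,
    Short,
    Bad,
}

/// Hypothetical per-paragraph input.
struct Para<'a> {
    text: &'a str,
    /// Number of characters that sit inside link text.
    chars_in_links: usize,
}

/// Pass 1: label a paragraph from its own features only.
fn classify(p: &Para, stopwords: &HashSet<&str>) -> Label {
    let len = p.text.chars().count();
    let words: Vec<&str> = p.text.split_whitespace().collect();
    if words.is_empty() {
        return Label::Bad;
    }
    let link_density = p.chars_in_links as f64 / len as f64;
    // Count words that appear in the stoplist, ignoring case and punctuation.
    let hits = words
        .iter()
        .filter(|w| {
            let w = w.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
            stopwords.contains(w.as_str())
        })
        .count();
    let stop_density = hits as f64 / words.len() as f64;

    if link_density > MAX_LINK_DENSITY {
        Label::Bad
    } else if len < LENGTH_LOW {
        if p.chars_in_links > 0 { Label::Bad } else { Label::Short }
    } else if stop_density >= STOP_HIGH {
        if len > LENGTH_HIGH { Label::Good } else { Label::NearGood }
    } else if stop_density >= STOP_LOW {
        Label::NearGood
    } else {
        Label::Bad
    }
}

fn is_confident(l: Label) -> bool {
    matches!(l, Label::Good | Label::Bad)
}

/// Pass 2 (simplified): resolve Short and NearGood paragraphs from the
/// nearest confidently labeled neighbor on each side. A missing neighbor
/// counts as Bad.
fn refine(labels: &[Label]) -> Vec<Label> {
    labels
        .iter()
        .enumerate()
        .map(|(i, &label)| {
            if is_confident(label) {
                return label;
            }
            let prev = labels[..i].iter().rev().copied().find(|&l| is_confident(l));
            let next = labels[i + 1..].iter().copied().find(|&l| is_confident(l));
            match (prev.unwrap_or(Label::Bad), next.unwrap_or(Label::Bad), label) {
                (Label::Good, Label::Good, _) => Label::Good,
                (Label::Bad, Label::Bad, _) => Label::Bad,
                // Mixed neighbors: NearGood gets the benefit of the doubt.
                (_, _, Label::NearGood) => Label::Good,
                _ => Label::Bad,
            }
        })
        .collect()
}
```

In this sketch a navigation bar made entirely of link text is rejected on link density alone, while a short caption between two content paragraphs is kept because both of its neighbors are content.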
§Quick start
use justext::{extract_text_lang, Config};
let html = "<html><body><p>This is the main content.</p></body></html>";
let text = extract_text_lang(html, "English", &Config::default()).unwrap();
println!("{text}");
§Related crates
- trafilatura: full-featured web content extraction with metadata, comments, and fallback strategies.
- libreadability: Mozilla Readability port for extracting a clean article DOM subtree.
- html2markdown: converts HTML to Markdown via an intermediate AST.
Re-exports§
pub use stoplists::available_languages;
pub use stoplists::get_all_stoplists;
pub use stoplists::get_stoplist;
Modules§
Structs§
- Config
- Configuration for the JusText algorithm.
- Paragraph
- A classified text paragraph extracted from HTML.
Enums§
- ClassType
- Classification label for a paragraph.
- JustextError
Functions§
- extract_text
- Convenience: extract only the good paragraph text.
- extract_text_lang
- Extract only the good paragraph text using a language name.
- justext
- Classify paragraphs in HTML as content or boilerplate.
- justext_lang
- Classify paragraphs using a language name instead of a pre-loaded stoplist.