Crate justext

Paragraph-level boilerplate removal for HTML.

justext classifies each HTML paragraph as content or boilerplate using stopword density, link density, and text length, then refines those classifications using the context of neighboring paragraphs.
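As a rough illustration of the first, context-free stage, the sketch below labels a single paragraph from the three signals named above. It does not use this crate's API: `Label`, `Thresholds`, `stopword_density`, and `classify` are hypothetical names, and the threshold values are assumed from the defaults of the original jusText Python implementation, so they may differ from what `Config::default()` uses here. The neighbor-context refinement is left out.

```rust
use std::collections::HashSet;

/// Context-free label for one paragraph.
/// Hypothetical type for illustration; not the crate's `ClassType`.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Label {
    Good,
    NearGood,
    Short,
    Bad,
}

/// Hypothetical thresholds. The values are assumptions taken from the
/// original jusText defaults, not read from this crate's `Config`.
struct Thresholds {
    length_low: usize,
    length_high: usize,
    stopwords_low: f64,
    stopwords_high: f64,
    max_link_density: f64,
}

impl Default for Thresholds {
    fn default() -> Self {
        Thresholds {
            length_low: 70,
            length_high: 200,
            stopwords_low: 0.30,
            stopwords_high: 0.32,
            max_link_density: 0.2,
        }
    }
}

/// Fraction of whitespace-separated words that appear in the stoplist
/// (compared in lowercase).
fn stopword_density(text: &str, stoplist: &HashSet<&str>) -> f64 {
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_lowercase()).collect();
    if words.is_empty() {
        return 0.0;
    }
    let hits = words.iter().filter(|w| stoplist.contains(w.as_str())).count();
    hits as f64 / words.len() as f64
}

/// Label one paragraph from its text and the number of characters that
/// sit inside links. Neighboring paragraphs are not consulted.
fn classify(
    text: &str,
    chars_in_links: usize,
    stoplist: &HashSet<&str>,
    t: &Thresholds,
) -> Label {
    let length = text.chars().count();
    let link_density = if length == 0 {
        0.0
    } else {
        chars_in_links as f64 / length as f64
    };
    let density = stopword_density(text, stoplist);

    if link_density > t.max_link_density {
        Label::Bad
    } else if length < t.length_low {
        // Short paragraphs are undecided unless they contain link text.
        if chars_in_links > 0 { Label::Bad } else { Label::Short }
    } else if density >= t.stopwords_high {
        if length > t.length_high { Label::Good } else { Label::NearGood }
    } else if density >= t.stopwords_low {
        Label::NearGood
    } else {
        Label::Bad
    }
}
```

In this sketch, paragraphs labeled `Short` or `NearGood` are the undecided cases that a later, context-sensitive pass would resolve by looking at the labels of the paragraphs around them.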

§Quick start

use justext::{extract_text_lang, Config};

let html = "<html><body><p>This is the main content.</p></body></html>";
let text = extract_text_lang(html, "English", &Config::default()).unwrap();
println!("{text}");

§Related crates

trafilatura: full-featured web content extraction with metadata, comments, and fallback strategies.
libreadability: Mozilla Readability port for extracting a clean article DOM subtree.
html2markdown: converts HTML to Markdown via an intermediate AST.

Re-exports§

pub use stoplists::available_languages;
pub use stoplists::get_all_stoplists;
pub use stoplists::get_stoplist;

Modules§

stoplists

Structs§

Config: Configuration for the JusText algorithm.
Paragraph: A classified text paragraph extracted from HTML.

Enums§

ClassType: Classification label for a paragraph.
JustextError

Functions§

extract_text: Convenience: extract only the good paragraph text.
extract_text_lang: Extract only the good paragraph text using a language name.
justext: Classify paragraphs in HTML as content or boilerplate.
justext_lang: Classify paragraphs using a language name instead of a pre-loaded stoplist.
