Crate justext

Paragraph-level boilerplate removal for HTML.

justext classifies each HTML paragraph as content or boilerplate using stopword density, link density, and text length, then refines those classifications using the labels of neighboring paragraphs.
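
The two-stage idea described above can be illustrated with a small standalone sketch. This is not the crate's implementation: `Label`, `Para`, `stopword_density`, `classify`, and `refine` are hypothetical names invented for this example, the thresholds are illustrative placeholders rather than the crate's `Config` defaults, and the neighbor rule is deliberately simplified.

```rust
use std::collections::HashSet;

// Hypothetical labels for this sketch; the crate's `ClassType` may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Label {
    Good,
    NearGood,
    Short,
    Bad,
}

// Hypothetical input: paragraph text plus how many of its characters sit inside links.
pub struct Para<'a> {
    pub text: &'a str,
    pub chars_in_links: usize,
}

// Fraction of whitespace-separated words that appear in the stoplist.
pub fn stopword_density(text: &str, stoplist: &HashSet<&str>) -> f64 {
    let mut total = 0usize;
    let mut hits = 0usize;
    for raw in text.split_whitespace() {
        let word = raw.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase();
        total += 1;
        if stoplist.contains(word.as_str()) {
            hits += 1;
        }
    }
    if total == 0 { 0.0 } else { hits as f64 / total as f64 }
}

// Stage 1: label a paragraph from its own features only.
// All thresholds below are illustrative placeholders (assumptions).
pub fn classify(p: &Para, stoplist: &HashSet<&str>) -> Label {
    let len = p.text.chars().count();
    if len == 0 {
        return Label::Bad;
    }
    let link_density = p.chars_in_links as f64 / len as f64;
    let sw = stopword_density(p.text, stoplist);
    if link_density > 0.2 {
        Label::Bad
    } else if len < 70 {
        if p.chars_in_links > 0 { Label::Bad } else { Label::Short }
    } else if sw >= 0.32 {
        if len > 200 { Label::Good } else { Label::NearGood }
    } else if sw >= 0.30 {
        Label::NearGood
    } else {
        Label::Bad
    }
}

fn is_decided(label: Label) -> bool {
    matches!(label, Label::Good | Label::Bad)
}

// Stage 2 (simplified): resolve Short and NearGood labels from the nearest
// Good/Bad neighbor on each side. A missing neighbor counts as Bad.
pub fn refine(labels: &[Label]) -> Vec<Label> {
    labels
        .iter()
        .enumerate()
        .map(|(i, &label)| {
            if is_decided(label) {
                return label;
            }
            let prev = labels[..i].iter().rev().find(|l| is_decided(**l)).copied().unwrap_or(Label::Bad);
            let next = labels[i + 1..].iter().find(|l| is_decided(**l)).copied().unwrap_or(Label::Bad);
            match (prev, next) {
                (Label::Good, Label::Good) => Label::Good,
                (Label::Bad, Label::Bad) => Label::Bad,
                // Mixed context: keep near-good text, drop short text.
                _ => if label == Label::NearGood { Label::Good } else { Label::Bad },
            }
        })
        .collect()
}
```

In this sketch, a short heading sandwiched between two long, stopword-rich paragraphs ends up `Good`, while a short line surrounded by link-heavy navigation ends up `Bad`. That is the intuition behind refining labels with neighbor context.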

§Quick start

use justext::{extract_text_lang, Config};

let html = "<html><body><p>This is the main content.</p></body></html>";
let text = extract_text_lang(html, "English", &Config::default()).unwrap();
println!("{text}");

§Related crates

  • trafilatura — full-featured web content extraction with metadata, comments, and fallback strategies.
  • libreadability — Mozilla Readability port for extracting a clean article DOM subtree.
  • html2markdown — converts HTML to Markdown via an intermediate AST.

Re-exports§

pub use stoplists::available_languages;
pub use stoplists::get_all_stoplists;
pub use stoplists::get_stoplist;

Modules§

stoplists

Structs§

Config
Configuration for the JusText algorithm.
Paragraph
A classified text paragraph extracted from HTML.

Enums§

ClassType
Classification label for a paragraph.
JustextError

Functions§

extract_text
Convenience: extract only the good paragraph text.
extract_text_lang
Extract only the good paragraph text using a language name.
justext
Classify paragraphs in HTML as content or boilerplate.
justext_lang
Classify paragraphs using a language name instead of a pre-loaded stoplist.