Crate libreadability

Readability article extraction library.

libreadability extracts the main article content from web pages by analyzing DOM structure, scoring content density, and removing boilerplate. It is a Rust port of readability by readeck, itself a Go port of Mozilla’s Readability.js.
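
To make "scoring content density" concrete, here is a minimal, self-contained sketch of the kind of heuristic a Readability-style extractor uses. It is an illustration only, not libreadability's actual code: the `Block` type, its fields, and the weights in `score` are all hypothetical, loosely modeled on the comma-count, text-length, and link-density signals commonly described for Readability.js.

```rust
/// Hypothetical, simplified view of one DOM block. Not a libreadability type.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    tag: &'static str,
    /// Characters of visible text in the block.
    text_len: usize,
    /// Characters of that text that sit inside links.
    link_text_len: usize,
    /// Number of commas in the text, a rough proxy for prose.
    commas: usize,
}

/// Fraction of the block's text that is link text, in 0.0..=1.0.
fn link_density(b: &Block) -> f64 {
    if b.text_len == 0 {
        return 0.0;
    }
    b.link_text_len as f64 / b.text_len as f64
}

/// Toy content score (assumed weights): one base point, one point per comma,
/// up to three points for length, all scaled down by link density.
fn score(b: &Block) -> f64 {
    let base = 1.0 + b.commas as f64 + (b.text_len / 100).min(3) as f64;
    base * (1.0 - link_density(b))
}

/// Pick the highest-scoring block, or `None` for an empty slice.
fn best_block(blocks: &[Block]) -> Option<&Block> {
    blocks.iter().max_by(|a, b| score(a).total_cmp(&score(b)))
}
```

Under these assumed weights, a navigation block made up entirely of link text scores zero, while a long paragraph with little link text scores highest, so `best_block` would select the article body over a `nav` or `aside` block.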

§Quick start

use libreadability::extract;

let html = r#"<html><body>
  <nav>Navigation links</nav>
  <article><p>This is the main article body with enough text to be extracted.</p>
  <p>The readability algorithm scores content density and identifies the
  primary article content, stripping navigation, ads, and other boilerplate.</p></article>
  <aside>Sidebar content</aside>
</body></html>"#;

let article = extract(html, None).expect("valid HTML");
assert!(!article.content.is_empty());
assert!(!article.text_content.is_empty());

§Output

Article contains both cleaned HTML (content) and plain text (text_content), plus metadata like title, byline, excerpt, published time, and text direction.

§Configuration

For fine-grained control, use Parser directly:

use libreadability::Parser;

let mut parser = Parser::new()
    .with_char_threshold(200)
    .with_keep_classes(true);
let article = parser.parse("<html><body><article><p>Content</p></article></body></html>", None);
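
As a rough intuition for `with_char_threshold`: in Mozilla's Readability.js, the corresponding option is the minimum number of characters an article must contain before a result is returned. Assuming libreadability follows the same semantics (this page does not confirm it), the check can be sketched with the standard library alone. Both helper functions below are hypothetical and are not part of the crate.

```rust
/// Hypothetical helper, not part of libreadability: a candidate passes when
/// its trimmed text has at least `char_threshold` characters.
fn passes_char_threshold(text: &str, char_threshold: usize) -> bool {
    // Count Unicode scalar values rather than bytes so non-ASCII text
    // is not over-counted.
    text.trim().chars().count() >= char_threshold
}

/// Return the first candidate, in order, that meets the threshold.
fn first_acceptable<'a>(candidates: &[&'a str], char_threshold: usize) -> Option<&'a str> {
    candidates
        .iter()
        .copied()
        .find(|c| passes_char_threshold(c, char_threshold))
}
```

Under this assumption, a one-word body such as `Content` would fall short of a threshold of 200, so callers should be prepared for extraction of very short documents to produce no usable article.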

§See also

trafilatura: full-featured web content extraction with metadata, comments, and fallback strategies.
justext: paragraph-level boilerplate removal using stopword density.
html2markdown: converts HTML to Markdown via an intermediate AST.

Structs§

Article: The extracted article content and metadata.
Parser: The core readability extraction engine, a port of the upstream Parser type.

Enums§

Error

Functions§

extract: Parse HTML and extract the main article content in one call.