Crate libreadability

Readability article extraction library.

libreadability extracts the main article content from web pages by analyzing DOM structure, scoring content density, and removing boilerplate. It is a Rust port of readability by readeck, itself a Go port of Mozilla’s Readability.js.
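To make "scoring content density" concrete, here is a toy sketch of the general idea. It is not libreadability's actual scoring, which works on a parsed DOM and is not described on this page. The helper names `text_density` and `densest` are hypothetical, and the metric (visible characters divided by total characters) is an assumption chosen for simplicity.

```rust
/// Toy metric (hypothetical, not the crate's API): the fraction of a
/// fragment's characters that are visible, non-whitespace text rather
/// than markup. Tags are detected naively by tracking '<' and '>'.
fn text_density(fragment: &str) -> f64 {
    let mut in_tag = false;
    let mut text_chars = 0usize;
    let mut total_chars = 0usize;
    for c in fragment.chars() {
        total_chars += 1;
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            // Count only characters outside tags that are not whitespace.
            _ if !in_tag && !c.is_whitespace() => text_chars += 1,
            _ => {}
        }
    }
    if total_chars == 0 {
        0.0
    } else {
        text_chars as f64 / total_chars as f64
    }
}

/// Returns the fragment with the highest toy density, or None if the
/// slice is empty.
fn densest<'a>(fragments: &[&'a str]) -> Option<&'a str> {
    fragments
        .iter()
        .copied()
        .max_by(|a, b| text_density(a).total_cmp(&text_density(b)))
}
```

Under this metric a link-heavy `<nav>` fragment scores lower than a paragraph of prose, which is the intuition behind treating navigation as boilerplate. A real extractor would also need to account for link density, element types, and nesting.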

§Quick start

use libreadability::extract;

let html = r#"<html><body>
  <nav>Navigation links</nav>
  <article><p>This is the main article body with enough text to be extracted.</p>
  <p>The readability algorithm scores content density and identifies the
  primary article content, stripping navigation, ads, and other boilerplate.</p></article>
  <aside>Sidebar content</aside>
</body></html>"#;

let article = extract(html, None).expect("valid HTML");
assert!(!article.content.is_empty());
assert!(!article.text_content.is_empty());

§Output

An Article holds both the cleaned HTML (content) and the plain text (text_content), plus metadata such as title, byline, excerpt, published time, and text direction.

§Configuration

For fine-grained control, use Parser directly:

use libreadability::Parser;

let mut parser = Parser::new()
    .with_char_threshold(200)
    .with_keep_classes(true);
let article = parser.parse("<html><body><article><p>Content</p></article></body></html>", None);

§Related projects

  • trafilatura: full-featured web content extraction with metadata, comments, and fallback strategies.
  • justext: paragraph-level boilerplate removal using stopword density.
  • html2markdown: converts HTML to Markdown via an intermediate AST.

Structs§

Article
The extracted article content and metadata.
Parser
Port of the Go library's Parser, the core readability extraction engine.

Enums§

Error

Functions§

extract
Parse HTML and extract the main article content in one call.