Readability article extraction library.
libreadability extracts the main article content from web pages by analyzing
DOM structure, scoring content density, and removing boilerplate. It is a
Rust port of readability by readeck,
itself a Go port of Mozilla’s Readability.js.
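To give a feel for what "scoring content density" can mean, here is a minimal, self-contained sketch. It is not this crate's code: the `Block` type and `density_score` function are hypothetical names, and the weights are loosely modeled on the comma, length, and link-density heuristics known from Readability.js. Treat it as an illustration under those assumptions only.

```rust
/// Hypothetical, simplified view of one block of page text.
/// (Not a type from libreadability.)
struct Block<'a> {
    text: &'a str,
    /// Number of characters of `text` that sit inside links.
    link_chars: usize,
}

/// Toy density score: longer, comma-rich text scores higher,
/// and link-heavy text is scaled down toward zero.
fn density_score(block: &Block) -> f64 {
    let len = block.text.chars().count();
    if len == 0 {
        return 0.0;
    }
    let commas = block.text.matches(',').count();
    // One base point, one point per comma, up to three points for length.
    let raw = 1.0 + commas as f64 + (len / 100).min(3) as f64;
    // Fraction of the text that is link text, capped at 1.0.
    let link_density = (block.link_chars as f64 / len as f64).min(1.0);
    raw * (1.0 - link_density)
}

fn main() {
    // A navigation bar: short, and entirely made of links.
    let nav = Block { text: "Home About Contact", link_chars: 18 };
    // An article paragraph: longer prose with commas and no links.
    let body = Block {
        text: "The readability algorithm scores content density, identifies the primary article, and strips navigation, ads, and other boilerplate.",
        link_chars: 0,
    };
    assert!(density_score(&body) > density_score(&nav));
    println!(
        "nav = {:.1}, body = {:.1}",
        density_score(&nav),
        density_score(&body)
    );
}
```

In this toy model, a block made entirely of links scores zero regardless of length, which is why navigation and sidebars tend to lose out to article prose.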
§Quick start
use libreadability::extract;
let html = r#"<html><body>
<nav>Navigation links</nav>
<article><p>This is the main article body with enough text to be extracted.</p>
<p>The readability algorithm scores content density and identifies the
primary article content, stripping navigation, ads, and other boilerplate.</p></article>
<aside>Sidebar content</aside>
</body></html>"#;
let article = extract(html, None).expect("valid HTML");
assert!(!article.content.is_empty());
assert!(!article.text_content.is_empty());
§Output
Article contains both cleaned HTML (content) and
plain text (text_content), plus metadata like
title, byline, excerpt, published time, and text direction.
§Configuration
For fine-grained control, use Parser directly:
use libreadability::Parser;
let mut parser = Parser::new()
.with_char_threshold(200)
.with_keep_classes(true);
let article = parser.parse("<html><body><article><p>Content</p></article></body></html>", None);
§Related crates
- trafilatura: full-featured web content extraction with metadata, comments, and fallback strategies.
- justext: paragraph-level boilerplate removal using stopword density.
- html2markdown: converts HTML to Markdown via an intermediate AST.
Structs§
- Article
- The extracted article content and metadata.
- Parser
- Port of Parser: the core readability extraction engine.
Enums§
Functions§
- extract
- Parse HTML and extract the main article content in one call.