10 releases (4 breaking)

0.5.1	Mar 12, 2026
0.5.0	Mar 11, 2026
0.4.2	Mar 9, 2026
0.3.1	Mar 6, 2026
0.1.0	Mar 6, 2026

deformat

Extracts plain text from HTML, PDF, and other document formats. Operates on &str and &[u8] inputs, with no network I/O and no filesystem access except when pdf::extract_file reads a PDF from a path.

Supported formats

Format	Input	Feature flag	Extractor
HTML (tag strip)	`&str`	(none -- always available)	`html::strip_to_text`
HTML (layout-aware)	`&str`	`html2text`	`extract_html2text`
HTML (article)	`&str`	`readability`	`extract_readable`
PDF	`&Path` or `&[u8]`	`pdf`	`pdf::extract_file`, `pdf::extract_bytes`
Plain text / Markdown	`&str`	(none)	passthrough

The default build depends only on memchr.

Install

cargo add deformat                                        # minimal
cargo add deformat --features readability,html2text,pdf   # all extractors

[dependencies]
deformat = { version = "0.5.1", features = ["readability", "html2text"] }

Usage

Auto-detect and extract

use deformat::{extract, Format};

let result = extract("<p>Hello <b>world</b>!</p>").unwrap();
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.").unwrap();
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);

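Extraction functions return an Extracted struct, wrapped in a Result where the call can fail; the low-level html::strip_to_text helper returns the text directly: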

pub struct Extracted {
    pub text: String,
    pub format: Format,
    pub extractor: String,       // e.g. "strip", "readability", "pdf-extract"
    pub title: Option<String>,   // article title (readability only)
    pub excerpt: Option<String>, // article excerpt (readability only)
    pub fallback: bool,          // true if a richer extractor failed
}
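
The fallback flag records that a richer extractor did not produce usable output and a simpler one was used instead. The following is a minimal sketch of that pattern, not deformat's actual code: Outcome, extract_with_fallback, and the min_chars parameter are hypothetical names introduced for illustration, and the two extractors are passed in as closures.

```rust
/// Hypothetical stand-in for the `text` and `fallback` fields of `Extracted`.
#[derive(Debug, PartialEq)]
pub struct Outcome {
    pub text: String,
    pub fallback: bool,
}

/// Try `rich` first. If it fails or yields fewer than `min_chars`
/// characters, use `simple` instead and record that a fallback happened.
pub fn extract_with_fallback<R, S>(input: &str, rich: R, simple: S, min_chars: usize) -> Outcome
where
    R: Fn(&str) -> Option<String>,
    S: Fn(&str) -> String,
{
    match rich(input) {
        // The rich extractor succeeded and produced enough text.
        Some(text) if text.chars().count() >= min_chars => Outcome { text, fallback: false },
        // Failure or too little text: fall back to the simple extractor.
        _ => Outcome { text: simple(input), fallback: true },
    }
}
```

A caller can then check outcome.fallback to decide whether to log a warning or retry with a different strategy.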

HTML strategies

// 1. Tag stripping (always available, fast)
let text = deformat::html::strip_to_text("<p>Hello <b>world</b>!</p>");
assert_eq!(text, "Hello world!");

// Standalone entity decoding
// &eacute; is the precomposed character U+00E9, not "e" plus a combining accent
assert_eq!(deformat::html::decode_entities("Caf&eacute;"), "Caf\u{00E9}");

// 2. Layout-aware DOM conversion (feature: html2text)
let result = deformat::extract_html2text("<table><tr><td>A</td></tr></table>", 80);

// 3. Article extraction via Mozilla Readability (feature: readability)
//    Falls back to tag stripping if content is too short (< 50 chars).
//    `html` is the page source as a &str.
let result = deformat::extract_readable(html, Some("https://example.com/article"));
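
To make the first strategy concrete, here is a minimal sketch of the basic idea behind tag stripping: skip everything between < and >, and collapse whitespace runs. It is not deformat's implementation. naive_strip is a hypothetical helper that does no entity decoding, does not drop script or style content, and does not handle a > inside an attribute value.

```rust
/// Drop everything between `<` and `>` and collapse runs of whitespace.
/// Hypothetical helper: no entity decoding, no script/style handling.
pub fn naive_strip(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut in_tag = false;
    let mut pending_space = false;

    for ch in html.chars() {
        match ch {
            '<' => in_tag = true,
            '>' if in_tag => in_tag = false,
            // Characters inside a tag are discarded.
            _ if in_tag => {}
            // Remember whitespace, but do not emit it yet.
            c if c.is_whitespace() => pending_space = true,
            c => {
                // Emit one space for any whitespace run between words,
                // but never at the very start of the output.
                if pending_space && !out.is_empty() {
                    out.push(' ');
                }
                pending_space = false;
                out.push(c);
            }
        }
    }
    out
}
```

Because tags are simply discarded, adjacent block elements such as `<p>A</p><p>B</p>` run together in this sketch; a production stripper would insert separators for block-level tags.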

PDF extraction

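// From file path (feature: pdf)
// Both calls use `?`, so they must run inside a function that returns a compatible Result.
let result = deformat::pdf::extract_file(std::path::Path::new("report.pdf"))?;

// From bytes in memory (`pdf_bytes` holds the raw PDF data)
let result = deformat::pdf::extract_bytes(&pdf_bytes)?;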

Format detection

use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;

assert!(is_html("<!DOCTYPE html><html>..."));
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);
assert_eq!(detect_bytes(b"%PDF-1.4 ..."), Format::Pdf);
assert_eq!(detect_path("report.pdf"), Format::Pdf);
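
Detection of this kind is typically signature-based. The sketch below shows one way it could work, under stated assumptions: PDF files begin with the bytes %PDF-, and HTML is recognized only by a leading doctype or html tag. Guess, looks_like_pdf, looks_like_html, and guess_bytes are hypothetical names, not part of deformat, and the crate's real heuristics are likely broader (for instance, it classifies a bare `<p>` fragment as HTML, which this sketch would not).

```rust
/// Hypothetical format tag for this sketch (not deformat's `Format`).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Guess {
    Html,
    Pdf,
    PlainText,
}

/// PDF files begin with the signature `%PDF-`.
pub fn looks_like_pdf(bytes: &[u8]) -> bool {
    bytes.starts_with(b"%PDF-")
}

/// Case-insensitive check for a leading `<!doctype html` or `<html`,
/// after skipping leading whitespace.
pub fn looks_like_html(text: &str) -> bool {
    // 14 characters is the length of "<!doctype html".
    let head = text
        .trim_start()
        .chars()
        .take(14)
        .collect::<String>()
        .to_ascii_lowercase();
    head.starts_with("<!doctype html") || head.starts_with("<html")
}

/// Check the binary signature first, then fall back to text sniffing.
/// Anything unrecognized is treated as plain text in this sketch.
pub fn guess_bytes(bytes: &[u8]) -> Guess {
    if looks_like_pdf(bytes) {
        return Guess::Pdf;
    }
    match std::str::from_utf8(bytes) {
        Ok(text) if looks_like_html(text) => Guess::Html,
        _ => Guess::PlainText,
    }
}
```

Checking the PDF signature before attempting UTF-8 decoding keeps the binary case cheap and avoids misreading a PDF header as text.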

HTML tag stripping details

html::strip_to_text handles: tag removal, script/style/noscript content removal, semantic element filtering (<nav>, <header>, <footer>, <aside>, <form>, etc.), ~300 named HTML entities (Latin, Greek, math, typography), numeric/hex character references, Windows-1252 C1 range mapping, CJK ruby annotation stripping, Wikipedia boilerplate removal, reference marker stripping ([1], [edit]), image alt text extraction, and whitespace collapsing.
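
The numeric reference and Windows-1252 handling can be illustrated with a short sketch. It assumes the mapping from the HTML standard, where numeric references in the range 0x80 to 0x9F are interpreted as Windows-1252 characters (so `&#150;` yields an en dash). decode_numeric_ref and C1_TO_CP1252 are hypothetical names; this is not the crate's code, and it leaves out other rules from the standard, such as replacing code point 0.

```rust
/// Windows-1252 replacements for code points 0x80..=0x9F, indexed by
/// `code - 0x80`. A zero entry means the byte is unassigned in Windows-1252.
const C1_TO_CP1252: [u32; 32] = [
    0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0,
    0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178,
];

/// Decode the body of a numeric reference, i.e. the text between `&#`
/// and `;` (for example "233" or "xE9"). Returns `None` for malformed
/// input and for values that are not valid Unicode scalar values.
pub fn decode_numeric_ref(body: &str) -> Option<char> {
    let (digits, radix) = match body.strip_prefix('x').or_else(|| body.strip_prefix('X')) {
        Some(hex) => (hex, 16),
        None => (body, 10),
    };
    // Reject empty input and a leading sign, which `from_str_radix`
    // would otherwise accept.
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    let code = u32::from_str_radix(digits, radix).ok()?;
    let code = match code {
        0x80..=0x9F => match C1_TO_CP1252[(code - 0x80) as usize] {
            0 => code, // unassigned in Windows-1252: keep the C1 control
            mapped => mapped,
        },
        _ => code,
    };
    char::from_u32(code)
}
```

A lookup table keeps the remapping branch-free for the 32 affected code points, and char::from_u32 filters out surrogates and values above U+10FFFF.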

License

MIT OR Apache-2.0
