
deformat

Extract plain text from HTML, PDF, and other document formats

10 releases (4 breaking)

0.5.1 Mar 12, 2026
0.5.0 Mar 11, 2026
0.4.2 Mar 9, 2026
0.3.1 Mar 6, 2026
0.1.0 Mar 6, 2026

Used in anno-lib

MIT/Apache



Extracts plain text from HTML, PDF, and other document formats. Extractors operate on in-memory &str and &[u8] inputs: no network I/O, and no filesystem access except when extracting a PDF from a file path (pdf::extract_file).

Supported formats

| Format | Input | Feature flag | Extractor |
|---|---|---|---|
| HTML (tag strip) | &str | (none -- always available) | html::strip_to_text |
| HTML (layout-aware) | &str | html2text | extract_html2text |
| HTML (article) | &str | readability | extract_readable |
| PDF | &Path or &[u8] | pdf | pdf::extract_file, pdf::extract_bytes |
| Plain text / Markdown | &str | (none) | passthrough |

The default build depends only on memchr.

Install

cargo add deformat                                        # minimal
cargo add deformat --features readability,html2text,pdf   # all extractors

Or in Cargo.toml:

[dependencies]
deformat = { version = "0.5.1", features = ["readability", "html2text"] }

Usage

Auto-detect and extract

use deformat::{extract, Format};

let result = extract("<p>Hello <b>world</b>!</p>").unwrap();
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);

// Plain text passes through unchanged
let result = extract("Just plain text.").unwrap();
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);

Apart from html::strip_to_text, which returns the stripped text directly, extraction functions return an Extracted struct:

pub struct Extracted {
    pub text: String,
    pub format: Format,
    pub extractor: String,       // e.g. "strip", "readability", "pdf-extract"
    pub title: Option<String>,   // article title (readability only)
    pub excerpt: Option<String>, // article excerpt (readability only)
    pub fallback: bool,          // true if a richer extractor failed
}
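
As a rough illustration of how a caller might consume these fields, the sketch below builds a one-line log summary. It defines local mirrors of Extracted and Format so that it compiles without the crate (the real Format enum may have more variants than the three that appear in this README), and summarize is a hypothetical helper, not part of deformat.

```rust
// Local mirrors of the types shown above, so the sketch compiles on its own.
// In real code these would be imported from deformat instead.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Html,
    Pdf,
    PlainText,
}

#[derive(Debug)]
struct Extracted {
    text: String,
    format: Format,
    extractor: String,
    title: Option<String>,
    excerpt: Option<String>,
    fallback: bool,
}

/// Hypothetical helper: a one-line summary suitable for logging.
/// Uses the title if present, then the excerpt, then a placeholder.
fn summarize(e: &Extracted) -> String {
    let label = e
        .title
        .as_deref()
        .or(e.excerpt.as_deref())
        .unwrap_or("(untitled)");
    // Flag results where a richer extractor failed and a simpler one was used.
    let note = if e.fallback { " [fallback]" } else { "" };
    format!(
        "{:?} via {}{}: {} ({} chars)",
        e.format,
        e.extractor,
        note,
        label,
        e.text.chars().count()
    )
}
```

Checking fallback before trusting title or excerpt is probably worthwhile, since those fields are only populated by the readability extractor.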

HTML strategies

// 1. Tag stripping (always available, fast)
let text = deformat::html::strip_to_text("<p>Hello <b>world</b>!</p>");
assert_eq!(text, "Hello world!");

// Standalone entity decoding (&eacute; decodes to the precomposed U+00E9)
assert_eq!(deformat::html::decode_entities("Caf&eacute;"), "Caf\u{00E9}");

// 2. Layout-aware DOM conversion (feature: html2text)
let result = deformat::extract_html2text("<table><tr><td>A</td></tr></table>", 80);

// 3. Article extraction via Mozilla Readability (feature: readability)
//    Falls back to tag stripping if content is too short (< 50 chars).
//    `html` is a &str holding the page source.
let result = deformat::extract_readable(html, Some("https://example.com/article"));
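
The short-content fallback mentioned in the comment above could be sketched roughly as follows. This is not the crate's code: choose_text is a hypothetical helper, and how the 50-character threshold is measured (here: trimmed length in chars) is an assumption.

```rust
/// Threshold taken from the README comment; whether the crate counts bytes
/// or chars, and whether it trims first, is an assumption of this sketch.
const MIN_ARTICLE_CHARS: usize = 50;

/// Hypothetical helper: prefer the article text when it is long enough,
/// otherwise use the tag-stripped text. Returns (text, fallback_used).
fn choose_text(article: Option<&str>, stripped: &str) -> (String, bool) {
    match article {
        // Article extraction succeeded and produced enough content.
        Some(a) if a.trim().chars().count() >= MIN_ARTICLE_CHARS => {
            (a.trim().to_string(), false)
        }
        // Extraction failed or the result was too short: fall back.
        _ => (stripped.to_string(), true),
    }
}
```

The boolean in the returned pair plays the same role as the fallback field on Extracted.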

PDF extraction

// From file path (feature: pdf)
let result = deformat::pdf::extract_file(std::path::Path::new("report.pdf"))?;

// From bytes in memory
let result = deformat::pdf::extract_bytes(&pdf_bytes)?;

Format detection

use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;

assert!(is_html("<!DOCTYPE html><html>..."));
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);
assert_eq!(detect_bytes(b"%PDF-1.4 ..."), Format::Pdf);
assert_eq!(detect_path("report.pdf"), Format::Pdf);
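
For intuition, content sniffing of this kind usually comes down to checking a few leading bytes. The sketch below is a minimal stand-alone version, not deformat's implementation: Kind and sniff are hypothetical names, and the HTML markers and the 512-byte window are assumptions chosen for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    Pdf,
    Html,
    Unknown,
}

/// Hypothetical sniffer: classify input by its leading bytes.
fn sniff(bytes: &[u8]) -> Kind {
    // PDF files start with the "%PDF-" header followed by a version number.
    if bytes.starts_with(b"%PDF-") {
        return Kind::Pdf;
    }
    // Inspect a bounded, lowercased prefix so large inputs stay cheap.
    let prefix: Vec<u8> = bytes
        .iter()
        .take(512)
        .map(|b| b.to_ascii_lowercase())
        .collect();
    let text = String::from_utf8_lossy(&prefix);
    let head = text.trim_start();
    if head.starts_with("<!doctype html") || head.starts_with("<html") || head.contains("<body") {
        return Kind::Html;
    }
    Kind::Unknown
}
```

A path-based check such as detect_path presumably looks at the file extension instead, though the README does not say so.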

HTML tag stripping details

html::strip_to_text handles: tag removal, script/style/noscript content removal, semantic element filtering (<nav>, <header>, <footer>, <aside>, <form>, etc.), ~300 named HTML entities (Latin, Greek, math, typography), numeric/hex character references, Windows-1252 C1 range mapping, CJK ruby annotation stripping, Wikipedia boilerplate removal, reference marker stripping ([1], [edit]), image alt text extraction, and whitespace collapsing.
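
To make the numeric reference and Windows-1252 C1 items concrete: in HTML, a reference such as &#150; names a C1 control code, which browsers reinterpret as the Windows-1252 character at that byte (an en dash here). The sketch below shows one way to do that remapping. It is an illustration under that assumption, not the crate's code; decode_numeric_ref and C1_TO_CP1252 are hypothetical names.

```rust
/// Windows-1252 replacements for code points 0x80..=0x9F, following the
/// numeric character reference table in the HTML standard.
/// A 0 entry marks a code point with no replacement.
const C1_TO_CP1252: [u32; 32] = [
    0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0,
    0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178,
];

/// Decode the body of a numeric reference (the text between '&' and ';'),
/// e.g. "#150" or "#x96". Returns None if the body is not numeric.
fn decode_numeric_ref(body: &str) -> Option<char> {
    let digits = body.strip_prefix('#')?;
    let code = match digits.strip_prefix('x').or_else(|| digits.strip_prefix('X')) {
        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
        None => digits.parse::<u32>().ok()?,
    };
    // Remap the C1 range through the Windows-1252 table where an entry exists.
    let mapped = match code {
        0x80..=0x9F => {
            let m = C1_TO_CP1252[(code - 0x80) as usize];
            if m != 0 { m } else { code }
        }
        _ => code,
    };
    // NUL, surrogates, and out-of-range values become U+FFFD.
    Some(match mapped {
        0 => '\u{FFFD}',
        c => char::from_u32(c).unwrap_or('\u{FFFD}'),
    })
}
```

For example, decode_numeric_ref("#150") yields U+2013 and decode_numeric_ref("#x80") yields the euro sign U+20AC, while a named entity body such as "eacute" returns None and would be handled by a separate name table.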

License

MIT OR Apache-2.0
