10 releases (4 breaking)
| Version | Released |
|---|---|
| 0.5.1 | Mar 12, 2026 |
| 0.5.0 | Mar 11, 2026 |
| 0.4.2 | Mar 9, 2026 |
| 0.3.1 | Mar 6, 2026 |
| 0.1.0 | Mar 6, 2026 |
#407 in Text processing
1,263 downloads per month
Used in anno-lib
145KB, 3K SLoC
deformat
Extracts plain text from HTML, PDF, and other document formats. Operates on
&str and &[u8] inputs -- no network I/O, no filesystem access (except
PDF file extraction).
Supported formats
| Format | Input | Feature flag | Extractor |
|---|---|---|---|
| HTML (tag strip) | &str | (none -- always available) | html::strip_to_text |
| HTML (layout-aware) | &str | html2text | extract_html2text |
| HTML (article) | &str | readability | extract_readable |
| PDF | &Path or &[u8] | pdf | pdf::extract_file, pdf::extract_bytes |
| Plain text / Markdown | &str | (none) | passthrough |
The default build depends only on memchr.
Install
cargo add deformat # minimal
cargo add deformat --features readability,html2text,pdf # all extractors
[dependencies]
deformat = { version = "0.5.1", features = ["readability", "html2text"] }
Usage
Auto-detect and extract
use deformat::{extract, Format};
let result = extract("<p>Hello <b>world</b>!</p>").unwrap();
assert_eq!(result.text, "Hello world!");
assert_eq!(result.format, Format::Html);
// Plain text passes through unchanged
let result = extract("Just plain text.").unwrap();
assert_eq!(result.text, "Just plain text.");
assert_eq!(result.format, Format::PlainText);
All extraction functions return an Extracted struct:
pub struct Extracted {
pub text: String,
pub format: Format,
pub extractor: String, // e.g. "strip", "readability", "pdf-extract"
pub title: Option<String>, // article title (readability only)
pub excerpt: Option<String>, // article excerpt (readability only)
pub fallback: bool, // true if a richer extractor failed
}
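As a rough illustration of how a caller might consume these fields, the sketch below mirrors the struct locally so it compiles without the crate. The `Format` enum is reduced to the three variants this README mentions, and `summarize` is a hypothetical helper, not part of deformat.

```rust
// Local mirror of the README's `Extracted` struct (assumption: the real
// `Format` enum may have more variants than the three named in this README).
#[derive(Debug, Clone, PartialEq)]
pub enum Format {
    Html,
    Pdf,
    PlainText,
}

#[derive(Debug, Clone)]
pub struct Extracted {
    pub text: String,
    pub format: Format,
    pub extractor: String,
    pub title: Option<String>,
    pub excerpt: Option<String>,
    pub fallback: bool,
}

/// Hypothetical helper: builds a one-line summary suitable for logging.
/// Flags results where a richer extractor failed and a simpler one was used.
pub fn summarize(e: &Extracted) -> String {
    let title = e.title.as_deref().unwrap_or("(untitled)");
    let note = if e.fallback { " [fallback]" } else { "" };
    format!(
        "{} via {}{}: {} chars",
        title,
        e.extractor,
        note,
        e.text.chars().count()
    )
}
```

Checking `fallback` before trusting `title` or `excerpt` is one reasonable pattern, since those fields are only populated by the readability extractor.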
HTML strategies
// 1. Tag stripping (always available, fast)
let text = deformat::html::strip_to_text("<p>Hello <b>world</b>!</p>");
assert_eq!(text, "Hello world!");
// Standalone entity decoding
assert_eq!(deformat::html::decode_entities("Cafe&#x0301;"), "Cafe\u{0301}");
// 2. Layout-aware DOM conversion (feature: html2text)
let result = deformat::extract_html2text("<table><tr><td>A</td></tr></table>", 80);
// 3. Article extraction via Mozilla Readability (feature: readability)
// Falls back to tag stripping if content is too short (< 50 chars).
// `html` is a &str containing the page markup.
let result = deformat::extract_readable(html, Some("https://example.com/article"));
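The fallback rule in the comment above (use tag stripping when the article text is under 50 characters) can be sketched in isolation. This is a minimal illustration, not the crate's code: `prefer_rich` and `MIN_ARTICLE_CHARS` are hypothetical names, and counting trimmed Unicode scalar values rather than bytes is an assumption.

```rust
/// Hypothetical threshold mirroring the "< 50 chars" rule from the README.
const MIN_ARTICLE_CHARS: usize = 50;

/// Returns `(text, fallback)`.
/// Keeps the rich extractor's output when it is long enough; otherwise runs
/// the cheaper `plain` extractor and reports `fallback = true`.
pub fn prefer_rich<F>(rich: Option<String>, plain: F) -> (String, bool)
where
    F: FnOnce() -> String,
{
    match rich {
        // Assumption: length is measured in chars after trimming whitespace.
        Some(text) if text.trim().chars().count() >= MIN_ARTICLE_CHARS => (text, false),
        // Rich extraction failed or produced too little text.
        _ => (plain(), true),
    }
}
```

Passing the plain extractor as a closure means it only runs when the fallback is actually needed.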
PDF extraction
// From file path (feature: pdf)
let result = deformat::pdf::extract_file(std::path::Path::new("report.pdf"))?;
// From bytes in memory
let result = deformat::pdf::extract_bytes(&pdf_bytes)?;
Format detection
use deformat::detect::{is_html, is_pdf, detect_str, detect_bytes, detect_path};
use deformat::Format;
assert!(is_html("<!DOCTYPE html><html>..."));
assert_eq!(detect_str("<html><body>Hello</body></html>"), Format::Html);
assert_eq!(detect_bytes(b"%PDF-1.4 ..."), Format::Pdf);
assert_eq!(detect_path("report.pdf"), Format::Pdf);
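For intuition, content sniffing of this kind is often done on a short prefix of the input. The sketch below is a simplified stand-in, not deformat's detection logic: `Sniffed` and `sniff` are hypothetical names, and it recognizes only the `%PDF-` magic prefix and a leading doctype or `<html` tag. The crate's own detection is evidently broader, since `extract` treats a bare `<p>` fragment as HTML.

```rust
/// Hypothetical result type for this sketch (not the crate's `Format`).
#[derive(Debug, PartialEq)]
pub enum Sniffed {
    Html,
    Pdf,
    PlainText,
}

/// Sniffs a byte buffer: PDF by its `%PDF-` magic prefix, HTML by a leading
/// doctype or `<html` tag (ASCII case-insensitive), anything else as plain text.
pub fn sniff(bytes: &[u8]) -> Sniffed {
    if bytes.starts_with(b"%PDF-") {
        return Sniffed::Pdf;
    }
    // Skip leading ASCII whitespace, then lowercase a short prefix only.
    let start = bytes
        .iter()
        .position(|b| !b.is_ascii_whitespace())
        .unwrap_or(bytes.len());
    let head: Vec<u8> = bytes[start..]
        .iter()
        .take(32)
        .map(|b| b.to_ascii_lowercase())
        .collect();
    if head.starts_with(b"<!doctype html") || head.starts_with(b"<html") {
        Sniffed::Html
    } else {
        Sniffed::PlainText
    }
}
```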
HTML tag stripping details
html::strip_to_text handles: tag removal, script/style/noscript content removal,
semantic element filtering (<nav>, <header>, <footer>, <aside>, <form>,
etc.), ~300 named HTML entities (Latin, Greek, math, typography), numeric/hex character
references, Windows-1252 C1 range mapping, CJK ruby annotation stripping, Wikipedia
boilerplate removal, reference marker stripping ([1], [edit]), image alt text
extraction, and whitespace collapsing.
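To make the numeric-reference and Windows-1252 items concrete, here is a minimal sketch of the usual approach, assuming the C1 remapping follows the table in the HTML standard. The function names are hypothetical and this is not the crate's implementation. Unlike the HTML standard, it returns `None` for surrogates and out-of-range values instead of substituting U+FFFD.

```rust
/// Maps a numeric character reference's code to a `char`.
/// Codes 0x80..=0x9F are remapped through Windows-1252, as the HTML standard
/// does for legacy content; the five unassigned slots pass through unchanged.
pub fn numeric_ref_to_char(code: u32) -> Option<char> {
    const C1: [u32; 32] = [
        0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
        0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
    ];
    let mapped = match code {
        0x80..=0x9F => C1[(code - 0x80) as usize],
        other => other,
    };
    // `from_u32` rejects surrogates and values above U+10FFFF.
    char::from_u32(mapped)
}

/// Parses the body of a reference such as `#8217` or `#x2019`
/// (without the leading `&` or trailing `;`).
pub fn decode_numeric_ref(body: &str) -> Option<char> {
    let digits = body.strip_prefix('#')?;
    let code = match digits.strip_prefix('x').or_else(|| digits.strip_prefix('X')) {
        Some(hex) => u32::from_str_radix(hex, 16).ok()?,
        None => digits.parse::<u32>().ok()?,
    };
    numeric_ref_to_char(code)
}
```

With this mapping, a legacy `&#146;` yields the right single quotation mark (U+2019) rather than an invisible C1 control character.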
License
MIT OR Apache-2.0
Dependencies
~0–4.5MB, ~71K SLoC