#text-extraction

  1. pdf_oxide

    The fastest Rust PDF library with text extraction: 0.8ms mean, 100% pass rate on 3,830 PDFs. 5× faster than pdf_extract, 17× faster than oxidize_pdf. Extract, create, and edit PDFs.

    v0.3.21 15K #pdf #pdf-parser #text-extraction #document-parser #pdf-to-markdown #text-document
  2. deformat

    Extract plain text from HTML, PDF, and other document formats

    v0.6.0 1.2K #html #pdf #nlp #text-html #text-extraction
  3. unpdf

    High-performance PDF content extraction to Markdown, text, and JSON

    v0.2.4 1.4K #pdf #markdown #text-extraction #document-parser #text-document #pdf-parser
  4. bolivar-cli

    PDF text extraction CLI tools

    v1.6.1 #pdf #text-extraction #pdfplumber #pdfminer
  5. pdf-text-extract

    Extract text, tables, and structured content from PDF files

    v0.2.0 #pdf-parser #pdf #markdown #text-extraction #parser
  6. wikipedia-article-transform

    Transform Wikipedia articles from HTML into plaintext and Markdown formats

    v0.4.0 #tree-sitter #html #nlp #text-extraction
  7. folio-pdf

    A comprehensive PDF library for Rust

    v0.0.3 #pdf #pdf-parser #text-extraction #document-parser #parser
  8. keyword_extraction

    Collection of algorithms for keyword extraction from text

    v1.5.0 180 #tf-idf #algorithm #text-extraction
  9. kawat

    Web content extraction library inspired by trafilatura. Extracts main text, metadata, and comments from HTML.

    v0.1.3 #web-scraping #text-extraction #nlp #html #boilerplate-removal
  10. lopdf-parang

    A fork of lopdf optimized for PDF text extraction — lazy streams, O(1) object slicing, zlib-rs

    v0.39.1 #pdf #text-extraction #parser
  11. pdfplumber-cli

    Command-line tool to extract text, characters, words, and tables from PDF documents

    v0.2.0 #pdf #table #text-extraction #cli-table
  12. llm-text

    Process text for LLM consumption

    v0.1.0 #nlp #llm #text-extraction #html #text-html
  13. justpdf

    Pure Rust PDF engine - read, render, extract, create, modify

    v0.1.2 #pdf #document #text-extraction #render #parser #graphics
  14. rpdfium

    A faithful Rust port of Google's PDFium PDF rendering engine

    v7676.6.4 #pdf #text-extraction #pdfium #document
  15. hwarang

    Fast HWP document text extractor

    v0.2.0 #hwp #hwpx #text-extraction #korean #hancom
  16. pdfvec

    High-performance PDF text extraction library for vectorization pipelines

    v0.1.1 #pdf #vectorization #nlp #text-extraction
  17. pdfplumber

    Extract chars, words, lines, rects, and tables from PDF documents with precise coordinates

    v0.2.0 #pdf #table #text-extraction #document
  18. pdf_oxide_cli

    CLI for pdf-oxide — the fastest PDF toolkit. 22 commands: text extraction, PDF to markdown, search, merge, split, images, compress, encrypt, watermark, forms, and more.

    v0.3.21 #pdf #text-extraction #pdf-to-markdown #pdf-toolkit #cli-toolkit
  19. papyrus-core

    PDF-to-Markdown conversion engine with smart heading detection, bold/italic text extraction, and CommonMark output. Pure Rust, best-effort parsing for corrupted PDFs.

    v0.1.0 #markdown #convert #pdf #text-extraction #extract-text
  20. docx-lite

    Lightweight, fast DOCX text extraction library with minimal dependencies

    v0.2.0 1.4K #docx #text-extraction #parser #word #office
  21. elizaos-plugin-pdf

    elizaOS PDF Plugin - PDF reading and text extraction

    v2.0.0 #pdf #elizaos #document-processing #text-extraction
  22. unpdf-cli

    CLI tool for extracting PDF content to Markdown, text, and JSON

    v0.2.3 #markdown #pdf #text-extraction
  23. parangi

    PDF text extraction library — Rust port of Apache PDFBox

    v0.1.0 #pdf #text-extraction #pdfbox
  24. heavy-pdf-parser

    Extract text from PDF files with support for multiple output formats

    v0.1.0 #pdf #text-extraction #document-processing #rust
  25. epub-parser

    Extract metadata, table of contents, text, cover, and images from EPUB files

    v0.3.4 #ebook #epub #text-extraction #metadata #parser
  26. parser-core

    Extract text from various file formats, including PDF, DOCX, XLSX, PPTX, images (via OCR), and more

    v0.1.3 120 #docx #text-parser #pdf #ocr #text-extraction
  27. arabic_pdf_to_text

    A CLI tool to convert Arabic PDFs to text using Google's Gemini API

    v0.1.0 #gemini-api #pdf #arabic #text-extraction
  28. justpdf-core

    Pure Rust PDF engine — parsing, writing, compression, text extraction, encryption, digital signatures

    v0.1.3 #pdf #signature #pdf-parser #text-extraction #compression
  30. pdfplumber-core

    Core data types and algorithms for pdfplumber-rs (backend-independent)

    v0.2.0 #pdf #text-extraction #table
  31. the-daily-stallman

    Read the news like Stallman would. No JavaScript required.

    v0.3.1 #stallman #text-extraction #rms #news