A curated list of resources, tools, research papers, and repositories related to Document Understanding and Analysis. This repository aims to provide researchers, practitioners, and enthusiasts with a comprehensive guide to navigate the field of document understanding, covering various aspects such as Optical Character Recognition (OCR), Document Layout Analysis, Information Extraction, and more!
- Introduction
- Core Concepts
- Awesome Papers 📚
- Datasets 📊
- Models & Implementations 🤖
- Tools 🔧
- Applications 🏭
- Projects & Repositories 🗂️
- Tutorials & Guides 📖
- Contributing 🤝
- Acknowledgments 🙏
Document understanding is a multidisciplinary field that encompasses various techniques from natural language processing (NLP), computer vision, and machine learning to enable the extraction, classification, and interpretation of information from structured and unstructured documents. From traditional OCR systems to modern deep learning-based approaches, this repository covers the latest advancements and resources to accelerate your research or projects in this domain.
- Document Layout Analysis: Techniques to segment and analyze document layouts (e.g., paragraphs, images, tables).
- Optical Character Recognition (OCR): Methods for recognizing and extracting text from scanned documents or images.
- Information Extraction: Techniques to extract structured information from unstructured documents.
- Document Image Enhancement: Methods to preprocess and enhance document images (e.g., binarization, skew correction).
- Document Classification: Categorizing documents into predefined classes or identifying document types.
- Table and Form Understanding: Analyzing and extracting structured data from complex tables or forms.
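
The binarization step mentioned under Document Image Enhancement can be illustrated in a few lines. The sketch below uses Otsu's method, one common global thresholding technique chosen here as an assumption, since the list above does not prescribe a specific algorithm. The function names `otsu_threshold` and `binarize` are hypothetical, and the image is a plain list of rows of 8-bit grayscale values rather than a real scan.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0  # pixel count and intensity sum for values <= t
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:
            continue  # one class is empty, so the variance is undefined
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map each pixel to 0 (dark, assumed ink) or 1 (light, assumed paper)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p > t else 0 for p in row] for row in image]


if __name__ == "__main__":
    tiny_scan = [[20, 30, 200], [25, 210, 220]]
    print(binarize(tiny_scan))
```

In practice an image library would handle loading and thresholding; the pure-Python version is only meant to show what the technique computes.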
- "Deep Learning for Document Understanding" - A comprehensive review on deep learning techniques for document understanding.
- "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models" - An end-to-end text recognition model built from pre-trained image and text Transformers.
- "Detecting Tables in Scanned Document Images using Convolutional Neural Networks" - A method to detect and extract tables from historical documents.
- "DocFormer: End-to-End Transformer for Document Understanding" - A multi-modal transformer that combines text, vision, and spatial features for document understanding.
For more, see the Papers List.
- PubLayNet: A large-scale dataset for document layout analysis based on PubMed Central.
- RVL-CDIP: A dataset for document image classification with 400,000 labeled grayscale images across 16 classes.
- FUNSD: A dataset for form understanding with annotations for key-value pairs.
- IAM Handwriting Dataset: A widely used dataset for handwritten text recognition.
- DIBCO: Benchmark datasets from the Document Image Binarization Contest series.
For more, see the Datasets List.
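
As a rough illustration of what form-understanding annotations look like, the sketch below reads key-value pairs from a FUNSD-style record. It assumes the layout used by the public FUNSD release: a top-level `form` list whose entries carry `id`, `label`, `text`, `box`, and `linking` fields, with links pointing from a question to its answer. The inline sample is made up for illustration, and `key_value_pairs` is a hypothetical helper name.

```python
import json

# Made-up sample that mimics the assumed FUNSD annotation layout.
SAMPLE = """
{"form": [
  {"id": 0, "label": "question", "text": "Date:",
   "box": [10, 10, 60, 25], "linking": [[0, 1]]},
  {"id": 1, "label": "answer", "text": "2021-03-04",
   "box": [70, 10, 150, 25], "linking": [[0, 1]]},
  {"id": 2, "label": "header", "text": "ORDER FORM",
   "box": [10, 0, 120, 8], "linking": []}
]}
"""


def key_value_pairs(annotation):
    """Return sorted (question_text, answer_text) pairs found via links."""
    entities = {entity["id"]: entity for entity in annotation["form"]}
    pairs = set()  # a set, because each link is stored on both endpoints
    for entity in annotation["form"]:
        for src, dst in entity.get("linking", []):
            question, answer = entities.get(src), entities.get(dst)
            if question is None or answer is None:
                continue  # dangling link
            if question["label"] == "question" and answer["label"] == "answer":
                pairs.add((question["text"], answer["text"]))
    return sorted(pairs)


if __name__ == "__main__":
    print(key_value_pairs(json.loads(SAMPLE)))
```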
- Tesseract OCR: An open-source OCR engine with LSTM-based recognition and support for more than 100 languages.
- LayoutLM: A pre-trained model for document understanding that jointly encodes text and 2-D layout (bounding box) information.
- Detectron2: A modular object detection library that can be used for document layout analysis.
- TrOCR: An encoder-decoder OCR model that pairs a pre-trained image Transformer encoder with a pre-trained text Transformer decoder.
- DocFormer: A transformer-based model for end-to-end document understanding.
For more, see the Models List.
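
Layout-aware models such as LayoutLM take each token's bounding box on a 0-1000 grid that does not depend on the page size. The sketch below shows that normalization step only. The helper name `normalize_bbox` and the clamping behavior are assumptions made for illustration, and boxes are assumed to be `(x0, y0, x1, y1)` in integer pixel coordinates.

```python
def normalize_bbox(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 grid."""
    x0, y0, x1, y1 = box

    def scale(value, size):
        # Integer arithmetic avoids floating-point rounding surprises;
        # clamping keeps slightly out-of-page OCR boxes in range.
        return max(0, min(1000, 1000 * value // size))

    return [
        scale(x0, page_width),
        scale(y0, page_height),
        scale(x1, page_width),
        scale(y1, page_height),
    ]


if __name__ == "__main__":
    # A word box on an 850 x 1100 pixel page image.
    print(normalize_bbox((85, 110, 425, 165), 850, 1100))
```

The normalized boxes would then be passed to the model alongside the token ids; how that is wired up depends on the library and model version in use.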
- PDFMiner: A Python tool for extracting text and layout information from PDFs.
- Apache Tika: A toolkit for extracting metadata and text from a variety of file types.
- PaddleOCR: An easy-to-use OCR tool for multilingual text detection and recognition.
- Tabula: A tool for extracting tables from PDF documents.
- pdf2image: A Python wrapper around Poppler that converts PDF pages into PIL images.
For more, see the Tools List.
- Invoice and Receipt Processing: Automating the extraction of information from receipts and invoices.
- Historical Document Digitization: Enhancing and recognizing text in historical documents for digital archives.
- Forms Processing: Extracting field values from completed forms and pre-filling forms automatically.
- Digital Signage Analysis: Recognizing and analyzing text in digital images for compliance checks.
- Content Moderation: Identifying and filtering inappropriate content in digital documents.
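
To make the invoice and receipt use case concrete, here is a minimal rule-based sketch that pulls a few fields out of text that is assumed to have already been produced by an OCR step. The sample text, field names, and regular expressions are all hypothetical; real systems usually combine layout-aware models with rules like these rather than relying on regexes alone.

```python
import re
from decimal import Decimal

# Hypothetical OCR output for a simple invoice.
OCR_TEXT = """ACME Supplies Ltd.
Invoice No: INV-2041
Date: 2023-11-05
Subtotal 118.00
TOTAL: 129.80
"""

# One illustrative pattern per field; each captures the value in group 1.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*No[.:]?\s*([A-Z0-9-]+)", re.IGNORECASE),
    "date": re.compile(r"Date[.:]?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    # Anchored at line start so that "Subtotal" is not mistaken for the total.
    "total": re.compile(r"^\s*Total[.:]?\s*\$?(\d+(?:\.\d{2})?)",
                        re.IGNORECASE | re.MULTILINE),
}


def extract_fields(text):
    """Return a dict of the fields whose pattern matched the text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    if "total" in fields:
        fields["total"] = Decimal(fields["total"])  # exact decimal for money
    return fields


if __name__ == "__main__":
    print(extract_fields(OCR_TEXT))
```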
- OCR-D/core: The core framework of OCR-D, an open-source initiative for OCR of historical printed documents.
- DeepMind/DeepOCR: A library for deep learning-based OCR systems.
- Google/DocQ: A system for end-to-end document understanding and question answering.
- LayoutParser: A Python library for extracting layout information from documents.
- huggingface/transformers: A general-purpose library of transformer models that includes document understanding models such as LayoutLM and TrOCR.
- Document Understanding with Transformers: An introductory guide to building document understanding systems using transformers.
- Implementing OCR with Tesseract: A step-by-step guide to implementing OCR using the Tesseract library.
- Training LayoutLM for Document Classification: A tutorial on fine-tuning LayoutLM for document classification tasks.
- Building a Custom Document Image Dataset: Tips and techniques for creating a high-quality document image dataset.
Contributions are welcome! If you have a suggestion, please open an issue or submit a pull request. Follow the contributing guidelines in CONTRIBUTING.md.
A big thank you to the researchers, engineers, and contributors who have made the field of Document Understanding so exciting and innovative. This list would not be possible without your contributions.
Feel free to use and share this repository, and don't forget to star 🌟 it if you find it useful!