Representation Learning For Information Extraction From Form-Like Documents

† Work done during an internship at Google Research

Abstract

We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases.

1 Introduction

In this paper, we present a novel approach to the task of extracting structured information from form-like documents using a learned representation of an extraction candidate. Form-like documents like invoices, purchase orders, tax forms and insurance quotes are common in day-to-day business workflows, but current techniques for processing them largely still employ either manual effort or brittle and error-prone heuristics for extraction. The research question motivating our work is the following: given a target set of fields for a particular domain – e.g., due date and total amount for invoices – along with a small set of manually-labeled examples, can we learn to extract these fields from unseen documents?

Take, for instance, the domain of invoices, a document type that large enterprises often receive and process thousands of times every week (iPayables, 2016). Invoices from different vendors often present the same types of information but with different layouts and positioning. Figure 1 shows the headers of invoices from a few different vendors, showing the invoice date (highlighted in green) and number in different layouts. Furthermore, invoices from the same supplier even share similar presentation and differ only in specific values. We refer to this unit of visual pattern that is similar across a collection of documents as a template, and the fields of information that are common across templates in a domain as the schema. The schema consists of fields like invoice_date and total_amount, each associated with a type like date and currency.

Figure 1: Excerpts from sample invoices from different vendors. Instances of the invoice_date field are highlighted in green.

Extracting values for these fields from a given document, particularly one belonging to an unseen template, is a challenging problem for many reasons. In contrast to most prior work on information extraction (Sarawagi, 2008), templatic documents do not contain much prose. Approaches that work well on natural text organized in sentences cannot be applied directly to such documents, where spatial layout elements like tables and grid formatting are commonplace. Understanding spatial relationships is critical for achieving good extraction performance on such documents. Moreover, these documents are usually in PDF or scanned image formats, so these presentation hints are not explicitly available in a markup language. Techniques that are successful on HTML documents such as web pages, including traditional wrapper induction approaches (Dalvi et al., 2011), are therefore not immediately applicable.

Recently, there has been a surge in research interest in solving this extraction task by adapting techniques in natural language processing (Liu et al., 2019), computer vision (Davis et al., 2019), or combinations thereof (Katti et al., 2018). In contrast to this body of work, we propose an approach based on representation learning for this task. We first generate extraction candidates for each target field using its associated type (e.g., all dates as candidates for invoice_date). We then use a neural network model to learn a dense representation for each extraction candidate independent of the field to which it belongs. We also learn a separate representation for the field itself, and use the similarity between the candidate and field representations to score the candidate according to how likely it is to be the true extraction value for that field.

The design of our extraction system rests on a few observations about how information is often laid out in form-like documents (see Section 2). An advantage of our representation learning approach is that it allows us to encode certain priors we developed based on these observations into the architecture of the neural network and its input features (see Section 4). In fact, our experiments show that our proposed neural architecture outperforms a more naive MLP baseline using the same input features by about 10 F1 points on the extraction task for two different domains (see Section 6). Furthermore, the learned candidate representations are also meaningful and lend themselves to interpretation, as we show by delving into some loss cases.

2 Observations about Forms

We make three key observations about form-like documents that inform our design.

Observation 1 Each field often corresponds to a well-understood type. For example, the only likely extraction candidates for the invoice_date field in an invoice are instances of dates. A currency amount like $25.00 would clearly be incorrect. Since there are orders of magnitude fewer dates on an invoice than there are text tokens, limiting the search space by type dramatically simplifies the problem. Consequently, we use a library of detectors for several common types such as dates, currency amounts, integers, addresses, emails, etc. to generate candidates.

Observation 2 Each field instance is usually associated with a key phrase that bears an apparent visual relationship with it. Consider the invoice excerpt in Figure 1(c). It contains two date instances, only one of which is the true invoice_date, as indicated by the word “Date” next to it. Similarly, in the bottom-right invoice excerpt, we are easily able to distinguish between the invoice number (indicated by “Invoice #”) and the purchase order number (indicated by “PO #”). We call such indicative words key phrases.

Proximity is not the only criterion that defines a key phrase. For instance, the word “Date” is not the nearest one to the true invoice_date instance in Figure 1(c); the document number in the line above and the page number below are clearly closer. It is also not the case that the key phrase always occurs on the same line; Figure 1(a) shows a case where the key phrase “DATE” occurs just above the true invoice_date. An effective solution needs to combine the spatial information with the textual information. Fortunately, in our experience, these spatial relationships exhibit only a small number of variations across templates, and these tend to generalize across fields and domains.

Observation 3 Key phrases for a field are largely drawn from a small vocabulary of field-specific variants. In a corpus of invoices we collected, we observed that, as exemplified by the samples in Figure 1, about 93% of the nearly 8400 invoice date instances were associated with key phrases that included the words “date” or “dated”, and about 30% included “invoice”. Only about 7% of invoice dates had neither of these words in their key phrases. Similarly, 87% of the nearly 2800 due_date instances in our corpus had key phrases that contained the word “due” and 81% contained “date”. We found similar patterns for all other fields we investigated. The fact that there are only a small number of field-specific key phrases suggests that this problem may be tractable with modest amounts of training data.

While these observations are applicable to many fields across different document types, there are several exceptions, which we plan to tackle in future work.

3 Extraction Pipeline

We leveraged the observations laid out in Section 2 to build a system to solve the information extraction task for form-like documents. Given a document and a target schema, we generate extraction candidates for each field from the document text using the field type. We then score each candidate independently using a neural scoring model. Finally, we assign at most one scored candidate as an extraction result for each field. We discuss the stages of this pipeline here, and delve into the architecture of the scoring model in Section 4.

3.1 Ingestion

Our system can ingest both native digital and scanned documents. We render each document to an image and use a cloud OCR service [1] to extract all the text in it.

The text in the OCR result is arranged in the form of a hierarchy, with individual characters at the leaf level, and words, paragraphs and blocks respectively at higher levels. The nodes in each level of the hierarchy are associated with bounding boxes represented in the 2D Cartesian plane of the document page. The words in a paragraph are arranged in reading order, as are the paragraphs and blocks themselves.

3.2 Candidate Generation

In Section 2, we made the observation that fields in our target schema correspond to well-understood types like dates, integers, currency amounts, addresses, etc. There are well-known techniques to detect instances of these types in text, ranging from regular expression matching and heuristics to sequence labeling using models trained on web data.

We associate each field type supported by our system with one or more candidate generators. These generators use a cloud-based entity extraction service [2] to detect spans of the OCR text extracted from the documents that are instances of the corresponding type. For example, every date in an invoice becomes a candidate for every date field in the target schema, viz. invoice_date, due_date and delivery_date.

Since the recall of the overall extraction system cannot exceed that of the candidate generators, it is important that their recall be high. Precision is, however, largely the responsibility of the scorer and assigner.

3.3 Scoring and Assignment

Given a set of candidates from a document for each field in the target schema, the crux of the extraction task is to identify the correct extraction candidate (if any) for each field. While there are many approaches one could take to solve this problem, we made the design choice to break it down into two steps: first, we compute a score ∈ [0, 1] for each candidate independently using a neural model, then we assign to each field the scored candidate that is most likely to be the true extraction for it.

This separation of scoring and assignment allows us to learn a representation for each candidate based only on its neighborhood, independently of other candidates and fields. It also frees us to encode arbitrarily complex business rules into the assigner if required, for example, that the due date for an invoice cannot (chronologically) precede its invoice date, or that the line item prices must sum up to the total.

For brevity, we omit the details of the assignment module and report results using a simple assigner that chooses the highest-scoring candidate for each field independently of other fields.

4 Neural Scoring Model

The scoring module takes as input the target field from the schema and the extraction candidate to produce a prediction score ∈ [0, 1]. While the downstream assignment module consumes the scores directly, the scorer is trained and evaluated as a binary classifier. The target label for a candidate is determined by whether the candidate matches the ground truth for that document and field.

An important desideratum for us in the design of the scorer is that it learns a meaningful candidate representation. We propose an architecture where the model learns separate embeddings for the candidate and the field it belongs to, and where the similarity between the candidate and field embeddings determines the score.

We believe that such an architecture allows a single model to learn candidate representations that generalize across fields and document templates. We can conceptualize the learned representation of a candidate as encoding what words in its neighborhood form its associated key phrase since, apropos Observation 2, the spatial relationships between candidates and their key phrases are observed to generalize across fields. On the other hand, the embedding for a field can be conceptualized as encoding the key phrase variants that are usually indicative of it, apropos Observation 3.

[1] cloud.google.com/vision
[2] cloud.google.com/natural-language
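The scoring idea described above — separate candidate and field embeddings compared by similarity — can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: in the paper the candidate encoding is produced by a neural network from neighborhood features, whereas here both vectors are supplied directly, and the affine rescaling of cosine similarity into [0, 1] is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def score_candidate(candidate_encoding, field_embedding):
    """Map a similarity in [-1, 1] to a score in [0, 1].

    The rescaling used here is an illustrative choice; the paper only
    specifies that the candidate-field similarity determines the score.
    """
    return (cosine_similarity(candidate_encoding, field_embedding) + 1.0) / 2.0
```

A candidate whose encoding points in the same direction as the field embedding scores near 1, an orthogonal one scores 0.5, and an opposing one scores near 0.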
Figure 2: Neighbor ‘Invoice’ for invoice_date candidate with relative position (−0.06, −0.01).
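The relative position mentioned in Figure 2 can be computed, as one plausible sketch, by normalizing the offset between a neighbor's bounding box and the candidate's by the page dimensions. The box convention and centroid-based normalization below are illustrative assumptions, not the paper's exact feature definition:

```python
def relative_position(candidate_box, neighbor_box, page_width, page_height):
    """Offset of a neighbor's centroid from the candidate's centroid,
    normalized by page size so the feature is comparable across pages.

    Boxes are (x_min, y_min, x_max, y_max) in page coordinates.
    """
    def centroid(box):
        x0, y0, x1, y1 = box
        return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

    cx, cy = centroid(candidate_box)
    nx, ny = centroid(neighbor_box)
    return ((nx - cx) / page_width, (ny - cy) / page_height)
```

A neighbor slightly to the left of and above the candidate yields small negative coordinates, like the (−0.06, −0.01) shown for the neighbor ‘Invoice’ in Figure 2.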
                                          Scorer ROC AUC        End-to-End Max F1
Field             Type         #    % Pos  BoW    MLP    Ours    BoW    MLP    Ours
invoice_id        alphanum  13,719   6.8%  0.983  0.988  0.993   0.913  0.937  0.949
purchase_order    alphanum  13,262   2.2%  0.959  0.967  0.976   0.826  0.851  0.896
total_amount      currency   8,182  12.5%  0.966  0.972  0.980   0.834  0.849  0.858
total_tax_amount  currency   2,949   7.5%  0.975  0.967  0.980   0.756  0.812  0.839
Macro-average     -              -  14.9%  0.973  0.974  0.982   0.842  0.861  0.878
Receipts
date              date         258  85.5%  0.748  0.792  0.737   0.885  0.885  0.854
total             currency     475  16.7%  0.834  0.796  0.889   0.631  0.607  0.813
Macro-average     -              -  51.1%  0.791  0.794  0.813   0.758  0.746  0.833

Table 2: Performance on the test set of unseen templates for Invoices and Receipts. The best-performing architecture in each case is highlighted.
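The end-to-end Max F1 reported in Table 2 is the best F1 achievable when sweeping a decision threshold over candidate scores. As a generic sketch of such a threshold-swept metric (not the paper's actual evaluation code, which operates on per-field assignments), it can be computed as:

```python
def max_f1(scores, labels):
    """Best F1 over all score thresholds.

    `scores` are model outputs in [0, 1]; `labels` are 1 for true
    extractions and 0 otherwise.
    """
    best = 0.0
    # Each distinct score is a candidate threshold: predict positive
    # whenever score >= threshold.
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Sweeping the threshold rather than fixing it at 0.5 decouples the quality of the score ranking from the calibration of the scores themselves.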
which translates to about 3.6 points improvement in the end-to-end Max F1. In fact, our model beats the baseline in every field in our invoice target schema as well. This difference in performance clearly demonstrates the need to incorporate token positions to extract information accurately from form-like documents. Using neighbor position information, the MLP baseline is able to outperform the BoW baseline as well, but the improvement in end-to-end Max F1 is only about 2 points. This result demonstrates that our proposed architecture is better able to encode position information than a naive MLP.

Similarly, on the receipt corpus, our model outperforms both baselines. The improvement is much larger for the total field, at more than 20 points. For the date field, since there are too few negative candidates in the dataset, all the models have comparable performance end-to-end.

A close examination of the per-field performance metrics in Table 2 reveals that model performance is greatly affected both by the number of positive training candidates and by the ratio of positives to negatives. The best performance is observed for fields that occur frequently in invoices (e.g., invoice_id) and where the candidate generator emits only a small number of negatives for each positive (e.g., invoice_date). Conversely, the fields that are hardest to extract are those that are relatively rare and have low-precision candidate generators, viz., amount_due and total_tax_amount.

We also studied our model performance over various ablation setups and found that the relative order in which various features influence generalization performance is: neighbor text > candidate position > neighbor position. This result is also borne out by the fact that the BoW baseline, which omits the last of these features, is quite competitive with the other approaches.

We also compared the performance of our proposed architecture with and without the self-attention layer applied to the neighbor encodings. We found that self-attention contributes greatly to model performance for the invoice corpus: not only did self-attention lead to a 1-point improvement in scorer ROC AUC and a 1.7-point improvement in end-to-end Max F1, we also observed an improvement in every single field in our invoice schema.

6.2 Meaningful internal representations

We investigated the internal representations learned by our model by visualizing their 2-D projections using t-SNE. Figure 4(a) shows the representations learned for date candidates. They are colored based on the ground truth data indicating whether they belong to one of invoice_date, due_date, or delivery_date. The learned encodings clearly show three distinct (by color) coherent clusters matching the respective field labels.

Figure 4(b) shows the candidate encodings for a sample of positive and negative date candidates for the invoice_date field, along with the embedding for that field. It is apparent that the encodings of the positive examples are largely clustered together, whereas the sampled negatives show a more uniform and sparse spatial distribution. Furthermore, the field embedding lies close to the cluster of positive examples. It is interesting to note that the field embedding lies not at the center of the cluster, but rather at its edge, as far away as possible from the clusters of positive examples for other fields. This pattern is predicted by the fact that the loss function is essentially trying to minimize the cosine distance between the field embedding and its positives, while maximizing its distance from its negatives, most importantly the positives for the other fields.

Figure 4: t-SNE visualizations for (a) positive candidate encodings for the date fields in the target schema for invoices, and (b) positive and negative candidate encodings for the invoice_date field as well as its field embedding. (c), (d) and (e) show three cases of misclustered candidate encodings.

We also indicate three cases of misclustered candidate encodings in Figure 4(a), whose corresponding invoice candidates and their neighborhoods are excerpted below. Figure 4(c) shows a ground truth positive invoice_date example whose encoding is far from the invoice_date cluster. It is clear from examining the invoice that this is an error in the ground truth labels provided by the human annotator. In fact, this date is the date of purchase and not the invoice date. The candidate shown in Figure 4(d) has a candidate encoding that lies midway between due_date, its true label, and invoice_date. We believe this is explained by the fact that this date has both the terms “Due Date” and “date of invoice” nearby, which are usually indicative of due_date and invoice_date respectively. Finally, Figure 4(e) shows a true invoice_date example whose encoding is far away from all the field clusters. A closer examination of the features of this candidate showed that our OCR engine was unable to detect the word “Date” just above the date due to scanning noise. Since this crucial word was missing from the neighbors of this candidate, the learned neighborhood representation was clearly incorrect.

7 Related Work

Information extraction from plain text documents for tasks like named entity recognition and relation extraction has benefited from recent advances in deep learning (Lample et al., 2016; Peng et al., 2017). However, these techniques are not directly applicable to our task on form-like documents. Palm et al. (2017) attempt to use RNNs to extract information from form-like documents; however, they treat each line as a vector of n-grams, limiting the resulting accuracy.

The importance of understanding visual layout was recognized even in the context of information extraction from webpages in recent work (Cai et al., 2004; Yu et al., 2003; Zhu et al., 2006; Cai et al., 2003). These techniques are, however, not immediately applicable in our context since we do not have access to the source markup representation for the documents we deal with.

A common approach to solving the problem of extracting information from form-like documents is to register templates in a system, match new documents to an existing template, and use an extractor learnt from said template (Chiticariu et al., 2013; Schuster et al., 2013). The learning problem we tackle in this paper is more ambitious; we seek to generalize to unseen templates.

Our work is most closely related to recent attempts to combine layout features with text signals. Liu et al. (2019) use a document graph and introduce a graph combination model to combine visual and textual signals in the document. Katti et al. (2018) represent a document as a two-dimensional grid of text tokens. Zhao et al. (2019) show that using grid information can be useful for information extraction tasks. Denk and Reisswig (2019) combine the grid-based approach with BERT-based text encodings. While an apples-to-apples comparison with these approaches is difficult without a shared benchmark, our system has several advantages: in contrast to the graph-based approaches (Liu et al., 2019), we focus on the harder problem of generalizing to unseen templates rather than dealing with the variations within a template. Since we are not starting with raw pixels, our approach is computationally less expensive than grid-based approaches. Further, we do not require the clever heuristics needed to construct a multi-scale grid for the image-segmentation style abstraction to work well.

To the best of our knowledge, our approach of using representation learning for this task is the first of its kind. We gain many of the well-known benefits of this approach (Bengio et al., 2013), most notably interpretability.

8 Conclusion and Future Work

In this paper, we presented a novel approach to the task of extracting structured information from templatic documents using representation learning. We showed that our extraction system using this approach not only has promising accuracy on unseen templates in two different domains, but also that the learned representations lend themselves to interpretation of loss cases.

In this initial foray into this challenging problem, we limited our scope to fields with domain-agnostic types like dates and numbers, and which have only one true value in a document. In future work, we hope to tackle repeated fields and learn domain-specific candidate generators. We are also actively investigating how our learned candidate representations can be used for transfer learning to a new domain and, ultimately, in a few-shot setting.

Acknowledgements We are grateful to Lauro Costa, Evan Huang, Will Lu, Lukas Rutishauser, Mu Wang, and Yang Xu on the Google Cloud team for their support with data collection, benchmarking, and continuous feedback on our ideas. We are also grateful to our research intern, Beliz Gunel, who helped re-run several experiments and fine-tune our training pipeline.

References

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. TPAMI, 35(8):1798–1828.

Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. 2003. Extracting content structure for web pages based on visual representation. In Web Technologies and Applications, APWeb, pages 406–417.

Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. 2004. Block-based web search. In SIGIR, pages 456–463.

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP, pages 827–832.

Nilesh Dalvi, Ravi Kumar, and Mohamed Soliman. 2011. Automatic wrappers for large scale web extraction. In VLDB, volume 4, pages 219–230.

Brian L. Davis, Bryan S. Morse, Scott Cohen, Brian L. Price, and Chris Tensmeyer. 2019. Deep visual template-free form parsing. CoRR, abs/1909.02576.

Timo I. Denk and Christian Reisswig. 2019. BERTgrid: Contextualized embedding for 2d document representation and understanding. CoRR, abs/1909.04948.

iPayables. 2016. Why Automation Matters: A Survey Study of the Modern Accounts Payable Department. Technical report, iPayables.

Anoop R. Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2d documents. In EMNLP, pages 4459–4469.

Brian Kulis. 2013. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL, pages 260–270.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the variance of the adaptive learning rate and beyond. In ICLR.

Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. 2019. Graph convolution for multimodal information extraction from visually rich documents. In NAACL, pages 32–39.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605.

Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan - A configuration-free invoice analysis system using recurrent neural networks. In ICDAR, pages 406–413.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. TACL, 5:101–115.

Sunita Sarawagi. 2008. Information extraction. Foundations and Trends® in Databases, 1(3):261–377.