Representation Learning for Information Extraction from Form-like Documents

Bodhisattwa Prasad Majumder†♣  Navneet Potti♠  Sandeep Tata♠
James B. Wendt♠  Qi Zhao♠  Marc Najork♠

♣Department of Computer Science and Engineering, UC San Diego
[email protected]
♠Google Research, Mountain View
{navsan, tata, jwendt, zhaqi, najork}@google.com

†Work done during an internship at Google Research.

Abstract

We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases.

Figure 1: Excerpts from sample invoices from different vendors. Instances of the invoice_date field are highlighted in green.

1 Introduction

In this paper, we present a novel approach to the task of extracting structured information from form-like documents using a learned representation of an extraction candidate. Form-like documents like invoices, purchase orders, tax forms and insurance quotes are common in day-to-day business workflows, but current techniques for processing them largely still employ either manual effort or brittle and error-prone heuristics for extraction. The research question motivating our work is the following: given a target set of fields for a particular domain – e.g., due date and total amount for invoices – along with a small set of manually-labeled examples, can we learn to extract these fields from unseen documents?

Take, for instance, the domain of invoices, a document type that large enterprises often receive and process thousands of times every week (iPayables, 2016). Invoices from different vendors often present the same types of information but with different layouts and positioning. Figure 1 shows the headers of invoices from a few different vendors showing the invoice date (highlighted in green) and number in different layouts. Furthermore, invoices from the same supplier even share similar presentation and differ only in specific values. We refer to this unit of visual pattern that is similar across a collection of documents as a template, and the fields of information that are common across templates in a domain as the schema. The schema consists of fields like invoice_date and total_amount, each associated with a type like date and currency.

Extracting values for these fields from a given document, particularly one belonging to an unseen template, is a challenging problem for many reasons. In contrast to most prior work on information extraction (Sarawagi, 2008), templatic documents do not contain much prose. Approaches that work well on natural text organized in sentences cannot be applied directly to such documents, where spatial layout elements like tables and grid formatting are commonplace. Understanding spatial relationships is critical for achieving good extraction performance on such documents. Moreover, these documents are usually in PDF or scanned image formats, so these presentation hints are not explicitly available in a markup language.
Techniques that are successful on HTML documents such as web pages, including traditional wrapper induction approaches (Dalvi et al., 2011), are therefore not immediately applicable.

Recently, there has been a surge in research interest in solving this extraction task by adapting techniques in natural language processing (Liu et al., 2019), computer vision (Davis et al., 2019), or combinations thereof (Katti et al., 2018). In contrast to this body of work, we propose an approach based on representation learning for this task. We first generate extraction candidates for each target field using its associated type (e.g., all dates as candidates for invoice_date). We then use a neural network model to learn a dense representation for each extraction candidate independent of the field to which it belongs. We also learn a separate representation for the field itself, and use the similarity between the candidate and field representations to score the candidate according to how likely it is to be the true extraction value for that field.

The design of our extraction system rests on a few observations about how information is often laid out in form-like documents (see Section 2). An advantage of our representation learning approach is that it allows us to encode certain priors we developed based on these observations into the architecture of the neural network and its input features (see Section 4). In fact, our experiments show that our proposed neural architecture outperforms a more naive MLP baseline using the same input features by about 10 F1 points on the extraction task for two different domains (see Section 6). Furthermore, the learned candidate representations are also meaningful and lend themselves to interpretation, as we show by delving into some loss cases.

2 Observations about Forms

We make three key observations about form-like documents that inform our design.

Observation 1 Each field often corresponds to a well-understood type. For example, the only likely extraction candidates for the invoice_date field in an invoice are instances of dates. A currency amount like $25.00 would clearly be incorrect. Since there are orders of magnitude fewer dates on an invoice than there are text tokens, limiting the search space by type dramatically simplifies the problem. Consequently, we use a library of detectors for several common types such as dates, currency amounts, integers, address portals, email addresses, etc. to generate candidates.

Observation 2 Each field instance is usually associated with a key phrase that bears an apparent visual relationship with it. Consider the invoice excerpt in Figure 1(c). It contains two date instances, only one of which is the true invoice_date, as indicated by the word "Date" next to it. Similarly, in the bottom-right invoice excerpt, we are easily able to distinguish between the invoice number (indicated by "Invoice #") and the purchase order number (indicated by "PO #"). We call such indicative words key phrases.

Proximity is not the only criterion that defines a key phrase. For instance, the word "Date" is not the nearest one to the true invoice_date instance in Figure 1(c); the document number in the line above and the page number below are clearly closer. It is also not the case that the key phrase always occurs on the same line; Figure 1(a) shows a case where the key phrase "DATE" occurs just above the true invoice_date. An effective solution needs to combine the spatial information along with the textual information. Fortunately, in our experience, these spatial relationships exhibit only a small number of variations across templates, and these tend to generalize across fields and domains.

Observation 3 Key phrases for a field are largely drawn from a small vocabulary of field-specific variants. In a corpus of invoices we collected, we observed that, as exemplified by the samples in Figure 1, about 93% of the nearly 8400 invoice date instances were associated with key phrases that included the words "date" or "dated", and about 30% included "invoice". Only about 7% of invoice dates had neither of these words in their key phrases. Similarly, 87% of the nearly 2800 due_date instances in our corpus had key phrases that contained the word "due", and 81% contained "date". We found similar patterns for all other fields we investigated. The fact that there are only a small number of field-specific key phrases suggests that this problem may be tractable with modest amounts of training data.

While these observations are applicable to many fields across different document types, there are several exceptions, which we plan to tackle in future work.
3 Extraction Pipeline

We leveraged the observations laid out in Section 2 to build a system to solve the information extraction task for form-like documents. Given a document and a target schema, we generate extraction candidates for each field from the document text using the field type. We then score each candidate independently using a neural scoring model. Finally, we assign at most one scored candidate as an extraction result for each field. We discuss the stages of this pipeline here, and delve into the architecture of the scoring model in Section 4.

3.1 Ingestion

Our system can ingest both native digital and scanned documents. We render each document to an image and use a cloud OCR service (cloud.google.com/vision) to extract all the text in it.

The text in the OCR result is arranged in the form of a hierarchy, with individual characters at the leaf level, and words, paragraphs and blocks respectively at higher levels. The nodes in each level of the hierarchy are associated with bounding boxes represented in the 2-D Cartesian plane of the document page. The words in a paragraph are arranged in reading order, as are the paragraphs and blocks themselves.
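As a concrete picture of this structure, the following is a minimal Python sketch of the OCR hierarchy; it is a hypothetical data model for illustration only, not our system's actual types, and it elides the character and block levels.

```python
# Illustrative data model for the OCR output: words nested in paragraphs and
# pages, each node carrying a bounding box in the page's 2-D coordinate plane.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BBox:
    x_min: float
    y_min: float
    x_max: float
    y_max: float  # all in page pixels

    @property
    def centroid(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

@dataclass
class Word:
    text: str
    bbox: BBox

@dataclass
class Paragraph:
    words: List[Word] = field(default_factory=list)  # in reading order
    bbox: Optional[BBox] = None

@dataclass
class Page:
    width: int
    height: int
    paragraphs: List[Paragraph] = field(default_factory=list)  # reading order
```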
3.2 Candidate Generation

In Section 2, we made the observation that fields in our target schema correspond to well-understood types like dates, integers, currency amounts, addresses, etc. There are well-known techniques to detect instances of these types in text, ranging from regular expression matching and heuristics to sequence labeling using models trained on web data.

We associate each field type supported by our system with one or more candidate generators. These generators use a cloud-based entity extraction service (cloud.google.com/natural-language) to detect spans of the OCR text extracted from the documents that are instances of the corresponding type. For example, every date in an invoice becomes a candidate for every date field in the target schema, viz. invoice_date, due_date and delivery_date.

Since the recall of the overall extraction system cannot exceed that of the candidate generators, it is important that their recall be high. Precision is, however, largely the responsibility of the scorer and assigner.
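To make type-based candidate generation concrete, here is a minimal regex-based date detector. It is an illustrative stand-in only: as described above, our generators call a cloud-based entity extraction service, and real date detection handles far more formats than the two shown.

```python
# Emit every span of the OCR text that parses as a date. Recall matters most
# here (Section 3.2); precision is left to the scorer and assigner.
import re
from datetime import datetime

DATE_PATTERNS = [
    (r"\b\d{1,2}/\d{1,2}/\d{4}\b", "%m/%d/%Y"),        # e.g. 01/23/2018
    (r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b", "%B %d, %Y"),  # e.g. January 23, 2018
]

def generate_date_candidates(page_text):
    """Yield (start, end, text) spans that look like dates."""
    for pattern, fmt in DATE_PATTERNS:
        for m in re.finditer(pattern, page_text):
            try:
                datetime.strptime(m.group(0), fmt)  # reject e.g. 13/45/2018
            except ValueError:
                continue
            yield m.start(), m.end(), m.group(0)
```

Each span yielded this way becomes a candidate for every date field in the schema.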
3.3 Scoring and Assignment

Given a set of candidates from a document for each field in the target schema, the crux of the extraction task is to identify the correct extraction candidate (if any) for each field. While there are many approaches one could take to solve this problem, we made the design choice to break it down into two steps: first, we compute a score ∈ [0, 1] for each candidate independently using a neural model; then, we assign to each field the scored candidate that is most likely to be the true extraction for it.

This separation of scoring and assignment allows us to learn a representation for each candidate based only on its neighborhood, independently of other candidates and fields. It also frees us to encode arbitrarily complex business rules into the assigner if required, for example, that the due date for an invoice cannot (chronologically) precede its invoice date, or that the line item prices must sum up to the total.

For brevity, we omit the details of the assignment module and report results using a simple assigner that chooses the highest-scoring candidate for each field independently of other fields.
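The simple assigner can be sketched as follows; the threshold parameter is a stand-in for the decision thresholds swept in Section 6, not a fixed value from our system.

```python
# Independently pick, for each field, the highest-scoring candidate.
# `scored` maps a field name to a list of (candidate, score) pairs.
def assign(scored, threshold=0.5):
    results = {}
    for field_name, candidates in scored.items():
        if not candidates:
            continue  # the candidate generator found nothing for this field
        best, best_score = max(candidates, key=lambda cs: cs[1])
        if best_score >= threshold:
            results[field_name] = best
    return results
```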
4 Neural Scoring Model

The scoring module takes as input the target field from the schema and the extraction candidate, and produces a prediction score ∈ [0, 1]. While the downstream assignment module consumes the scores directly, the scorer is trained and evaluated as a binary classifier. The target label for a candidate is determined by whether the candidate matches the ground truth for that document and field.

An important desideratum for us in the design of the scorer is that it learns a meaningful candidate representation. We propose an architecture where the model learns separate embeddings for the candidate and the field it belongs to, and where the similarity between the candidate and field embeddings determines the score.

We believe that such an architecture allows a single model to learn candidate representations that generalize across fields and document templates. We can conceptualize the learned representation of a candidate as encoding what words in its neighborhood form its associated key phrase since, apropos Observation 2, the spatial relationships between candidates and their key phrases are observed to generalize across fields. On the other hand, the embedding for a field can be conceptualized as encoding the key phrase variants that are usually indicative of it, apropos Observation 3.
Figure 2: Neighbor ‘Invoice’ for an invoice_date candidate, with relative position (−0.06, −0.01).

Figure 3: Neural Scoring Model. Pos. = Positional, Cand. = Candidate, Embed. = Embedding.

4.1 Candidate features

We would like our model to learn a representation of a candidate that captures its neighborhood. Accordingly, the essential features of a candidate are the text tokens that appear nearby, along with their positions. We use a simple heuristic to determine which OCR text tokens we consider to be the neighbors of a given candidate: we define a neighborhood zone around the candidate, extending all the way to the left of the page and about 10% of the page height above it. Any text token whose bounding box overlaps by more than half with the neighborhood zone is considered to be a neighbor.

As shown in Figure 2, we represent the position of a candidate and each of its neighbors using the 2-D Cartesian coordinates of the centroids of their respective bounding boxes. These coordinates are normalized by dividing by the corresponding page dimensions so that the features are independent of the pixel resolution of the input documents. We calculate the relative position of a neighbor as the difference between its normalized 2-D coordinates and those of the candidate. An additional feature we found to be helpful is the absolute position of the candidate itself.

An important design choice we made is to not incorporate the candidate text into the input. Note that this text was already the basis for generating the candidate in the first place. Withholding this information from the input to the model avoids accidental overfitting to our somewhat-small training datasets. For instance, since the invoices we collected were all dated prior to 2019, it is possible that providing the date itself as input to the model could cause it to learn that true invoice_date instances always occur prior to 2019.
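The neighbor heuristic and the position features can be sketched as below, reusing the BBox type from the ingestion sketch. The exact geometry of the zone's right and bottom edges is our assumption for illustration; the text above only pins down its left edge and its height.

```python
# Neighbors: tokens whose boxes overlap the neighborhood zone by more than
# half. Positions are centroids normalized by page size; a neighbor's relative
# position is its offset from the candidate's centroid.
def normalized_centroid(bbox, page_w, page_h):
    cx, cy = bbox.centroid
    return (cx / page_w, cy / page_h)  # independent of pixel resolution

def overlap_fraction(bbox, zone):
    zx0, zy0, zx1, zy1 = zone
    ix = max(0.0, min(bbox.x_max, zx1) - max(bbox.x_min, zx0))
    iy = max(0.0, min(bbox.y_max, zy1) - max(bbox.y_min, zy0))
    area = (bbox.x_max - bbox.x_min) * (bbox.y_max - bbox.y_min)
    return (ix * iy) / area if area > 0 else 0.0

def neighbors(cand_bbox, words, page_w, page_h):
    # Zone: from the left page edge to the candidate's right edge, and from
    # ~10% of the page height above the candidate down to its bottom edge.
    zone = (0.0, cand_bbox.y_min - 0.10 * page_h,
            cand_bbox.x_max, cand_bbox.y_max)
    cx, cy = normalized_centroid(cand_bbox, page_w, page_h)
    result = []
    for w in words:
        if overlap_fraction(w.bbox, zone) > 0.5:
            nx, ny = normalized_centroid(w.bbox, page_w, page_h)
            result.append((w.text, (nx - cx, ny - cy)))
    return result
```

In Figure 2's example, this produces the neighbor ‘Invoice’ with relative position (−0.06, −0.01).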
downweight a neighbor that has other neighbors
4.2 Embeddings between itself and the candidate.
As shown in Figure 3 (a)-(d), we embed each of the We pack the neighbor embeddings hi into a
candidate features separately in the following ways. matrix H ∈ RN ×2d , then transform these em-
bdeddings into query, key and value embeddings Corpus Split # Docs # Templates
through three different linear projection matrices Train 11,390 11,390
Invoices1
Wq , Wk and Wv ∈ R2d×2d . Validation 2,847 2,847
Invoices2 Test 595 595
qi = hi Wq K = HWk V = HWv Train 237 141
Receipts Validation 71 47
For each neighbor i, its query embedding qi Test 170 46
and the key embeddings K are used to obtain the
attention weight vector αi ∈ RN as follows. Table 1: Invoices and Receipts corpora
Ç å
qi K T
αi = Softmax √
2d Rd , we compute CosineSimilarity(c, f ) ∈ [−1, 1].
Finally, the model’s prediction is simply a (con-
The self-attended neighbor encoding h̃i ∈ R2d stant) linear rescaling of this similarity so that the
(see Figure 3(e)) for neighbor i is a linear combina- scores lie in [0, 1]. The model is trained using bi-
tion of the value embeddings, V ∈ RN ×2d , using nary cross entropy between this prediction and the
the above attention weights for all the neighbors target label as the loss function.
h̃i = αi V . Intuitively, this architecture ensures that the pos-
As in Vaswani et√al. (2017), we use a normal- itive candidates for a field cluster together near
ization constant of 2d to improve stability. We its field embedding, and that these clusters are set
project the self-attended neighbor encodings to a far apart from each other. We use TSNE (Maaten
larger 4 × 2d dimensional space using a linear pro- and Hinton, 2008) to visualize this phenomenon in
jection with ReLU nonlinearity, and then project Section 6.2.
them back to 2d.
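The self-attention computation above translates directly into NumPy; the sketch below uses random matrices purely as stand-ins for the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

N, d = 8, 16                     # N neighbors, embedding dimension d
rng = np.random.default_rng(0)
H = rng.normal(size=(N, 2 * d))  # initial neighbor embeddings h_i
Wq, Wk, Wv = (rng.normal(size=(2 * d, 2 * d)) for _ in range(3))

Q, K, V = H @ Wq, H @ Wk, H @ Wv           # q_i = h_i W_q, etc.
alpha = softmax(Q @ K.T / np.sqrt(2 * d))  # alpha_i = Softmax(q_i K^T / sqrt(2d))
H_tilde = alpha @ V                        # h~_i = alpha_i V, shape (N, 2d)
```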
4.3 Candidate Encoding

We combine the N neighbor encodings of size 2d each to form a single encoding of size 2d for the entire neighborhood. Since we already capture information about the relative positions of the neighbors with respect to the candidate in the embeddings themselves, it is important to ensure that the neighborhood encoding is invariant to the (arbitrary) order in which the neighbors are included in the features. Our experiments indicate that max-pooling the neighbor encodings together was the best strategy, slightly beating out mean-pooling. Next, we obtain a candidate encoding (see Figure 3(f, h, i)) by concatenating the neighborhood encoding ∈ R^{2d} with the candidate position embedding ∈ R^d and projecting (through a ReLU-activated linear layer) back down to d dimensions.

Candidate Scoring The candidate encoding is expected to contain all relevant information about the candidate, including its position and its neighborhood. By design, it is independent of the field to which said candidate belongs. This neural network is, however, trained as a binary classifier to score a candidate according to how likely it is to be the true extraction value for some field and document. Drawing inspiration from prior work in metric learning (Kulis, 2013), given a field with embedding f ∈ R^d and its candidate with encoding c ∈ R^d, we compute CosineSimilarity(c, f) ∈ [−1, 1]. Finally, the model's prediction is simply a (constant) linear rescaling of this similarity so that the scores lie in [0, 1]. The model is trained using binary cross entropy between this prediction and the target label as the loss function.

Intuitively, this architecture ensures that the positive candidates for a field cluster together near its field embedding, and that these clusters are set far apart from each other. We use t-SNE (Maaten and Hinton, 2008) to visualize this phenomenon in Section 6.2.
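The scoring head is small enough to state in a few lines. In this sketch, the rescaling (cos + 1) / 2 is our assumption of the "constant linear rescaling" described above.

```python
import numpy as np

def score(c, f):
    """Cosine similarity between candidate encoding c and field embedding f,
    rescaled from [-1, 1] into [0, 1]."""
    cos = float(np.dot(c, f) / (np.linalg.norm(c) * np.linalg.norm(f)))
    return (cos + 1.0) / 2.0

def bce_loss(pred, label, eps=1e-7):
    """Binary cross entropy between the rescaled score and the 0/1 label."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * np.log(pred) + (1 - label) * np.log(1 - pred))
```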
5 Datasets

To analyze the performance of our model, we used datasets belonging to two different domains, summarized in Table 1.

Corpus      Split        # Docs   # Templates
Invoices1   Train        11,390        11,390
            Validation    2,847         2,847
Invoices2   Test            595           595
Receipts    Train           237           141
            Validation       71            47
            Test            170            46

Table 1: Invoices and Receipts corpora.

Invoices We collected two corpora of invoices from different sources. The first corpus, Invoices1, contains 14,237 single-page invoices. Each invoice was from a different vendor, so the documents do not share any common templates. Documents from the same vendor are generated from the same template. The second corpus, Invoices2, contains 595 documents belonging to different templates, with no templates in common with Invoices1. In all of our experiments, we used a 60-40 split of templates in Invoices1 as our training and validation sets, and all the templates in Invoices2 as our test set.

We asked human annotators to provide us ground truth extraction results for the fields shown in Table 2. The candidate generator associated with each field type was used to generate examples, which were then labeled using the ground truth. About 95% of documents and fields present in the training set had at least one positive example produced by our candidate generators. The field-level recall of our candidate generators varies from about 87% for invoice_id to about 99% for invoice_date. Improving the recall of candidate generators is part of our ongoing effort.
While the candidate generators have reasonably high recall, their precision varies dramatically from field to field. For common fields like invoice_date and total_amount that are present in nearly all documents, we generate fewer than ten negatives for each positive example. On the other hand, for rare fields like total_tax_amount as well as for fields with low-precision candidate generators such as the alphanum candidate generator for purchase_order, there can sometimes be dozens of negatives for each positive. Overall, since the negatives far outnumber the positives, we found it helpful to randomly downsample negatives in the training set to keep at most 40 negatives for each positive per field. The negatives in the validation and test sets were not downsampled.
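The downsampling step can be sketched as follows (illustrative code; the representation of examples as (field, features, label) triples is hypothetical).

```python
import random

def downsample_negatives(examples, ratio=40, seed=13):
    """Keep all positives and at most `ratio` negatives per positive, per field.
    `examples` is a list of (field, features, label) triples, label in {0, 1}."""
    rng = random.Random(seed)
    by_field = {}
    for ex in examples:
        by_field.setdefault(ex[0], []).append(ex)
    kept = []
    for exs in by_field.values():
        pos = [e for e in exs if e[2] == 1]
        neg = [e for e in exs if e[2] == 0]
        rng.shuffle(neg)
        kept.extend(pos)
        kept.extend(neg[:ratio * len(pos)])
    return kept
```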
We created a vocabulary of the 512 most frequent tokens, case-normalized, taken from the OCR text of the documents in Invoices1. The vocabulary also includes special tokens for numbers ([NUMBER]), out-of-vocabulary tokens ([RARE]) and padding ([PAD]). Despite the small size of this vocabulary, it covered at least 95% of the words that occurred in key phrases across the entire corpus; excluded words were usually OCR errors.
Receipts We also evaluated our model using a publicly-available corpus of scanned receipts published as part of the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (rrc.cvc.uab.es/?ch=13). This corpus contains 626 receipt images with ground truth extraction results for four fields, viz., address, company, date and total. Using the company annotation as the template mapping, we found that these documents belong to 234 templates. The largest template contains 46 receipts, and about half the documents belong to 13 templates with more than 10 documents each. On the other hand, nearly 70% of templates only have a single document. In all of our experiments, we used a 60-20-20 split of templates as our training, validation and test sets respectively, sampling at most 5 documents from each template.

Our target schema for this extraction task consists of the date and total fields. We generated labeled examples for these two fields using a vocabulary created as above from the 512 most frequent terms in the OCR text of the receipts. The fields in this dataset did not suffer from the label imbalance problem highlighted above for invoices.

6 Experiments

In this section, we evaluate our scoring model with respect to our two key desiderata. First, in Section 6.1, we show that our model is able to help the extraction system generalize to unseen templates. Then, in Section 6.2, we probe the model to show that it learns meaningful internal representations.

In the experiments described below, we trained models using the Rectified Adam (Liu et al., 2020) optimizer with a learning rate of 0.001 for 50 epochs. For both the Invoices and Receipts datasets described in Section 5, we used the training split to train the model, the validation split to pick the model with the best hold-out loss, and the test split to report performance metrics.

6.1 Generalization to unseen templates

We measured the performance of our model's scoring predictions using ROC AUC on the test split. We also analyzed its performance in the context of the overall extraction system using the accuracy of the end-to-end extraction results, as measured by the maximum F1 score over all decision thresholds, averaged across all fields in the target schema shown in Table 2.

To demonstrate the benefits of our proposed neural architecture over a naive approach, we use two different baseline models for encoding a candidate and scoring it. The bag-of-words (BoW) baseline incorporates only the neighboring tokens of a candidate, but not their positions. The MLP baseline uses the same input features as our proposed model, including the relative positions of the candidate's neighbors, and encodes the candidate using 3 hidden layers. Both these baselines follow our representation learning approach, encoding the candidate and the field separately. Just as in our model, the final score is the cosine distance between the candidate and field encodings, normalized to [0, 1] using a sigmoid.

We chose the dimension size for each model architecture using a grid-based hyperparameter search. All the metrics we report were obtained from performing 10 training runs and picking the model with the best validation ROC AUC.
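The end-to-end metric can be read as a sweep over decision thresholds; the following is one illustrative formulation of that sweep, not our evaluation code.

```python
def max_f1(predictions, total_true):
    """predictions: (score, correct) pairs, one per document-field, where
    `correct` is 1 if the top-scored candidate matched the ground truth.
    total_true: number of document-fields that have a ground-truth value."""
    tp = fp = 0
    best = 0.0
    for score, correct in sorted(predictions, key=lambda p: -p[0]):
        tp += correct          # lowering the threshold to just below `score`
        fp += 1 - correct
        precision = tp / (tp + fp)
        recall = tp / total_true
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```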
Corpus     Field              Field Type   Train    Test     Scorer ROC AUC         End-to-End Max F1
                                           # +ves   % +ves   BoW    MLP    Ours     BoW    MLP    Ours
Invoices   amount_due         currency      5,930    4.8%    0.967  0.968  0.973    0.800  0.789  0.801
           due_date           date          5,788   12.9%    0.977  0.973  0.984    0.835  0.850  0.861
           invoice_date       date         13,638   57.4%    0.983  0.986  0.986    0.933  0.939  0.940
           invoice_id         alphanum     13,719    6.8%    0.983  0.988  0.993    0.913  0.937  0.949
           purchase_order     alphanum     13,262    2.2%    0.959  0.967  0.976    0.826  0.851  0.896
           total_amount       currency      8,182   12.5%    0.966  0.972  0.980    0.834  0.849  0.858
           total_tax_amount   currency      2,949    7.5%    0.975  0.967  0.980    0.756  0.812  0.839
           Macro-average      -                 -   14.9%    0.973  0.974  0.982    0.842  0.861  0.878
Receipts   date               date            258   85.5%    0.748  0.792  0.737    0.885  0.885  0.854
           total              currency        475   16.7%    0.834  0.796  0.889    0.631  0.607  0.813
           Macro-average      -                 -   51.1%    0.791  0.794  0.813    0.758  0.746  0.833

Table 2: Performance on the test set of unseen templates for Invoices and Receipts. The best-performing architecture in each case is highlighted.

Table 2 summarizes the results of this performance comparison. On both our evaluation datasets, our model showed a significant improvement over the baselines by both metrics. For the invoice corpus, our model outperforms the BoW baseline by about 1 point in the scorer ROC AUC, which translates to about 3.6 points improvement in the end-to-end Max F1. In fact, our model beats the baseline in every field in our invoice target schema as well. This difference in performance clearly demonstrates the need to incorporate token positions to extract information accurately from form-like documents. Using neighbor position information, the MLP baseline is able to outperform the BoW baseline as well, but the improvement in end-to-end Max F1 is only about 2 points. This result demonstrates that our proposed architecture is better able to encode position information than a naive MLP.

Similarly, for the receipt corpus also, our model outperforms both the baselines. The improvement is much larger for the total field, more than 20 points. For the date field, since there are too few negative candidates in the dataset, all the models have comparable performance end-to-end.

A close examination of the per-field performance metrics in Table 2 reveals that model performance is greatly affected by both the number of positive training candidates, as well as by the ratio of positives to negatives. The best performance is observed for fields that occur frequently in invoices (e.g., invoice_id) and where the candidate generator emits only a small number of negatives for each positive (e.g., invoice_date). Conversely, the fields that are hardest to extract are those that are relatively rare and have low-precision candidate generators, viz., amount_due and total_tax_amount.

We also studied our model performance over various ablation setups and found that the relative order in which various features influence generalization performance is: neighbor text > candidate position > neighbor position. This result is also borne out by the fact that the BoW baseline, which omits the last of these features, is quite competitive with the other approaches.

We also compared the performance of our proposed architecture with and without the self-attention layer applied to the neighbor encodings. We found that self-attention contributes greatly to model performance for the invoice corpus: not only did self-attention lead to a 1-point improvement in scorer ROC AUC and a 1.7 point improvement in end-to-end max F1, we also observed an improvement in every single field in our invoice schema.

6.2 Meaningful internal representations

We investigated the internal representations learned by our model by visualizing their 2-D projections using t-SNE. Figure 4(a) shows the representations learned for date candidates. They are colored based on the ground truth data indicating if they belong to one of invoice_date, due_date, or delivery_date. The learned encodings clearly show three distinct (by color) coherent clusters matching the respective field labels.
Figure 4: t-SNE visualizations for (a) positive candidate encodings for the date fields in the target schema for invoices, and (b) positive and negative candidate encodings for the invoice_date field as well as its field embedding. (c), (d) and (e) show three cases of misclustered candidate encodings.

Figure 4(b) shows the candidate encodings for a sample of positive and negative date candidates for the invoice_date field, along with the embedding for that field. It is apparent that the encodings of the positive examples are largely clustered together, whereas the sampled negatives show a more uniform and sparse spatial distribution. Furthermore, the field embedding lies close to the cluster of positive examples. It is interesting to note that the field embedding lies not at the center of the cluster, but rather at its edge, as far away as possible from the clusters of positive examples for other fields. This pattern is predicted by the fact that the loss function is essentially trying to minimize the cosine distance between the field embedding and its positives, while maximizing its distance from its negatives, most importantly the positives for the other fields.

We also indicate three cases of misclustered candidate encodings in Figure 4(a), whose corresponding invoice candidates and their neighborhoods are excerpted below. Figure 4(c) shows a ground truth positive invoice_date example whose encoding is far from the invoice_date cluster. It is clear from examining the invoice that this is an error in the ground truth labels provided by the human annotator; in fact, this date is the date of purchase and not the invoice date. The candidate shown in Figure 4(d) has a candidate encoding that lies midway between due_date, its true label, and invoice_date. We believe this is explained by the fact that this date has both the terms "Due Date" and "date of invoice" nearby, which are usually indicative of due_date and invoice_date respectively. Finally, Figure 4(e) shows a true invoice_date example whose encoding is far away from all the field clusters. A closer examination of the features of this candidate showed that our OCR engine was unable to detect the word "Date" just above the date due to scanning noise. Since this crucial word was missing from the neighbors of this candidate, the learned neighborhood representation was clearly incorrect.
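Such projections are straightforward to produce with scikit-learn's TSNE; the sketch below is illustrative and is not the code behind Figure 4.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_encodings(encodings, labels):
    """encodings: (n, d) array of candidate encodings; labels: n field names."""
    xy = TSNE(n_components=2, random_state=0).fit_transform(encodings)
    for name in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == name]
        plt.scatter(xy[idx, 0], xy[idx, 1], s=8, label=name)
    plt.legend()
    plt.show()
```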
7 Related Work

Information extraction from plain text documents for tasks like named entity recognition and relation extraction has benefited from recent advances in deep learning (Lample et al., 2016; Peng et al., 2017). However, these techniques are not directly applicable to our task on form-like documents. Palm et al. (2017) attempt to use RNNs to extract information from form-like documents; however, they treat each line as a vector of n-grams, limiting the resulting accuracy.

The importance of understanding visual layout was recognized even in the context of information extraction from webpages in recent work (Cai et al., 2004; Yu et al., 2003; Zhu et al., 2006; Cai et al., 2003). The techniques developed by them are, however, not immediately applicable in our context, since we do not have access to the source markup representation for the documents we deal with.

A common approach to solving the problem of extracting information from form-like documents is to register templates in a system, match new documents to an existing template, and use an extractor learnt from said template (Chiticariu et al., 2013; Schuster et al., 2013). The learning problem we tackle in this paper is more ambitious: we seek to generalize to unseen templates.

Our work is most closely related to recent attempts to combine layout features with text signals. Liu et al. (2019) use a document graph and introduce a graph combination model to combine visual and textual signals in the document. Katti et al. (2018) represent a document as a two-dimensional grid of text tokens. Zhao et al. (2019) show that using grid information can be useful for information extraction tasks. Denk and Reisswig (2019) combine the grid-based approach with BERT-based text encodings. While an apples-to-apples comparison with these approaches is difficult without a shared benchmark, our system has several advantages: in contrast to the graph-based approaches (Liu et al., 2019), we focus on the harder problem of generalizing to unseen templates rather than dealing with the variations within a template. Since we are not starting with raw pixels, our approach is computationally less expensive than grid-based approaches. Further, we do not require the clever heuristics needed to construct a multi-scale grid for the image-segmentation style abstraction to work well.

To the best of our knowledge, our approach of using representation learning for this task is the first of its kind. We gain many of the well-known benefits of this approach (Bengio et al., 2013), most notably interpretability.

8 Conclusion and Future Work

In this paper, we presented a novel approach to the task of extracting structured information from templatic documents using representation learning. We showed that our extraction system using this approach not only has promising accuracy on unseen templates in two different domains, but also that the learned representations lend themselves to interpretation of loss cases.

In this initial foray into this challenging problem, we limited our scope to fields with domain-agnostic types like dates and numbers, and which have only one true value in a document. In future work, we hope to tackle repeated fields and learn domain-specific candidate generators. We are also actively investigating how our learned candidate representations can be used for transfer learning to a new domain and, ultimately, in a few-shot setting.

Acknowledgements We are grateful to Lauro Costa, Evan Huang, Will Lu, Lukas Rutishauser, Mu Wang, and Yang Xu on the Google Cloud team for their support with data collection, benchmarking, and continuous feedback on our ideas. We are also grateful to our research intern, Beliz Gunel, who helped re-run several experiments and fine-tune our training pipeline.

References

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. TPAMI, 35(8):1798–1828.

Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. 2003. Extracting content structure for web pages based on visual representation. In Web Technologies and Applications, APWeb, pages 406–417.

Deng Cai, Shipeng Yu, Ji-Rong Wen, and Wei-Ying Ma. 2004. Block-based web search. In SIGIR, pages 456–463.

Laura Chiticariu, Yunyao Li, and Frederick R. Reiss. 2013. Rule-based information extraction is dead! Long live rule-based information extraction systems! In EMNLP, pages 827–832.

Nilesh Dalvi, Ravi Kumar, and Mohamed Soliman. 2011. Automatic wrappers for large scale web extraction. In VLDB, volume 4, pages 219–230.

Brian L. Davis, Bryan S. Morse, Scott Cohen, Brian L. Price, and Chris Tensmeyer. 2019. Deep visual template-free form parsing. CoRR, abs/1909.02576.

Timo I. Denk and Christian Reisswig. 2019. BERTgrid: Contextualized embedding for 2d document representation and understanding. CoRR, abs/1909.04948.

iPayables. 2016. Why Automation Matters: A Survey Study of the Modern Accounts Payable Department. Technical report, iPayables.

Anoop R. Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards understanding 2d documents. In EMNLP, pages 4459–4469.

Brian Kulis. 2013. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364.

Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL, pages 260–270.

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the variance of the adaptive learning rate and beyond. In ICLR.

Xiaojing Liu, Feiyu Gao, Qiong Zhang, and Huasha Zhao. 2019. Graph convolution for multimodal information extraction from visually rich documents. In NAACL, pages 32–39.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605.

Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan - A configuration-free invoice analysis system using recurrent neural networks. In ICDAR, pages 406–413.

Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. TACL, 5:101–115.

Sunita Sarawagi. 2008. Information extraction. Foundations and Trends in Databases, 1(3):261–377.

Daniel Schuster, Klemens Muthmann, Daniel Esser, Alexander Schill, Michael Berger, Christoph Weidling, Kamil Aliyev, and Andreas Hofmeier. 2013. Intellix - End-user trained information extraction for document archiving. In ICDAR, pages 101–105.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.

Shipeng Yu, Deng Cai, Ji-Rong Wen, and Wei-Ying Ma. 2003. Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In WWW, pages 11–18.

Xiaohui Zhao, Zhuo Wu, and Xiaoguang Wang. 2019. CUTIE: Learning to understand documents with convolutional universal text information extractor. CoRR, abs/1903.12363.

Jun Zhu, Zaiqing Nie, Ji-Rong Wen, Bo Zhang, and Wei-Ying Ma. 2006. Simultaneous record detection and attribute labeling in web data extraction. In KDD, pages 494–503.
