New Ranking Algorithms For Parsing and Tagging: Kernels Over Discrete Structures, and The Voted Perceptron
The idea of inner products between feature vectors is central to learning algorithms such as Support Vector Machines (SVMs), and is also central to the ideas in this paper. Intuitively, the inner product is a similarity measure between objects: structures with similar feature vectors will have high values for h(x) · h(y). More formally, it has been observed that many algorithms can be implemented using inner products between training examples alone, without direct access to the feature vectors themselves. As we will see in this paper, this can be crucial for the efficiency of learning with certain representations. Following the SVM literature, we call a function K(x, y) of two objects x and y a "kernel" if it can be shown that K is an inner product in some feature space h.

3 Algorithms

3.1 Notation

This section formalizes the idea of linear models for parsing or tagging. The method is related to the boosting approach to ranking problems (Freund et al. 1998), the Markov Random Field methods of (Johnson et al. 1999), and the boosting approaches for parsing in (Collins 2000). The set-up is as follows. Training data is a set of example input/output pairs: in parsing, each example pairs a sentence s_i with its correct parse tree. We assume some way of enumerating a set of candidate parses C(s_i) for each sentence, writing x_{ij} for the j'th of the n_i candidates for the i'th sentence, and taking x_{i1} to be the correct tree. Each candidate is represented by a feature vector h(x) in R^d, and is scored as F(x) = w · h(x) for a parameter vector w in R^d; the output of the model on a sentence s is the highest-scoring member of C(s). We describe how the parameters are set, using the perceptron algorithm, in the next section.

3.2 The Perceptron Algorithm

Figure 1(a) shows the perceptron algorithm applied to the ranking task. The method assumes a training set as described in section 3.1, and a representation h of parse trees. The algorithm maintains a parameter vector w, which is initially set to be all zeros. The algorithm then makes a pass over the training set, only updating the parameter vector when a mistake is made on an example. The parameter vector update is very simple, involving adding the difference of the offending examples' representations (w = w + h(x_{i1}) − h(x_{ij}) in the figure). Intuitively, this update has the effect of increasing the parameter values for features in the correct tree, and downweighting the parameter values for features in the competitor.

a)  For i = 1 ... n
        j = argmax_{j=1...n_i} F(x_{ij})
        If (j ≠ 1) Then w = w + h(x_{i1}) − h(x_{ij})

b)  For i = 1 ... n
        j = argmax_{j=1...n_i} G(x_{ij})
        If (j ≠ 1) Then α_{i,j} = α_{i,j} + 1

Figure 1: a) The perceptron algorithm for ranking problems. b) The algorithm in dual form.
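As a concrete illustration, here is a minimal sketch of the ranking perceptron of figure 1(a) in Python. The data layout (one list of candidate feature dictionaries per training sentence, with the correct candidate x_{i1} stored first) and the function names are illustrative assumptions, not something specified in the paper.

    from collections import defaultdict

    def train_perceptron(training_set, num_passes=1):
        """Ranking perceptron of figure 1(a).

        training_set: list of sentences; each sentence is a list of candidate
        feature vectors (dicts mapping feature name -> count), with the
        correct candidate (x_{i1}) at index 0.
        """
        w = defaultdict(float)  # parameter vector, initially all zeros

        def score(h):
            # F(x) = w . h(x), computed over the candidate's non-zero features
            return sum(w[f] * v for f, v in h.items())

        for _ in range(num_passes):
            for candidates in training_set:
                # j = argmax_{j = 1 ... n_i} F(x_{ij})
                j = max(range(len(candidates)), key=lambda k: score(candidates[k]))
                if j != 0:
                    # Mistake: w = w + h(x_{i1}) - h(x_{ij})
                    for f, v in candidates[0].items():
                        w[f] += v
                    for f, v in candidates[j].items():
                        w[f] -= v
        return w

Because the feature dictionaries are sparse, each scoring and update step touches only non-zero features, which is the reading of d used in footnote 4 below.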
See (Cristianini and Shawe-Taylor 2000) for discussion of the perceptron algorithm, including an overview of various theorems justifying this way of setting the parameters. Briefly, the perceptron algorithm is guaranteed[3] to find a hyperplane that correctly classifies all training points, if such a hyperplane exists (i.e., the data is "separable"). Moreover, the number of mistakes made will be low, providing that the data is separable with "large margin", and this translates to guarantees about how the method generalizes to test examples. (Freund & Schapire 1999) give theorems showing that the voted perceptron (a variant described below) generalizes well even given non-separable data.

[3] To find such a hyperplane the algorithm must be run over the training set repeatedly until no mistakes are made; the algorithm in figure 1 includes just a single pass over the training set.

3.3 The Algorithm in Dual Form

Figure 1(b) shows an equivalent algorithm to the perceptron, an algorithm which we will call the "dual form" of the perceptron. The dual-form algorithm does not store a parameter vector w, instead storing a set of dual parameters α_{i,j} for i = 1 ... n, j = 2 ... n_i. The score for a parse x is defined by the dual parameters as

    G(x) = Σ_{i,j} α_{i,j} ( h(x_{i1}) · h(x) − h(x_{ij}) · h(x) )

This is in contrast to F(x) = w · h(x), the score in the original algorithm.

In spite of these differences the algorithms give identical results on training and test examples: to see this, it can be verified that w = Σ_{i,j} α_{i,j} ( h(x_{i1}) − h(x_{ij}) ), and hence that G(x) = F(x), throughout training.

The important difference between the algorithms lies in the analysis of their computational complexity. Say T is the size of the training set, i.e., T = Σ_i n_i. Also, take d to be the dimensionality of the parameter vector w. Then the algorithm in figure 1(a) takes O(Td) time.[4] This follows because F(x) must be calculated for each member of the training set, and each calculation of F involves O(d) time. Now say the time taken to compute the inner product between two examples is k. The running time of the algorithm in figure 1(b) is O(Tnk). This follows because throughout the algorithm the number of non-zero dual parameters is bounded by n, and hence the calculation of G(x) takes at most O(nk) time. (Note that the dual form algorithm runs in quadratic time in the number of training examples n, because T ≥ n.)

The dual algorithm is therefore more efficient in cases where nk << d. This might seem unlikely to be the case: naively, it would be expected that the time to calculate the inner product h(x) · h(y) between two vectors would be at least O(d). But it turns out that for some high-dimensional representations the inner product can be calculated in much better than O(d) time, making the dual form algorithm the more efficient of the two. The dual form algorithm goes back to (Aizerman et al. 1964). See (Cristianini and Shawe-Taylor 2000) for more explanation of the algorithm.

[4] If the vectors h(x) are sparse, then d can be taken to be the number of non-zero elements of h, assuming that it takes O(d) time to add feature vectors with O(d) non-zero elements, or to take inner products.
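A corresponding sketch of the dual form of figure 1(b), under the same assumed data layout, except that candidates can now be arbitrary objects and all access to them goes through a kernel function K(x, y) = h(x) · h(y); the helper names are again illustrative.

    def train_dual_perceptron(training_set, kernel):
        """Dual-form perceptron of figure 1(b).

        training_set: list of sentences, each a list of candidate objects
                      with the correct candidate (x_{i1}) at index 0.
        kernel:       function K(x, y) returning the inner product h(x) . h(y).
        Returns the non-zero dual parameters alpha as a dict {(i, j): value}.
        """
        alpha = {}

        def G(x):
            # G(x) = sum_{i,j} alpha_{i,j} (h(x_{i1}) . h(x) - h(x_{ij}) . h(x))
            return sum(a * (kernel(training_set[i][0], x) - kernel(training_set[i][j], x))
                       for (i, j), a in alpha.items())

        for i, candidates in enumerate(training_set):
            j = max(range(len(candidates)), key=lambda k: G(candidates[k]))
            if j != 0:
                alpha[(i, j)] = alpha.get((i, j), 0) + 1
        return alpha

Since a single pass increments at most one α_{i,j} per training sentence, the number of non-zero dual parameters stays bounded by n, which is where the O(Tnk) bound above comes from.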
3.4 The Voted Perceptron

(Freund & Schapire 1999) describe a refinement of the perceptron algorithm, the "voted perceptron". They give theory which suggests that the voted perceptron is preferable in cases of noisy or non-separable data. The training phase of the algorithm is unchanged; the change is in how the method is applied to test examples. The algorithm in figure 1(b) can be considered to build a series of hypotheses G^t(x), for t = 1 ... n, where G^t is defined as

    G^t(x) = Σ_{i ≤ t, j} α_{i,j} ( h(x_{i1}) · h(x) − h(x_{ij}) · h(x) )

G^t is the scoring function from the algorithm trained on just the first t training examples. The output of a model trained on the first t examples for a sentence s is V^t(s) = argmax_{x ∈ C(s)} G^t(x). Thus the training algorithm can be seen as producing a sequence of models whose outputs on a test sentence s are V^t(s) for t = 1 ... n. The voted perceptron picks as its output the parse that occurs most often among V^1(s), ..., V^n(s).
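The voting step can be sketched as follows, again under assumed data structures: training records the mistakes (i, j) in order, so that the prefix of mistakes made on the first t examples determines G^t, and the n per-model outputs V^t(s) are tallied. This bookkeeping is an implementation convenience rather than part of the paper's description.

    from collections import Counter

    def voted_predict(candidates, training_set, mistakes, kernel):
        """Voted-perceptron output for one test sentence.

        candidates:   the candidate set C(s) for the test sentence.
        training_set: as used in training (to look up x_{i1} and x_{ij}).
        mistakes:     list of (i, j) pairs, in the order the single training
                      pass made them (so alpha_{i,j} = 1 for each entry).
        kernel:       K(x, y) = h(x) . h(y).
        """
        n = len(training_set)
        scores = [0.0] * len(candidates)  # incrementally built G^t(x) per candidate
        votes = Counter()
        u = 0
        for t in range(n):
            # Fold in any mistake made on training example t, moving G^t to G^{t+1}.
            while u < len(mistakes) and mistakes[u][0] == t:
                i, j = mistakes[u]
                for c, x in enumerate(candidates):
                    scores[c] += kernel(training_set[i][0], x) - kernel(training_set[i][j], x)
                u += 1
            # V^{t+1}(s) = argmax over the candidate set of the current score.
            votes[max(range(len(candidates)), key=lambda c: scores[c])] += 1
        # The voted perceptron returns the candidate chosen most often.
        return candidates[votes.most_common(1)[0][0]]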
Figure 2: a) An example parse tree (for the sentence "John saw the man"). b) The sub-trees of the NP covering "the man". The tree in (a) contains all of these subtrees, as well as many others.

4 A Tree Kernel

We now consider the all sub-trees representation: conceptually, the tree fragments seen in training data are enumerated as 1 ... d, and each tree is represented by a d-dimensional vector where the i'th component counts the number of occurrences of the i'th tree fragment. Define the function h_i(x) to be the number of occurrences of the i'th tree fragment in tree x, so that x is now represented as h(x) = (h_1(x), h_2(x), ..., h_d(x)). Note that d will be huge (a given tree will have a number of sub-trees that is exponential in its size). Because of this we aim to design algorithms whose computational complexity is independent of d.

The key to our efficient use of this representation is a dynamic programming algorithm that computes the inner product between two examples x1 and x2 in polynomial time (in the size of the trees involved), rather than O(d) time. The algorithm is described in (Collins and Duffy 2001), but for completeness we repeat it here. We first define the set of nodes in trees x1 and x2 as N1 and N2 respectively. We define the indicator function I_i(n) to be 1 if sub-tree i is seen rooted at node n and 0 otherwise. It follows that h_i(x1) = Σ_{n1 ∈ N1} I_i(n1) and h_i(x2) = Σ_{n2 ∈ N2} I_i(n2). The first step to efficient computation of the inner product is the following property:

    h(x1) · h(x2) = Σ_i h_i(x1) h_i(x2)
                  = Σ_i ( Σ_{n1 ∈ N1} I_i(n1) ) ( Σ_{n2 ∈ N2} I_i(n2) )
                  = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Σ_i I_i(n1) I_i(n2)
                  = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)

where we define Δ(n1, n2) = Σ_i I_i(n1) I_i(n2). Next, we note that Δ(n1, n2) can be computed efficiently, due to the following recursive definition:

- If the productions at n1 and n2 are different, Δ(n1, n2) = 0.
- If the productions at n1 and n2 are the same, and n1 and n2 are pre-terminals,[5] then Δ(n1, n2) = 1.
- Else if the productions at n1 and n2 are the same and n1 and n2 are not pre-terminals, Δ(n1, n2) = Π_{j=1...nc(n1)} ( 1 + Δ(child(n1, j), child(n2, j)) ), where nc(n1) is the number of children of n1 (equal to nc(n2), because the productions are the same) and child(n, j) is the j'th child of node n.

[5] Pre-terminals are nodes directly above words in the surface string, for example the N, V, and D symbols in Figure 2.
To see that this recursive definition is correct, note that Δ(n1, n2) simply counts the number of common subtrees that are found rooted at both n1 and n2. The first two cases are trivially correct. The last, recursive, definition follows because a common subtree for n1 and n2 can be formed by taking the production at n1/n2, together with a choice at each child of simply taking the non-terminal at that child, or any one of the common sub-trees at that child. Thus there are (1 + Δ(child(n1, i), child(n2, i))) possible choices at the i'th child. (Note that a similar recursion is described by Goodman (Goodman 1996), Goodman's application being the conversion of Bod's model (Bod 1998) to an equivalent PCFG.)

It is clear from the identity h(x1) · h(x2) = Σ_{n1, n2} Δ(n1, n2), and the recursive definition of Δ(n1, n2), that h(x1) · h(x2) can be calculated in O(|N1| |N2|) time.
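The recursion translates directly into code. The sketch below assumes a particular tree encoding (a node is a (label, children) tuple, and a pre-terminal's single child is the word string); the encoding and the memoization via lru_cache are illustrative choices rather than details from the paper, and the downweighting of larger structures mentioned in section 6.2 is omitted.

    from functools import lru_cache

    def production(node):
        """The production at a node, e.g. ('NP', ('D', 'N')) or ('D', ('the',))."""
        label, children = node
        return (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))

    def is_preterminal(node):
        """Pre-terminals dominate a single word (footnote 5)."""
        children = node[1]
        return len(children) == 1 and not isinstance(children[0], tuple)

    @lru_cache(maxsize=None)  # memoize over node pairs: this is the dynamic program
    def delta(n1, n2):
        """Delta(n1, n2): the number of common subtrees rooted at both n1 and n2."""
        if production(n1) != production(n2):
            return 0
        if is_preterminal(n1):
            return 1
        result = 1
        for c1, c2 in zip(n1[1], n2[1]):
            result *= 1 + delta(c1, c2)  # (1 + Delta(child(n1, j), child(n2, j)))
        return result

    def tree_kernel(x1, x2):
        """h(x1) . h(x2) = sum over all node pairs (n1, n2) of Delta(n1, n2)."""
        def nodes(t):
            yield t
            for child in t[1]:
                if isinstance(child, tuple):
                    yield from nodes(child)
        return sum(delta(n1, n2) for n1 in nodes(x1) for n2 in nodes(x2))

    # The NP covering "the man" (figure 2) has six distinct sub-trees under this
    # representation, each occurring once, so its kernel with itself is 6.
    np = ('NP', (('D', ('the',)), ('N', ('man',))))
    assert tree_kernel(np, np) == 6

Memoizing delta over node pairs is what gives the O(|N1| |N2|) behaviour referred to above: each pair of nodes is expanded at most once.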
Figure 3: a) A tagged sequence ("Lou Gerstner is chairman of IBM", tagged N N V N P N). b) Example "fragments" of the tagged sequence: the tagging kernel is sensitive to the counts of all such fragments.

5 A Tagging Kernel

A tagged sequence is a sequence of word/state pairs x = {w1/s1 ... wn/sn} where wi is the i'th word, and si is the tag for that word. The particular representation we consider is similar to the all sub-trees representation for trees. A tagged-sequence "fragment" is a subgraph that contains a subsequence of state labels, where each label may or may not contain the word below it. See figure 3 for an example. Each tagged sequence is represented by a d-dimensional vector where the i'th component h_i(x) counts the number of occurrences of the i'th fragment in x.

The inner product under this representation can be calculated using dynamic programming in a very similar way to the tree algorithm. We first define the set of states in tagged sequences x1 and x2 as N1 and N2 respectively. Each state has an associated tag and an associated word.
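Since the tagging dynamic program itself is not reproduced above, the following brute-force sketch is only meant to make the representation concrete: it enumerates fragments explicitly (feasible only for short sequences) and counts those shared by two tagged sequences. It assumes that a fragment is a contiguous run of tags in which each position either keeps or drops its word; the exact fragment set intended by figure 3 may differ, and the paper computes the same inner product with a dynamic program analogous to the tree case.

    from collections import Counter
    from itertools import product

    def fragments(tagged_seq):
        """Multiset of fragments of a tagged sequence.

        tagged_seq: list of (word, tag) pairs, e.g. [('Lou', 'N'), ('Gerstner', 'N'), ...].
        Assumption: a fragment is a contiguous run of tags in which each
        position independently keeps or drops its word.
        """
        counts = Counter()
        n = len(tagged_seq)
        for start in range(n):
            for end in range(start + 1, n + 1):
                span = tagged_seq[start:end]
                for keep in product((True, False), repeat=len(span)):
                    frag = tuple((tag, word if k else None)
                                 for (word, tag), k in zip(span, keep))
                    counts[frag] += 1
        return counts

    def tagging_kernel_bruteforce(x1, x2):
        """h(x1) . h(x2): fragment counts in x1 times matching counts in x2, summed."""
        c1, c2 = fragments(x1), fragments(x2)
        return sum(v * c2[f] for f, v in c1.items())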
6.2 Named-Entity Extraction

Over a period of a year or so we have had over one million words of named-entity data annotated. The data is drawn from web pages, the aim being to support a question-answering system over web data. A number of categories are annotated: the usual people, organization and location categories, as well as less frequent categories such as brand names, scientific terms, event titles (such as concerts) and so on. As a result, we created a training set of 53,609 sentences (1,047,491 words), and a test set of 14,717 sentences (291,898 words).

The task we consider is to recover named-entity boundaries. We leave the recovery of the categories of entities to a separate stage of processing. We evaluate different methods on the task through precision and recall.[7] The problem can be framed as a tagging task: to tag each word as being either the start of an entity, a continuation of an entity, or not part of an entity at all. As a baseline model we used a maximum-entropy tagger, very similar to the one described in (Ratnaparkhi 1996). Maximum-entropy taggers have been shown to be highly competitive on a number of tagging tasks, such as part-of-speech tagging (Ratnaparkhi 1996) and named-entity recognition (Borthwick et al. 1998). Thus the maximum-entropy tagger we used represents a serious baseline for the task. We used a feature set which included the current, next, and previous word; the previous two tags; and various capitalization and other features of the word being tagged (the full feature set is described in (Collins 2002a)).

[7] If a method proposes p entities on the test set, and some of these are correct, then the precision of the method is 100% times the number of correct proposals divided by p. Similarly, if g is the number of entities in the human-annotated version of the test set, then the recall is 100% times the number of correct proposals divided by g.

As a baseline we trained a model on the full 53,609 sentences of training data, and decoded the 14,717 sentences of test data using a beam search which keeps the top 20 hypotheses at each stage of a left-to-right search. In training the voted perceptron we split the training data into a 41,992 sentence training set and an 11,617 sentence development set. The training set was split into 5 portions, and in each case the maximum-entropy tagger was trained on 4/5 of the data, then used to decode the remaining 1/5. In this way the whole training data was decoded. The top 20 hypotheses under a beam search, together with their log probabilities, were recovered for each training sentence. In a similar way, a model trained on the 41,992 sentence set was used to produce 20 hypotheses for each sentence in the development set.

As in the parsing experiments, the final kernel incorporates the probability from the maximum-entropy tagger, i.e. h_2(x) · h_2(y) = L(x) L(y) + h(x) · h(y) with a weight on the first term, where L(x) is the log-likelihood of x under the tagging model, h(x) · h(y) is the tagging kernel described previously, and the weight is a free parameter balancing the two terms. The other free parameter in the kernel is the decay parameter, which determines how quickly larger structures are downweighted. In running several training runs with different parameter values, and then testing error rates on the development set, the best parameter values we found were 0.2 and 0.5 for these two parameters. Figure 5 shows results on the test data for the baseline maximum-entropy tagger and the voted perceptron. The results show a 15.6% relative improvement in F-measure.
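For reference, the evaluation quantities of footnote 7, together with the balanced F-measure used to summarize precision and recall; the F-measure formula is the standard one rather than one spelled out in the text, and the argument names are illustrative.

    def precision_recall_f(num_proposed, num_correct, num_gold):
        """Precision, recall and balanced F-measure for entity boundary detection.

        num_proposed: number of entities proposed by the method (p in footnote 7).
        num_correct:  number of proposed entities that are correct.
        num_gold:     number of entities in the human-annotated test set (g).
        """
        precision = 100.0 * num_correct / num_proposed
        recall = 100.0 * num_correct / num_gold
        # Balanced F-measure: the harmonic mean of precision and recall.
        f_measure = 2.0 * precision * recall / (precision + recall)
        return precision, recall, f_measure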
7 Relationship to Previous Work

(Bod 1998) describes quite different parameter estimation and parsing methods for the DOP representation. The methods explicitly deal with the parameters associated with subtrees, with sub-sampling of tree fragments making the computation manageable. Even after this, Bod's method is left with a huge grammar: (Bod 2001) describes a grammar with over 5 million sub-structures. The method requires search for the 1,000 most probable derivations under this grammar, using beam search, presumably a challenging computational task given the size of the grammar. In spite of these problems, (Bod 2001) gives excellent results for the method on parsing Wall Street Journal text. The algorithms in this paper have a different flavor, avoiding the need to explicitly deal with feature vectors that track all subtrees, and also avoiding the need to sum over an exponential number of derivations underlying a given tree.

(Goodman 1996) gives a polynomial time conversion of a DOP model into an equivalent PCFG whose size is linear in the size of the training set. The method uses a similar recursion to the common sub-trees recursion described in this paper. Goodman's method still leaves exact parsing under the model intractable (because of the need to sum over multiple derivations underlying the same tree), but he gives an approximation to finding the most probable tree, which can be computed efficiently.

From a theoretical point of view, it is difficult to find motivation for the parameter estimation methods used by (Bod 1998); see (Johnson 2002) for discussion. In contrast, the parameter estimation methods in this paper have a strong theoretical basis (see (Cristianini and Shawe-Taylor 2000), chapter 2, and (Freund & Schapire 1999) for statistical theory underlying the perceptron).

For related work on the voted perceptron algorithm applied to NLP problems, see (Collins 2002a) and (Collins 2002b). (Collins 2002a) describes experiments on the same named-entity dataset as in this paper, but using explicit features rather than kernels. (Collins 2002b) describes how the voted perceptron can be used to train maximum-entropy style taggers, and also gives a more thorough discussion of the theory behind the perceptron algorithm applied to ranking tasks.

Acknowledgements  Many thanks to Jack Minisi for annotating the named-entity data used in the experiments. Thanks to Rob Schapire and Yoram Singer for many useful discussions.
References

Aizerman, M., Braverman, E., & Rozonoer, L. (1964). Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control, 25:821-837.

Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

Bod, R. (2001). What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy? In Proceedings of ACL 2001.

Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the Sixth Workshop on Very Large Corpora.

Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of NAACL 2000.

Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD Dissertation, University of Pennsylvania.

Collins, M. (2000). Discriminative Reranking for Natural Language Parsing. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000).

Collins, M., and Duffy, N. (2001). Convolution Kernels for Natural Language. In Proceedings of Neural Information Processing Systems (NIPS 14).

Collins, M. (2002a). Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. In Proceedings of ACL 2002.

Collins, M. (2002b). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.

Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

Freund, Y., & Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277-296.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference. San Francisco: Morgan Kaufmann.

Goodman, J. (1996). Efficient algorithms for parsing the DOP model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 143-152.

Haussler, D. (1999). Convolution Kernels on Discrete Structures. Technical report, University of California at Santa Cruz.

Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic "unification-based" grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

Johnson, M. (2002). The DOP estimation method is biased and inconsistent. Computational Linguistics, 28, 71-76.

Lodhi, H., Cristianini, N., Shawe-Taylor, J., & Watkins, C. (2001). Text Classification using String Kernels. In Advances in Neural Information Processing Systems 13, MIT Press.

Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, 313-330.

Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65, 386-408. (Reprinted in Neurocomputing (MIT Press, 1998).)