New Ranking Algorithms For Parsing and Tagging: Kernels Over Discrete Structures, and The Voted Perceptron
The idea of inner products between feature vectors is central to learning algorithms such as Support Vector Machines (SVMs), and is also central to the ideas in this paper. Intuitively, the inner product is a similarity measure between objects: structures with similar feature vectors will have high values for h(x) · h(y). More formally, it has been observed that many algorithms can be implemented using inner products between training examples alone, without direct access to the feature vectors themselves. As we will see in this paper, this can be crucial for the efficiency of learning with certain representations. Following the SVM literature, we call a function K(x, y) of two objects x and y a "kernel" if it can be shown that K is an inner product in some feature space h.

3 Algorithms

3.1 Notation

This section formalizes the idea of linear models for parsing or tagging. The method is related to the boosting approach to ranking problems (Freund et al. 1998), the Markov Random Field methods of (Johnson et al. 1999), and the boosting approaches for parsing in (Collins 2000). The set-up is as follows. Training data is a set of example input/output pairs: in parsing, each example pairs a sentence s_i with its correct parse tree. We assume some way of enumerating a set of candidate parses C(s_i) for each sentence, writing x_{ij} for the j'th of the n_i candidates for the i'th sentence, and taking x_{i1} to be the correct tree. Each candidate is represented by a feature vector h(x) in R^d, and is scored as F(x) = w · h(x) for a parameter vector w in R^d; the output of the model on a sentence s is the highest-scoring member of C(s). We describe how the parameters are set, using the perceptron algorithm, in the next section.

3.2 The Perceptron Algorithm

Figure 1(a) shows the perceptron algorithm applied to the ranking task. The method assumes a training set as described in section 3.1, and a representation h of parse trees. The algorithm maintains a parameter vector w, which is initially set to be all zeros. The algorithm then makes a pass over the training set, only updating the parameter vector when a mistake is made on an example. The parameter vector update is very simple, involving adding the difference of the offending examples' representations (w = w + h(x_{i1}) − h(x_{ij}) in the figure). Intuitively, this update has the effect of increasing the parameter values for features in the correct tree, and downweighting the parameter values for features in the competitor.

a)  For i = 1 ... n
        j = argmax_{j=1...n_i} F(x_{ij})
        If (j ≠ 1) Then w = w + h(x_{i1}) − h(x_{ij})

b)  For i = 1 ... n
        j = argmax_{j=1...n_i} G(x_{ij})
        If (j ≠ 1) Then α_{i,j} = α_{i,j} + 1

Figure 1: a) The perceptron algorithm for ranking problems. b) The algorithm in dual form.
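As a concrete illustration, here is a minimal sketch of the ranking perceptron of figure 1(a) in Python. The data layout (one list of candidate feature dictionaries per training sentence, with the correct candidate x_{i1} stored first) and the function names are illustrative assumptions, not something specified in the paper.

    from collections import defaultdict

    def train_perceptron(training_set, num_passes=1):
        """Ranking perceptron of figure 1(a).

        training_set: list of sentences; each sentence is a list of candidate
        feature vectors (dicts mapping feature name -> count), with the
        correct candidate (x_{i1}) at index 0.
        """
        w = defaultdict(float)  # parameter vector, initially all zeros

        def score(h):
            # F(x) = w . h(x), computed over the candidate's non-zero features
            return sum(w[f] * v for f, v in h.items())

        for _ in range(num_passes):
            for candidates in training_set:
                # j = argmax_{j = 1 ... n_i} F(x_{ij})
                j = max(range(len(candidates)), key=lambda k: score(candidates[k]))
                if j != 0:
                    # Mistake: w = w + h(x_{i1}) - h(x_{ij})
                    for f, v in candidates[0].items():
                        w[f] += v
                    for f, v in candidates[j].items():
                        w[f] -= v
        return w

Because the feature dictionaries are sparse, each scoring and update step touches only non-zero features, which is the reading of d used in footnote 4 below.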
See (Cristianini and Shawe-Taylor 2000) for discussion of the perceptron algorithm, including an overview of various theorems justifying this way of setting the parameters. Briefly, the perceptron algorithm is guaranteed[3] to find a hyperplane that correctly classifies all training points, if such a hyperplane exists (i.e., the data is "separable"). Moreover, the number of mistakes made will be low, providing that the data is separable with "large margin", and this translates to guarantees about how the method generalizes to test examples. (Freund & Schapire 1999) give theorems showing that the voted perceptron (a variant described below) generalizes well even given non-separable data.

[3] To find such a hyperplane the algorithm must be run over the training set repeatedly until no mistakes are made; the algorithm in figure 1 includes just a single pass over the training set.

3.3 The Algorithm in Dual Form

Figure 1(b) shows an equivalent algorithm to the perceptron, an algorithm which we will call the "dual form" of the perceptron. The dual-form algorithm does not store a parameter vector w, instead storing a set of dual parameters α_{i,j} for i = 1 ... n, j = 2 ... n_i. The score for a parse x is defined by the dual parameters as

    G(x) = Σ_{i,j} α_{i,j} ( h(x_{i1}) · h(x) − h(x_{ij}) · h(x) )

This is in contrast to F(x) = w · h(x), the score in the original algorithm.

In spite of these differences the algorithms give identical results on training and test examples: to see this, it can be verified that w = Σ_{i,j} α_{i,j} ( h(x_{i1}) − h(x_{ij}) ), and hence that G(x) = F(x), throughout training.

The important difference between the algorithms lies in the analysis of their computational complexity. Say T is the size of the training set, i.e., T = Σ_i n_i. Also, take d to be the dimensionality of the parameter vector w. Then the algorithm in figure 1(a) takes O(Td) time.[4] This follows because F(x) must be calculated for each member of the training set, and each calculation of F involves O(d) time. Now say the time taken to compute the inner product between two examples is k. The running time of the algorithm in figure 1(b) is O(Tnk). This follows because throughout the algorithm the number of non-zero dual parameters is bounded by n, and hence the calculation of G(x) takes at most O(nk) time. (Note that the dual form algorithm runs in quadratic time in the number of training examples n, because T ≥ n.)

The dual algorithm is therefore more efficient in cases where nk << d. This might seem unlikely to be the case: naively, it would be expected that the time to calculate the inner product h(x) · h(y) between two vectors would be at least O(d). But it turns out that for some high-dimensional representations the inner product can be calculated in much better than O(d) time, making the dual form algorithm the more efficient of the two. The dual form algorithm goes back to (Aizerman et al. 1964). See (Cristianini and Shawe-Taylor 2000) for more explanation of the algorithm.

[4] If the vectors h(x) are sparse, then d can be taken to be the number of non-zero elements of h, assuming that it takes O(d) time to add feature vectors with O(d) non-zero elements, or to take inner products.
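A corresponding sketch of the dual form of figure 1(b), under the same assumed data layout, except that candidates can now be arbitrary objects and all access to them goes through a kernel function K(x, y) = h(x) · h(y); the helper names are again illustrative.

    def train_dual_perceptron(training_set, kernel):
        """Dual-form perceptron of figure 1(b).

        training_set: list of sentences, each a list of candidate objects
                      with the correct candidate (x_{i1}) at index 0.
        kernel:       function K(x, y) returning the inner product h(x) . h(y).
        Returns the non-zero dual parameters alpha as a dict {(i, j): value}.
        """
        alpha = {}

        def G(x):
            # G(x) = sum_{i,j} alpha_{i,j} (h(x_{i1}) . h(x) - h(x_{ij}) . h(x))
            return sum(a * (kernel(training_set[i][0], x) - kernel(training_set[i][j], x))
                       for (i, j), a in alpha.items())

        for i, candidates in enumerate(training_set):
            j = max(range(len(candidates)), key=lambda k: G(candidates[k]))
            if j != 0:
                alpha[(i, j)] = alpha.get((i, j), 0) + 1
        return alpha

Since a single pass increments at most one α_{i,j} per training sentence, the number of non-zero dual parameters stays bounded by n, which is where the O(Tnk) bound above comes from.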
3.4 The Voted Perceptron

(Freund & Schapire 1999) describe a refinement of the perceptron algorithm, the "voted perceptron". They give theory which suggests that the voted perceptron is preferable in cases of noisy or non-separable data. The training phase of the algorithm is unchanged; the change is in how the method is applied to test examples. The algorithm in figure 1(b) can be considered to build a series of hypotheses G^t(x), for t = 1 ... n, where G^t is defined as

    G^t(x) = Σ_{i ≤ t, j} α_{i,j} ( h(x_{i1}) · h(x) − h(x_{ij}) · h(x) )

G^t is the scoring function from the algorithm trained on just the first t training examples. The output of a model trained on the first t examples for a sentence s is V^t(s) = argmax_{x ∈ C(s)} G^t(x). Thus the training algorithm can be seen as producing a sequence of models whose outputs on a test sentence s are V^t(s) for t = 1 ... n. The voted perceptron picks as its output the parse that occurs most often among V^1(s), ..., V^n(s).
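The voting step can be sketched as follows, again under assumed data structures: training records the mistakes (i, j) in order, so that the prefix of mistakes made on the first t examples determines G^t, and the n per-model outputs V^t(s) are tallied. This bookkeeping is an implementation convenience rather than part of the paper's description.

    from collections import Counter

    def voted_predict(candidates, training_set, mistakes, kernel):
        """Voted-perceptron output for one test sentence.

        candidates:   the candidate set C(s) for the test sentence.
        training_set: as used in training (to look up x_{i1} and x_{ij}).
        mistakes:     list of (i, j) pairs, in the order the single training
                      pass made them (so alpha_{i,j} = 1 for each entry).
        kernel:       K(x, y) = h(x) . h(y).
        """
        n = len(training_set)
        scores = [0.0] * len(candidates)  # incrementally built G^t(x) per candidate
        votes = Counter()
        u = 0
        for t in range(n):
            # Fold in any mistake made on training example t, moving G^t to G^{t+1}.
            while u < len(mistakes) and mistakes[u][0] == t:
                i, j = mistakes[u]
                for c, x in enumerate(candidates):
                    scores[c] += kernel(training_set[i][0], x) - kernel(training_set[i][j], x)
                u += 1
            # V^{t+1}(s) = argmax over the candidate set of the current score.
            votes[max(range(len(candidates)), key=lambda c: scores[c])] += 1
        # The voted perceptron returns the candidate chosen most often.
        return candidates[votes.most_common(1)[0][0]]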
Figure 2: a) An example parse tree (for the sentence "John saw the man"). b) The sub-trees of the NP covering "the man". The tree in (a) contains all of these subtrees, as well as many others.

4 A Tree Kernel

We now consider the all sub-trees representation: conceptually, the tree fragments seen in training data are enumerated as 1 ... d, and each tree is represented by a d-dimensional vector where the i'th component counts the number of occurrences of the i'th tree fragment. Define the function h_i(x) to be the number of occurrences of the i'th tree fragment in tree x, so that x is now represented as h(x) = (h_1(x), h_2(x), ..., h_d(x)). Note that d will be huge (a given tree will have a number of sub-trees that is exponential in its size). Because of this we aim to design algorithms whose computational complexity is independent of d.

The key to our efficient use of this representation is a dynamic programming algorithm that computes the inner product between two examples x1 and x2 in polynomial time (in the size of the trees involved), rather than O(d) time. The algorithm is described in (Collins and Duffy 2001), but for completeness we repeat it here. We first define the set of nodes in trees x1 and x2 as N1 and N2 respectively. We define the indicator function I_i(n) to be 1 if sub-tree i is seen rooted at node n and 0 otherwise. It follows that h_i(x1) = Σ_{n1 ∈ N1} I_i(n1) and h_i(x2) = Σ_{n2 ∈ N2} I_i(n2). The first step to efficient computation of the inner product is the following property:

    h(x1) · h(x2) = Σ_i h_i(x1) h_i(x2)
                  = Σ_i ( Σ_{n1 ∈ N1} I_i(n1) ) ( Σ_{n2 ∈ N2} I_i(n2) )
                  = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Σ_i I_i(n1) I_i(n2)
                  = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)

where we define Δ(n1, n2) = Σ_i I_i(n1) I_i(n2). Next, we note that Δ(n1, n2) can be computed efficiently, due to the following recursive definition:

- If the productions at n1 and n2 are different, Δ(n1, n2) = 0.
- If the productions at n1 and n2 are the same, and n1 and n2 are pre-terminals,[5] then Δ(n1, n2) = 1.
- Else if the productions at n1 and n2 are the same and n1 and n2 are not pre-terminals, Δ(n1, n2) = Π_{j=1...nc(n1)} ( 1 + Δ(child(n1, j), child(n2, j)) ), where nc(n1) is the number of children of n1 (equal to nc(n2), because the productions are the same) and child(n, j) is the j'th child of node n.

[5] Pre-terminals are nodes directly above words in the surface string, for example the N, V, and D symbols in Figure 2.
To see that this recursive definition is correct, note that Δ(n1, n2) simply counts the number of common subtrees that are found rooted at both n1 and n2. The first two cases are trivially correct. The last, recursive, definition follows because a common subtree for n1 and n2 can be formed by taking the production at n1/n2, together with a choice at each child of simply taking the non-terminal at that child, or any one of the common sub-trees at that child. Thus there are (1 + Δ(child(n1, i), child(n2, i))) possible choices at the i'th child. (Note that a similar recursion is described by Goodman (Goodman 1996), Goodman's application being the conversion of Bod's model (Bod 1998) to an equivalent PCFG.)

It is clear from the identity h(x1) · h(x2) = Σ_{n1, n2} Δ(n1, n2), and the recursive definition of Δ(n1, n2), that h(x1) · h(x2) can be calculated in O(|N1| |N2|) time.
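The recursion translates directly into code. The sketch below assumes a particular tree encoding (a node is a (label, children) tuple, and a pre-terminal's single child is the word string); the encoding and the memoization via lru_cache are illustrative choices rather than details from the paper, and the downweighting of larger structures mentioned in section 6.2 is omitted.

    from functools import lru_cache

    def production(node):
        """The production at a node, e.g. ('NP', ('D', 'N')) or ('D', ('the',))."""
        label, children = node
        return (label, tuple(c[0] if isinstance(c, tuple) else c for c in children))

    def is_preterminal(node):
        """Pre-terminals dominate a single word (footnote 5)."""
        children = node[1]
        return len(children) == 1 and not isinstance(children[0], tuple)

    @lru_cache(maxsize=None)  # memoize over node pairs: this is the dynamic program
    def delta(n1, n2):
        """Delta(n1, n2): the number of common subtrees rooted at both n1 and n2."""
        if production(n1) != production(n2):
            return 0
        if is_preterminal(n1):
            return 1
        result = 1
        for c1, c2 in zip(n1[1], n2[1]):
            result *= 1 + delta(c1, c2)  # (1 + Delta(child(n1, j), child(n2, j)))
        return result

    def tree_kernel(x1, x2):
        """h(x1) . h(x2) = sum over all node pairs (n1, n2) of Delta(n1, n2)."""
        def nodes(t):
            yield t
            for child in t[1]:
                if isinstance(child, tuple):
                    yield from nodes(child)
        return sum(delta(n1, n2) for n1 in nodes(x1) for n2 in nodes(x2))

    # The NP covering "the man" (figure 2) has six distinct sub-trees under this
    # representation, each occurring once, so its kernel with itself is 6.
    np = ('NP', (('D', ('the',)), ('N', ('man',))))
    assert tree_kernel(np, np) == 6

Memoizing delta over node pairs is what gives the O(|N1| |N2|) behaviour referred to above: each pair of nodes is expanded at most once.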
Figure 3: a) A tagged sequence ("Lou Gerstner is chairman of IBM", tagged N N V N P N). b) Example "fragments" of the tagged sequence: the tagging kernel is sensitive to the counts of all such fragments.

5 A Tagging Kernel

A tagged sequence is a sequence of word/state pairs x = {w1/s1 ... wn/sn} where wi is the i'th word, and si is the tag for that word. The particular representation we consider is similar to the all sub-trees representation for trees. A tagged-sequence "fragment" is a subgraph that contains a subsequence of state labels, where each label may or may not contain the word below it. See figure 3 for an example. Each tagged sequence is represented by a d-dimensional vector where the i'th component h_i(x) counts the number of occurrences of the i'th fragment in x.

The inner product under this representation can be calculated using dynamic programming in a very similar way to the tree algorithm. We first define the set of states in tagged sequences x1 and x2 as N1 and N2 respectively. Each state has an associated tag and an associated word.
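Since the tagging dynamic program itself is not reproduced above, the following brute-force sketch is only meant to make the representation concrete: it enumerates fragments explicitly (feasible only for short sequences) and counts those shared by two tagged sequences. It assumes that a fragment is a contiguous run of tags in which each position either keeps or drops its word; the exact fragment set intended by figure 3 may differ, and the paper computes the same inner product with a dynamic program analogous to the tree case.

    from collections import Counter
    from itertools import product

    def fragments(tagged_seq):
        """Multiset of fragments of a tagged sequence.

        tagged_seq: list of (word, tag) pairs, e.g. [('Lou', 'N'), ('Gerstner', 'N'), ...].
        Assumption: a fragment is a contiguous run of tags in which each
        position independently keeps or drops its word.
        """
        counts = Counter()
        n = len(tagged_seq)
        for start in range(n):
            for end in range(start + 1, n + 1):
                span = tagged_seq[start:end]
                for keep in product((True, False), repeat=len(span)):
                    frag = tuple((tag, word if k else None)
                                 for (word, tag), k in zip(span, keep))
                    counts[frag] += 1
        return counts

    def tagging_kernel_bruteforce(x1, x2):
        """h(x1) . h(x2): fragment counts in x1 times matching counts in x2, summed."""
        c1, c2 = fragments(x1), fragments(x2)
        return sum(v * c2[f] for f, v in c1.items())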
6.2 Named-Entity Extraction

Over a period of a year or so we have had over one million words of named-entity data annotated. The data is drawn from web pages, the aim being to support a question-answering system over web data. A number of categories are annotated: the usual people, organization and location categories, as well as less frequent categories such as brand names, scientific terms, event titles (such as concerts) and so on. As a result, we created a training set of 53,609 sentences (1,047,491 words), and a test set of 14,717 sentences (291,898 words).

The task we consider is to recover named-entity boundaries. We leave the recovery of the categories of entities to a separate stage of processing. We evaluate different methods on the task through precision and recall.[7] The problem can be framed as a tagging task: to tag each word as being either the start of an entity, a continuation of an entity, or not part of an entity at all. As a baseline model we used a maximum-entropy tagger, very similar to the one described in (Ratnaparkhi 1996). Maximum-entropy taggers have been shown to be highly competitive on a number of tagging tasks, such as part-of-speech tagging (Ratnaparkhi 1996) and named-entity recognition (Borthwick et al. 1998). Thus the maximum-entropy tagger we used represents a serious baseline for the task. We used a feature set which included the current, next, and previous word; the previous two tags; and various capitalization and other features of the word being tagged (the full feature set is described in (Collins 2002a)).

[7] If a method proposes p entities on the test set, and some of these are correct, then the precision of the method is 100% times the number of correct proposals divided by p. Similarly, if g is the number of entities in the human-annotated version of the test set, then the recall is 100% times the number of correct proposals divided by g.

As a baseline we trained a model on the full 53,609 sentences of training data, and decoded the 14,717 sentences of test data using a beam search which keeps the top 20 hypotheses at each stage of a left-to-right search. In training the voted perceptron we split the training data into a 41,992 sentence training set and an 11,617 sentence development set. The training set was split into 5 portions, and in each case the maximum-entropy tagger was trained on 4/5 of the data, then used to decode the remaining 1/5. In this way the whole training data was decoded. The top 20 hypotheses under a beam search, together with their log probabilities, were recovered for each training sentence. In a similar way, a model trained on the 41,992 sentence set was used to produce 20 hypotheses for each sentence in the development set.

As in the parsing experiments, the final kernel incorporates the probability from the maximum-entropy tagger, i.e. h_2(x) · h_2(y) = L(x) L(y) + h(x) · h(y) with a weight on the first term, where L(x) is the log-likelihood of x under the tagging model, h(x) · h(y) is the tagging kernel described previously, and the weight is a free parameter balancing the two terms. The other free parameter in the kernel is the decay parameter, which determines how quickly larger structures are downweighted. In running several training runs with different parameter values, and then testing error rates on the development set, the best parameter values we found were 0.2 and 0.5 for these two parameters. Figure 5 shows results on the test data for the baseline maximum-entropy tagger and the voted perceptron. The results show a 15.6% relative improvement in F-measure.
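For reference, the evaluation quantities of footnote 7, together with the balanced F-measure used to summarize precision and recall; the F-measure formula is the standard one rather than one spelled out in the text, and the argument names are illustrative.

    def precision_recall_f(num_proposed, num_correct, num_gold):
        """Precision, recall and balanced F-measure for entity boundary detection.

        num_proposed: number of entities proposed by the method (p in footnote 7).
        num_correct:  number of proposed entities that are correct.
        num_gold:     number of entities in the human-annotated test set (g).
        """
        precision = 100.0 * num_correct / num_proposed
        recall = 100.0 * num_correct / num_gold
        # Balanced F-measure: the harmonic mean of precision and recall.
        f_measure = 2.0 * precision * recall / (precision + recall)
        return precision, recall, f_measure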
7 Relationship to Previous Work

(Bod 1998) describes quite different parameter estimation and parsing methods for the DOP representation. The methods explicitly deal with the parameters associated with subtrees, with sub-sampling of tree fragments making the computation manageable. Even after this, Bod's method is left with a huge grammar: (Bod 2001) describes a grammar with over 5 million sub-structures. The method requires search for the 1,000 most probable derivations under this grammar, using beam search, presumably a challenging computational task given the size of the grammar. In spite of these problems, (Bod 2001) gives excellent results for the method on parsing Wall Street Journal text. The algorithms in this paper have a different flavor, avoiding the need to explicitly deal with feature vectors that track all subtrees, and also avoiding the need to sum over an exponential number of derivations underlying a given tree.

(Goodman 1996) gives a polynomial time conversion of a DOP model into an equivalent PCFG whose size is linear in the size of the training set. The method uses a similar recursion to the common sub-trees recursion described in this paper. Goodman's method still leaves exact parsing under the model intractable (because of the need to sum over multiple derivations underlying the same tree), but he gives an approximation to finding the most probable tree, which can be computed efficiently.

From a theoretical point of view, it is difficult to find motivation for the parameter estimation methods used by (Bod 1998); see (Johnson 2002) for discussion. In contrast, the parameter estimation methods in this paper have a strong theoretical basis (see (Cristianini and Shawe-Taylor 2000), chapter 2, and (Freund & Schapire 1999) for statistical theory underlying the perceptron).

For related work on the voted perceptron algorithm applied to NLP problems, see (Collins 2002a) and (Collins 2002b). (Collins 2002a) describes experiments on the same named-entity dataset as in this paper, but using explicit features rather than kernels. (Collins 2002b) describes how the voted perceptron can be used to train maximum-entropy style taggers, and also gives a more thorough discussion of the theory behind the perceptron algorithm applied to ranking tasks.

Acknowledgements  Many thanks to Jack Minisi for annotating the named-entity data used in the experiments. Thanks to Rob Schapire and Yoram Singer for many useful discussions.
References

Aizerman, M., Braverman, E., & Rozonoer, L. (1964). Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning. Automation and Remote Control, 25:821-837.

Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications/Cambridge University Press.

Bod, R. (2001). What is the Minimal Set of Fragments that Achieves Maximal Parse Accuracy? In Proceedings of ACL 2001.

Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition. In Proceedings of the Sixth Workshop on Very Large Corpora.

Charniak, E. (2000). A maximum-entropy-inspired parser. In Proceedings of NAACL 2000.

Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD Dissertation, University of Pennsylvania.

Collins, M. (2000). Discriminative Reranking for Natural Language Parsing. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000).

Collins, M., and Duffy, N. (2001). Convolution Kernels for Natural Language. In Proceedings of Neural Information Processing Systems (NIPS 14).

Collins, M. (2002a). Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. In Proceedings of ACL 2002.

Collins, M. (2002b). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with the Perceptron Algorithm. In Proceedings of EMNLP 2002.

Cristianini, N., and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press.

Freund, Y., & Schapire, R. (1999). Large Margin Classification using the Perceptron Algorithm. Machine Learning, 37(3):277-296.

Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Machine Learning: Proceedings of the Fifteenth International Conference. San Francisco: Morgan Kaufmann.

Goodman, J. (1996). Efficient algorithms for parsing the DOP model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 143-152.

Haussler, D. (1999). Convolution Kernels on Discrete Structures. Technical report, University of California at Santa Cruz.

Johnson, M., Geman, S., Canon, S., Chi, Z., & Riezler, S. (1999). Estimators for stochastic "unification-based" grammars. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics.

Johnson, M. (2002). The DOP estimation method is biased and inconsistent. Computational Linguistics, 28, 71-76.

Lodhi, H., Cristianini, N., Shawe-Taylor, J., & Watkins, C. (2001). Text Classification using String Kernels. In Advances in Neural Information Processing Systems 13, MIT Press.

Marcus, M., Santorini, B., & Marcinkiewicz, M. (1993). Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19, 313-330.

Ratnaparkhi, A. (1996). A maximum entropy part-of-speech tagger. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65, 386-408. (Reprinted in Neurocomputing (MIT Press, 1998).)