Log-Linear Models and Conditional Random Fields
Contents
1 Likelihood and logistic regression
  1.1 Principle of maximum likelihood
  1.2 Maximum likelihood for Bernoulli distributions
  1.3 Conditional likelihood
  1.4 Logistic regression
2 Stochastic gradient training
  2.1 Logistic regression gradient
  2.2 Gradient ascent, one example at a time
  2.3 Properties of stochastic gradient training
3 Log-linear models
  3.1 The general log-linear model
  3.2 Feature functions
4 Conditional random fields
  4.1 A typical CRF application
  4.2 Linear-chain CRFs in general
  4.3 Inference algorithms for linear-chain CRFs
  4.4 Training CRFs by stochastic gradient ascent
5 Alternative CRF training methods
  5.1 The Collins perceptron
  5.2 Gibbs sampling
  5.3 Contrastive divergence
6 Tutorials and selected papers
1 Likelihood and logistic regression

1.1 Principle of maximum likelihood
Consider a family of probability distributions defined by a set of parameters θ. The distributions may be either probability mass functions (pmfs) or probability density functions (pdfs). Suppose we have a random sample drawn from a fixed but unknown member of this family. The random sample is a training set of n examples x1 to xn. We assume that the examples are independent, so the probability of the set is the product of the probabilities of the individual examples:

f(x1, . . . , xn; θ) = ∏_j f(xj; θ).
Usually we think of the distribution as fixed and the examples xj as unknown, or varying. However, we can think of the training data as fixed and consider alternative parameter values. This is the point of view behind the definition of the likelihood function:

L(θ; x1, . . . , xn) = f(x1, . . . , xn; θ).

Note that if f(x; θ) is a probability mass function, then the likelihood is always less than one, but if f(x; θ) is a probability density function, then the likelihood can be greater than one, since densities can be greater than one.
The principle of maximum likelihood says that we should use as our model the distribution f(·; θ̂) that gives the greatest possible probability to the training data. Formally,

θ̂ = argmax_θ L(θ; x1, . . . , xn).

The value θ̂ is called the maximum likelihood estimator (MLE) of θ. Note that in general each xj is a vector of values, and θ is a vector of real-valued parameters. For example, for a Gaussian distribution θ = ⟨μ, σ²⟩.

Notational note: In the expression p(y|x; θ) the semicolon indicates that θ is a parameter, not a random variable that is being conditioned on, even though it is to the right of the vertical bar. Viewed as a mapping, this expression is simply a function of three arguments. Viewed as a probability, it is a property of two random variables. In a Bayesian framework, parameters are also viewed as random variables, and one can write expressions such as p(θ|x). We are not doing a Bayesian analysis, so we indicate that θ is not a random variable.
1.2 Maximum likelihood for Bernoulli distributions
As a first example of finding a maximum likelihood estimator, consider the parameter θ of a Bernoulli distribution. A random variable with this distribution is a formalization of a coin toss. The value of the random variable is 1 with probability θ and 0 with probability 1 - θ. Let X be a Bernoulli random variable. We have

P(X = x) = θ if x = 1
P(X = x) = 1 - θ if x = 0.

For mathematical convenience write this as

P(X = x) = θ^x (1 - θ)^(1-x).

Suppose the training data are x1 through xn where each xj ∈ {0, 1}. We maximize the likelihood function

L(θ; x1, . . . , xn) = f(x1, . . . , xn; θ) = θ^h (1 - θ)^(n-h)

where h = ∑_i xi. The maximization is over the possible values 0 ≤ θ ≤ 1. We can do the maximization by setting the derivative with respect to θ equal to zero. The derivative is

∂/∂θ [θ^h (1 - θ)^(n-h)] = h θ^(h-1) (1 - θ)^(n-h) - θ^h (n - h)(1 - θ)^(n-h-1)
 = θ^(h-1) (1 - θ)^(n-h-1) [h(1 - θ) - (n - h)θ]
which has solutions θ = 0, θ = 1, and θ = h/n. The solution that is a maximum is clearly θ = h/n, while θ = 0 and θ = 1 are minima. So we have the maximum likelihood estimate θ̂ = h/n.

The log likelihood function is simply the logarithm of the likelihood function. Because the logarithm is a strictly increasing function, maximizing the log likelihood is precisely equivalent to maximizing the likelihood, or to minimizing the negative log likelihood.
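As a quick numerical illustration (a sketch with made-up data, not part of the original derivation), the following Python snippet evaluates the likelihood on a grid of θ values and confirms that the maximizer is close to h/n.

    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])    # hypothetical coin-toss data
    n, h = len(x), x.sum()                           # n = 10, h = 7

    thetas = np.linspace(0.001, 0.999, 999)          # grid over (0, 1)
    likelihood = thetas**h * (1 - thetas)**(n - h)   # L(theta) = theta^h (1 - theta)^(n - h)

    theta_hat = thetas[np.argmax(likelihood)]
    print(theta_hat, h / n)                          # both are approximately 0.7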
1.3 Conditional likelihood
An important extension of the idea of likelihood is conditional likelihood. The conditional likelihood of θ given data x and y is L(θ; y|x) = f(y|x; θ). Intuitively, y follows a probability distribution that is different for different x, but x itself is never unknown, so there is no need to have a probabilistic model of it. Technically, for each x there is a different distribution f(y|x; θ) of y, but all these distributions share the same parameters θ.

Given training data consisting of ⟨xi, yi⟩ pairs, the principle of maximum conditional likelihood says to choose a parameter estimate θ̂ that maximizes the product ∏_i f(yi|xi; θ). Note that we do not need to assume that the xi are independent in order to justify the conditional likelihood being a product; we just need to assume that the yi are independent conditional on the xi. For any specific value of x, θ̂ can be used to predict values for y; we assume that we never want to predict values of x.
1.4 Logistic regression
The model

p = p(y|x; α, β) = 1 / (1 + exp(-[α + ∑_{j=1}^{d} βj xj]))

is called logistic regression. We use j to index over the feature values x1 to xd of a single example of dimensionality d, since we use i below to index over training examples 1 to n. The logistic regression model is easier to understand in the form

log [p / (1 - p)] = α + ∑_j βj xj.
The ratio p/(1 - p) is called the odds of the event y given x, and log[p/(1 - p)] is called the log odds. Since probabilities range between 0 and 1, odds range between 0 and +∞, and log odds range unboundedly between -∞ and +∞. A linear expression of the form α + ∑_j βj xj can also take unbounded values, so it is reasonable to use a linear expression as a model for log odds, but not as a model for odds or for probabilities. Essentially, logistic regression is the simplest possible model for a random yes/no outcome that depends linearly on predictors x1 to xd.

For each feature j, exp(βj xj) is a multiplicative scaling factor on the odds p/(1 - p). If the predictor xj is binary, then exp(βj) is the extra odds of having the outcome y = 1 when xj = 1, compared to when xj = 0.

Note that it is acceptable, and indeed often beneficial, to include a large number of features in a logistic regression model. Some features may be derived, i.e. computed as deterministic functions of other features. One great advantage of logistic regression in comparison to other classifiers is that the training process will find optimal coefficients for features regardless of whether the features are correlated. Other learning methods, in particular naive Bayes, do not work well when the feature values of training or test examples are correlated.

A second major advantage of logistic regression is that it gives well-calibrated probabilities. The numerical values p(y = 1|x) given by a logistic regression model are not just scores where a larger score means that the example x is more likely to have label y = 1; they are meaningful conditional probabilities. This implies that given a set of n test examples with numerical predictions v1 to vn, the number of examples in the set that are truly positive will be close to ∑_{i=1}^{n} vi, whatever this sum is.

Last but not least, a third major advantage of logistic regression is that it is not sensitive to unbalanced training data. Even if one class (either the positive or the negative examples) is much larger than the other, logistic regression training encounters no difficulties and the final classifier will still be well-calibrated. The conditional probabilities predicted by the trained classifier will range below and above the base rate, i.e. the unconditional probability p(y = 1).
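To make the model concrete, here is a minimal Python sketch (an illustration with invented coefficients, not code from the notes) that computes the predicted probability p and the corresponding log odds for one example.

    import numpy as np

    def predict_prob(alpha, beta, x):
        """Logistic regression: p(y = 1 | x) = 1 / (1 + exp(-(alpha + beta . x)))."""
        log_odds = alpha + np.dot(beta, x)
        return 1.0 / (1.0 + np.exp(-log_odds))

    alpha = -1.0                        # hypothetical intercept
    beta = np.array([0.8, -0.3, 2.0])   # hypothetical coefficients for d = 3 features
    x = np.array([1.0, 0.5, 0.2])

    p = predict_prob(alpha, beta, x)
    print(p, np.log(p / (1 - p)))       # the log odds equal alpha + beta . x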
2 Stochastic gradient training

2.1 Logistic regression gradient
We shall continue with the special case of logistic regression. Given a single training example that consists of x and y values, the conditional log likelihood is log L(β; x, y) = log p if y = 1 and log L(β; x, y) = log(1 - p) if y = 0. The goal of training is to maximize the conditional log likelihood. So, let us evaluate its partial derivative with respect to each parameter βj. To simplify the following discussion, assume that α = 0 and x0 = 1 for every example x from now on. If y = 1 the partial derivative is

∂/∂βj log p = (1/p) ∂p/∂βj

while if y = 0 it is

∂/∂βj log(1 - p) = -[1/(1 - p)] ∂p/∂βj.
Let e = exp[-∑_j βj xj] where the sum ranges from j = 0 to j = d, so p = 1/(1 + e) and 1 - p = (1 + e - 1)/(1 + e) = e/(1 + e). With this notation we have

∂p/∂βj = (-1)(1 + e)^(-2) · ∂e/∂βj
 = (-1)(1 + e)^(-2) · e · ∂/∂βj [-∑_j βj xj]
 = (-1)(1 + e)^(-2) · (e)(-xj)
 = [1/(1 + e)] · [e/(1 + e)] · xj
 = p(1 - p) xj.

So (∂/∂βj) log p = (1 - p) xj and (∂/∂βj) log(1 - p) = -p xj.

Given training examples ⟨x1, y1⟩ to ⟨xn, yn⟩, the total partial derivative of the log likelihood with respect to βj is

∑_{i: yi=1} (1 - pi) xij + ∑_{i: yi=0} (-pi) xij = ∑_i (yi - pi) xij
where xij is the value of the jth feature of the ith training example. Setting the total partial derivative to zero yields

∑_i yi xij = ∑_i pi xij.

We have one equation of this type for each parameter βj. The equations can be used to check the correctness of a trained model.
2.2 Gradient ascent, one example at a time
There are several sophisticated ways of actually doing the maximization of the total conditional log likelihood, i.e. the conditional log likelihood summed over all training examples ⟨xi, yi⟩. However, here we consider a method called stochastic gradient ascent. This method changes the parameter values to increase the log likelihood based on one example at a time. It is called stochastic because the derivative based on a randomly chosen single example is a random approximation to the true derivative based on all the training data. Consider a single training example ⟨x, y⟩, where again we drop the subscript i for convenience. Consider the jth parameter for 0 ≤ j ≤ d. The partial derivative
of the log likelihood given this single example is

∂/∂βj log L(β; x, y) = (y - p) xj

where y = 1 or y = 0. For each j, we increase the log likelihood incrementally by doing the update

βj := βj + λ (y - p) xj.

Here λ is a multiplier called the learning rate that controls the magnitude of the changes to the parameters.

Stochastic gradient ascent (or descent, for a minimization problem) is a method that is often useful in machine learning. Experience suggests some heuristics for making it work well in practice. The training examples are sorted in random order, and the parameters are updated for each example sequentially. One complete update for every example is called an epoch. Typically, a small constant number of epochs is used, perhaps 3 to 100 epochs.

The learning rate is chosen by trial and error. It can be kept constant across all epochs, e.g. λ = 0.1 or λ = 1, or it can be decreased gradually as a function of the epoch number. Because the learning rate is the same for every parameter, it is useful to scale the features xj so that their magnitudes are similar for all j. Given that the feature x0 has constant value 1, it is reasonable to normalize every other feature to have mean zero and variance one, for example. For the state of the art in guidelines for applying the stochastic gradient idea, see http://leon.bottou.org/projects/sgd.
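The following Python sketch (an illustration under stated assumptions, not code from the notes) implements this update rule for logistic regression on synthetic data, with a learning rate that decreases per epoch. It finishes by checking the equations ∑_i yi xij = ∑_i pi xij from the previous subsection, which should hold approximately after training.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: n examples, d features, plus a constant feature x0 = 1.
    n, d = 1000, 3
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
    beta_true = np.array([-0.5, 1.0, -2.0, 0.5])
    p_true = 1.0 / (1.0 + np.exp(-X @ beta_true))
    y = (rng.random(n) < p_true).astype(float)

    beta = np.zeros(d + 1)
    for epoch in range(20):
        lam = 1.0 / (1.0 + epoch)                # learning rate, decreased each epoch
        for i in rng.permutation(n):             # visit examples in random order
            p = 1.0 / (1.0 + np.exp(-X[i] @ beta))
            beta += lam * (y[i] - p) * X[i]      # beta_j := beta_j + lambda (y - p) x_j

    # Check: sum_i y_i x_ij should be close to sum_i p_i x_ij for every j.
    p_all = 1.0 / (1.0 + np.exp(-X @ beta))
    print(y @ X)
    print(p_all @ X)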
2.3 Properties of stochastic gradient training
Stochastic gradient ascent (or descent) has some properties that are very useful in practice. First, suppose that xj = 0 for most features j of a training example x. Then updating βj based on x can be skipped for those features. This means that the time to do one epoch is O(nfp) where n is the number of training examples, p is the number of features, and f is the average fraction of nonzero feature values per example. If an example x is the bag-of-words representation of a document, then p is the size of the vocabulary but fp is the average length of a document.
Second, suppose that the number n of training examples is very large, as is the case in many modern applications. Then, a stochastic gradient method may converge to good parameter estimates in less than one epoch of training. In contrast, a training method that computes the log likelihood of all data and uses this in the same way regardless of n will be inefficient in how it uses the data.

For each example, a stochastic gradient method updates all parameters once. The dual idea is to update one parameter at a time, based on all examples. This method is called coordinate ascent (or descent). For feature j the update rule is

βj := βj + λ ∑_i (yi - pi) xij.

The update for the whole parameter vector β is

β := β + λ (ȳ - p̄)ᵀ X

where the matrix X is the entire training set, the column vector ȳ consists of the 0/1 labels for every training example, and p̄ is the column vector of predicted probabilities. Often, coordinate ascent converges too slowly to be useful. However, it can be useful to do one update of β after all epochs of stochastic gradient ascent.

Regardless of the method used to train a model, it is important to remember that optimizing the model perfectly on the training data usually does not lead to the best possible performance on test examples. There are several reasons for this:

- The model with best possible performance may not belong to the family of models under consideration. This is an instance of the principle "you cannot learn it if you cannot represent it."
- The training data may not be representative of the test data, i.e. the training and test data may be samples from different populations.
- Fitting the training data as closely as possible may simply be overfitting.
- The objective function for training, namely log likelihood or conditional log likelihood, may not be the desired objective from an application perspective; for example, the desired objective may be classification accuracy.
3 Log-linear models

3.1 The general log-linear model
Let x be an example, and let y be a possible label for it. A log-linear model assumes that

p(y|x; w) = exp [∑_j wj Fj(x, y)] / Z(x, w)     (3.1)

where the partition function Z(x, w) = ∑_{y′} exp ∑_j wj Fj(x, y′). Given x, the label predicted by the model is

ŷ = argmax_y p(y|x; w) = argmax_y ∑_j wj Fj(x, y).

Each expression Fj(x, y) is called a feature-function. Mathematically, log-linear models are very simple: there is one real-valued weight for each feature-function, no more and no fewer.

There are several possible justifications for the form of the expression (3.1). First, a linear combination ∑_j wj Fj(x, y) can take any positive or negative real value; the exponential makes it positive, like a valid probability. Second, the division makes the results lie between 0 and 1, i.e. makes them be valid probabilities. Third, the ranking of the probabilities will be the same as the ranking of the linear values.

A function of the form

bk = exp(ak) / ∑_{k′} exp(a_{k′})

is called a softmax function because the exponentials enlarge the bigger ak values compared to the smaller ak values. Other functions have the same property of being similar to the maximum function, but differentiable. Softmax is widely used now, perhaps because its derivative is especially simple; see Section 4.4 below.
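As an illustration (with hypothetical feature values and weights, not taken from the notes), the following sketch computes p(y|x; w) for a small label set by evaluating the softmax of the linear scores ∑_j wj Fj(x, y).

    import numpy as np

    def log_linear_probs(w, F):
        """F[y][j] holds F_j(x, y) for one fixed example x and each candidate label y.
        Returns p(y | x; w) for every y, computed as a softmax of the linear scores."""
        scores = F @ w                       # score(y) = sum_j w_j F_j(x, y)
        scores -= scores.max()               # subtract the max for numerical stability
        expscores = np.exp(scores)
        return expscores / expscores.sum()   # divide by the partition function Z(x, w)

    w = np.array([1.5, -0.7, 0.3])           # hypothetical weights
    F = np.array([[1.0, 0.0, 1.0],           # feature values for label 0
                  [0.0, 1.0, 1.0],           # label 1
                  [1.0, 1.0, 0.0]])          # label 2
    p = log_linear_probs(w, F)
    print(p, p.sum())                        # probabilities sum to 1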
3.2 Feature functions
In general, a feature-function can be any real-valued function of both the data space X and the label space Y. Formally, a feature-function is any mapping Fj : X × Y → R. Often, a feature-function is zero for all values of y except one particular value. Given some attribute of x, we can have a different weight for this attribute and each different label. The weights for these feature-functions can then capture the affinity of this attribute-value for each label.

Often, feature-functions are presence/absence indicators, so the value of the feature-function is either 0 or 1. If we have a conventional attribute a(x) with k alternative values, and n classes, we can make kn different features as defined above.

With log-linear models, anything and the kitchen sink can be a feature. We can have lots of classes, lots of features, and we can pay attention to different features for different classes. Feature-functions can overlap in arbitrary ways. For example, if x is a word, different feature-functions can use attributes of x such as "starts with a capital letter", "starts with G", "is Graham", or "is six letters long". Generally we can encode suffixes, prefixes, facts from a lexicon, preceding/following punctuation, etc., as features.
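A minimal sketch of the kn indicator features just described, under assumed attribute and label sets (the names here are invented for illustration):

    # Attribute a(x) with k alternative values, and n classes: one indicator
    # feature-function per (attribute value, label) pair, i.e. kn features.
    attribute_values = ["capitalized", "lowercase", "numeric"]   # k = 3, hypothetical
    labels = ["NOUN", "VERB"]                                    # n = 2, hypothetical

    def make_indicator(value, label):
        # F_j(x, y) = 1 if a(x) = value and y = label, else 0
        return lambda a_of_x, y: 1.0 if (a_of_x == value and y == label) else 0.0

    feature_functions = [make_indicator(v, c) for v in attribute_values for c in labels]
    print(len(feature_functions))                       # 6 = kn feature-functions
    print(feature_functions[0]("capitalized", "NOUN"))  # 1.0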
4 Conditional random fields

4.1 A typical CRF application
To begin, consider an example of a learning task for which a CRF is useful. Given a sentence, the task is to tag each word as noun, verb, adjective, preposition, etc. There is a fixed known set of these part-of-speech (POS) tags. Each sentence is a separate training or test example. We will represent a sentence by feature-functions based on its words. Feature-functions can be very varied: some feature-functions can be position-specific, e.g. refer to the beginning or to the end of a sentence, while others can be sums over all positions in a sentence. Some feature-functions can look just at one word, e.g. at its prefixes or suffixes. Some features can also use the words one to the left, one to the right, two to the left, etc., up to the whole sentence.
The highest-accuracy POS taggers currently use over 100,000 feature-functions. An important restriction (that will be explained and justified below) is that each feature-function can depend on only one tag, or on two neighboring tags.

POS tagging is an example of what is called a structured prediction task. The goal is to predict a complex label (a sequence of POS tags) for a complex input (an entire sentence). This task is difficult, and significantly different from a standard classifier learning task. There are at least three important sources of difficulty. First, too much information would be lost by learning just a per-word classifier. Influences between neighboring tags must be taken into account. Second, different sentences have different lengths, so it is not obvious how to represent all sentences by vectors of the same fixed length. Third, the set of all possible sequences of tags constitutes an exponentially large set of labels.

A linear conditional random field is a way to apply a log-linear model to this type of task. Use the bar notation for sequences, so x̄ means a sequence of variable length. Specifically, let x̄ be a sequence of n words and let ȳ be a corresponding sequence of n tags. Define the log-linear model

p(ȳ|x̄; w) = [1/Z(x̄, w)] exp ∑_j wj Fj(x̄, ȳ).

Assume that each feature-function Fj is actually a sum along the sentence, for i = 1 to i = n where n is the length of x̄:

Fj(x̄, ȳ) = ∑_i fj(y_{i-1}, yi, x̄, i).

This notation means that each low-level feature-function fj can depend on the whole sentence, the current tag and the previous tag, and the current position i within the sentence. A feature-function fj may depend on only a subset of these four possible influences. Examples of features are "the current tag is NOUN and the current word is capitalized" and "the word at the start of the sentence is Mr. and the previous tag was SALUTATION".
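A sketch of two such low-level feature-functions in Python (with hypothetical helper signatures; each fj takes the previous tag, the current tag, the word sequence xbar, and the position i):

    def f_noun_capitalized(prev_tag, tag, xbar, i):
        # 1 if the current tag is NOUN and the current word starts with a capital letter
        return 1.0 if tag == "NOUN" and xbar[i][0].isupper() else 0.0

    def f_mr_after_salutation(prev_tag, tag, xbar, i):
        # 1 if the sentence starts with "Mr." and the previous tag was SALUTATION
        return 1.0 if xbar[0] == "Mr." and prev_tag == "SALUTATION" else 0.0

    xbar = ["Mr.", "Graham", "arrived"]
    print(f_noun_capitalized("SALUTATION", "NOUN", xbar, 1))     # 1.0
    print(f_mr_after_salutation("SALUTATION", "NOUN", xbar, 1))  # 1.0
    # The high-level F_j(xbar, ybar) of Section 4.2 is the sum of f_j over all positions i.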
4.2 Linear-chain CRFs in general
Summing each fj over all positions i means that we can have a fixed set of feature-functions Fj for log-linear training, even though the training examples are not fixed-length.
Training a CRF means finding the weight vector w that gives the best possible prediction

ŷ = argmax_ȳ p(ȳ|x̄; w)     (4.1)

for each training example x̄. However, before we can talk about training there are two major inference problems to solve. First, how can we do the argmax computation in Equation (4.1) efficiently, for any x̄ and any weights w? This computation is difficult since the number of alternative tag sequences ȳ is exponential. Second, given any x̄ and ȳ we want to evaluate

p(ȳ|x̄; w) = [1/Z(x̄, w)] exp ∑_j wj Fj(x̄, ȳ).
The difficulty here is that the denominator again ranges over all tag sequences ȳ:

Z(x̄, w) = ∑_ȳ exp ∑_j wj Fj(x̄, ȳ).

For both these tasks, we will need tricks to account for all possible ȳ efficiently, without enumerating all possible ȳ. The fact that feature-functions can depend on at most two tags, which must be adjacent, makes these tricks exist. The next section explains how to solve the two inference problems just described, and then the following section explains how to do training via gradient following.

An issue that is the topic of considerable research is the question of which objective function to maximize during training. Often, the objective function used for training is not exactly the function that we really want to maximize on test data. Traditionally we maximize the conditional log likelihood (CLL) on the training data. However, instead of maximizing CLL we could maximize yes/no accuracy of the entire predicted ȳ, or pointwise conditional log likelihood, or we could minimize mean-squared error if tags are numerical, or some other measure of distance between true and predicted tags. A fundamental question is whether we want to maximize a pointwise objective. For a long sequence, we may have a vanishing chance of predicting the entire tag sequence correctly. The single sequence with highest probability may be very different from the most probable tag at each position.
4.3 Inference algorithms for linear-chain CRFs
Let's solve the first problem above efficiently. First note that we can ignore the denominator, and also the exponential inside the numerator. We want to compute

ŷ = argmax_ȳ p(ȳ|x̄; w) = argmax_ȳ ∑_j wj Fj(x̄, ȳ)
  = argmax_ȳ ∑_j wj ∑_i fj(y_{i-1}, yi, x̄, i)
  = argmax_ȳ ∑_i gi(y_{i-1}, yi)
where gi(y_{i-1}, yi) = ∑_j wj fj(y_{i-1}, yi, x̄, i). Note that the x̄ and i arguments of fj have been dropped in the definition of gi. Each gi is a different function for each i, and depends on w as well as on x̄ and i. Remember that each entry of the ȳ vector is one of a finite set of tags.

Given x̄, w, and i, the function gi can be represented as an m by m matrix where m is the cardinality of the set of tags. Let v range over the tags. Define U(k, v) to be the score of the best sequence of tags from position 1 to position k, where tag number k is required to be v. This is a maximization over k - 1 tags because tag number k is fixed to have value v. Formally,

U(k, v) = max_{y1,...,y_{k-1}} [ ∑_{i=1}^{k-1} gi(y_{i-1}, yi) + gk(y_{k-1}, v) ].
Now we can write down a recurrence that lets us compute U(k, v) efficiently:

U(k, v) = max_{y_{k-1}} [ U(k - 1, y_{k-1}) + gk(y_{k-1}, v) ].
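The following Python sketch (an illustration with assumed inputs, not code from the notes) implements this recurrence. It takes the scores gi as a list of m by m arrays, treats U(1, v) = g1(START, v) as the base case, and recovers the best tag sequence with back-pointers.

    import numpy as np

    def viterbi(g):
        """g[i] is an m-by-m array with g[i][u, v] = g_{i+1}(u, v); only the row
        u = START (taken to be index 0 here) of g[0] is actually used."""
        n = len(g)
        m = g[0].shape[0]
        U = np.empty((n, m))           # U[k, v] = best score of tags 1..k+1 ending in v
        back = np.zeros((n, m), dtype=int)
        U[0] = g[0][0]                 # base case: U(1, v) = g_1(START, v)
        for k in range(1, n):
            scores = U[k - 1][:, None] + g[k]   # scores[u, v] = U(k, u) + g_{k+1}(u, v)
            back[k] = scores.argmax(axis=0)
            U[k] = scores.max(axis=0)
        # Trace back the best sequence of tags.
        tags = [int(U[n - 1].argmax())]
        for k in range(n - 1, 0, -1):
            tags.append(int(back[k][tags[-1]]))
        return list(reversed(tags)), float(U[n - 1].max())

    rng = np.random.default_rng(0)
    g = [rng.normal(size=(4, 4)) for _ in range(5)]   # made-up scores: 5 positions, 4 tags
    print(viterbi(g))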
With this recurrence we can compute ŷ for any x̄ in O(m²n) time, where n is the length of x̄ and m is the cardinality of the set of tags. This algorithm is a variation of the Viterbi algorithm for computing the highest-probability path through a hidden Markov model. The base case of the recurrence is an exercise for the reader.

The second fundamental computational problem is to compute the denominator of the probability formula. This denominator is called the partition function:

Z(x̄, w) = ∑_ȳ exp ∑_j wj Fj(x̄, ȳ).
Remember that

∑_j wj Fj(x̄, ȳ) = ∑_i gi(y_{i-1}, yi),

where i ranges over all positions 1 to n of the input sequence x̄, so we can write

Z(x̄, w) = ∑_ȳ exp ∑_i gi(y_{i-1}, yi) = ∑_ȳ ∏_i exp gi(y_{i-1}, yi).
We can compute the expression above efficiently by matrix multiplication. For t = 1 to t = n + 1 let Mt be a square m by m matrix such that Mt(u, v) = exp gt(u, v) for any two tag values u and v. Note that M2 to Mn are fully defined, while M1(u, v) is defined only for u = START and Mn+1(u, v) is defined only for v = STOP. Consider multiplying M1 and M2. (Note on notation: in the next few equations u, v, w, and x are all single tags; w is not a weight and x is not a component of x̄.) We have

M12(START, w) = ∑_v M1(START, v) M2(v, w)

and, multiplying by M3,

M123(START, x) = ∑_w M12(START, w) M3(w, x) = ∑_{v,w} M1(START, v) M2(v, w) M3(w, x),

and so on. Continuing in this way shows that the ⟨START, STOP⟩ entry of the entire product M123...n+1 is

T = M123...n+1(START, STOP) = ∑_ȳ ∏_i exp gi(y_{i-1}, yi) = Z(x̄, w),
which is exactly what we need.

Computational complexity: each matrix is m by m where m is the cardinality of the tag set. Each matrix multiplication requires O(m³) time, so the total time is O(nm³). We have reduced a sum over an exponential number of alternatives to a polynomial-time computation. However, even though polynomial, this is worse than the time needed by the Viterbi algorithm. An interesting question is whether computing the partition function is harder in some fundamental way than computing the most likely label sequence.

The matrix multiplication method for computing the partition function is called a forward-backward algorithm. A similar algorithm can be used to compute any function of the form ∑_ȳ ∏_i hi(y_{i-1}, yi).

Some extensions to the basic linear-chain CRF are not difficult. The output ȳ must be a sequence, but the input x̄ is treated as a unit, so it does not have to be a sequence. It could be an image for example, or a collection of separate items, e.g. telephone customers. In general, what is fundamental for making a log-linear model tractable is that the set of possible labels ȳ should either be small, or have some structure. In order to have structure, ȳ should be made up of parts (e.g. tags) such that only small subsets of parts interact directly with each other. Here, every interacting subset of tags is a pair. Often, the real-world reason interacting subsets are small is that interactions between parts are short-distance.
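To make the matrix-multiplication computation of Z(x̄, w) concrete, here is a minimal sketch (illustrative only; it reuses the list of score matrices g from the Viterbi sketch above, with row 0 of g[0] playing the role of the START row, and checks the result against brute-force enumeration on a tiny example).

    import itertools
    import numpy as np

    def partition_function(g):
        """Returns Z = sum over all tag sequences of exp(sum_i g_i(y_{i-1}, y_i))."""
        alpha = np.exp(g[0][0])               # forward vector after position 1
        for k in range(1, len(g)):
            alpha = alpha @ np.exp(g[k])      # one matrix-vector product per position
        return alpha.sum()                    # summing plays the role of the STOP column

    rng = np.random.default_rng(0)
    g = [rng.normal(size=(3, 3)) for _ in range(4)]   # made-up scores: 4 positions, 3 tags

    # Brute-force check: enumerate all 3^4 tag sequences.
    m, n = 3, 4
    brute = 0.0
    for ybar in itertools.product(range(m), repeat=n):
        prev, total = 0, 0.0                  # prev = 0 is the START index at position 1
        for i, y in enumerate(ybar):
            total += g[i][prev, y]
            prev = y
        brute += np.exp(total)
    print(partition_function(g), brute)       # the two numbers agree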
4.4 Training CRFs by stochastic gradient ascent
The learning task for a log-linear model is to choose values for the weights (also called parameters). Given a set of training examples, we assume now that the goal is to choose parameter values wj that maximize the conditional probability of the training examples. In other words, the objective function for training is the conditional log-likelihood (CLL) of the set of training examples. Since we want to maximize the CLL, we do gradient ascent as opposed to descent. For online gradient ascent (also called stochastic gradient ascent) we update parameters based on single training examples. Therefore, we evaluate the partial derivative of the CLL for a single training example, for each wj. (There is one weight for each feature-function, so we use j to range over weights.) Start with

∂/∂wj log p(y|x; w) = Fj(x, y) - ∂/∂wj log Z(x, w)
 = Fj(x, y) - [1/Z(x, w)] ∂/∂wj Z(x, w)
 = Fj(x, y) - [1/Z(x, w)] ∑_{y′} ∂/∂wj exp ∑_{j′} w_{j′} F_{j′}(x, y′)
 = Fj(x, y) - [1/Z(x, w)] ∑_{y′} [exp ∑_{j′} w_{j′} F_{j′}(x, y′)] Fj(x, y′)
 = Fj(x, y) - ∑_{y′} Fj(x, y′) p(y′|x; w)
 = Fj(x, y) - E_{y′∼p(y′|x;w)}[Fj(x, y′)].

In words, the partial derivative with respect to weight number j is the value of feature-function j for the true training label y, minus the average value of the feature-function over all possible labels y′. Note that this derivation allows feature-functions to be real-valued, not just zero or one.

The gradient of the CLL given the entire training set T is the sum of the gradients for each training example. At the global maximum this entire gradient is zero, so we have

∑_{⟨x,y⟩∈T} Fj(x, y) = ∑_{⟨x,y⟩∈T} E_{y′∼p(y′|x;w)}[Fj(x, y′)].
This equality is true only for the whole training set, not for training examples individually. The left side above is the total value of feature-function j on the whole training set. The right side is the total value of feature-function j predicted by the model. For each feature-function, the trained model will spread out over all labels of all examples as much mass as the training data has just on those examples for which the feature-function is nonzero.

For any particular application of log-linear modeling, we have to write code to evaluate numerically the symbolic derivatives. Then we can invoke an optimization routine to find the optimal parameter values. There are two ways that we can verify correctness. First, check for each feature-function Fj that

∑_{⟨x,y⟩∈T} Fj(x, y) = ∑_{⟨x,y⟩∈T} ∑_{y′} Fj(x, y′) p(y′|x; w).

Second, check that each partial derivative is correct by comparing it numerically to the value obtained by finite differencing of the CLL objective function.
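A generic sketch of the second check (finite differencing), with a tiny made-up log-linear model so that it is self-contained; the function names, features, and tolerance here are placeholders.

    import numpy as np

    def check_gradient(cll, gradient, w, eps=1e-6):
        """Compare each analytic partial derivative with a central finite difference."""
        analytic = gradient(w)
        for j in range(len(w)):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[j] += eps
            w_minus[j] -= eps
            numeric = (cll(w_plus) - cll(w_minus)) / (2 * eps)
            assert abs(numeric - analytic[j]) < 1e-4, (j, numeric, analytic[j])

    # Tiny example: a log-linear model over 3 labels with fixed feature values F[y][j].
    F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y_true = 2
    def cll(w):
        s = F @ w
        return s[y_true] - np.log(np.exp(s).sum())
    def gradient(w):
        p = np.exp(F @ w); p /= p.sum()
        return F[y_true] - p @ F        # F_j(x, y) minus the expected feature value

    check_gradient(cll, gradient, np.array([0.3, -0.2]))
    print("gradient check passed")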
Suppose that every feature-function Fj is the product of an attribute value aj(x) that is a function of x only, and a label function bj(y) that is a function of y only, i.e. Fj(x, y) = aj(x) bj(y). Then (∂/∂wj) log p(y|x; w) = 0 if aj(x) = 0, regardless of y. This implies that, given example x, with online gradient ascent the weight for a feature-function must be updated only for feature-functions for which the corresponding attribute aj(x) is non-zero, which can be a great saving of computational effort. In other words, the entire gradient with respect to a single training example is typically a sparse vector, just like the vector of all Fj(x, y) values is sparse for a single training example.

A similar saving is possible when computing the gradient with respect to the whole training set. Note that the gradient with respect to the whole training set is a single vector that is the sum of one vector for each training example. Typically these vectors being summed are sparse, but their sum is not.

When maximizing the conditional log-likelihood by online gradient ascent, the update to weight wj is

wj := wj + λ (Fj(x, y) - E_{y′∼p(y′|x;w)}[Fj(x, y′)])     (4.2)

where λ is a learning rate parameter.
5 Alternative CRF training methods

Recall that the partial derivative of the conditional log-likelihood for a single training example ⟨x̄, ȳ⟩ is

∂/∂wj log p(ȳ|x̄; w) = Fj(x̄, ȳ) - ∑_{ȳ′} Fj(x̄, ȳ′) p(ȳ′|x̄; w)
 = Fj(x̄, ȳ) - ∑_{ȳ′} Fj(x̄, ȳ′) · exp[∑_j wj Fj(x̄, ȳ′)] / Z(x̄, w).
The first term Fj(x̄, ȳ) is fast to compute because x̄ and its training label ȳ are fixed. Section 4.3 above shows how to compute Z(x̄, w) efficiently. The remaining difficulty is to compute ∑_{ȳ′} Fj(x̄, ȳ′) exp ∑_j wj Fj(x̄, ȳ′).

If the set of alternative labels {ȳ′} is large, then it is computationally expensive to evaluate the expectation E_{ȳ′∼p(ȳ′|x̄;w)}[Fj(x̄, ȳ′)]. We can find approximations to this expectation by finding approximations to the distribution p(ȳ|x̄; w). Each section below describes a method based on a different approximation.
5.1 The Collins perceptron
Suppose we place all the probability mass on the most likely y value, i.e. we use the approximation p(y|x; w) ≈ I(y = ŷ) where ŷ = argmax_y p(y|x; w) as before.
Then the update rule (4.2) at the end of the previous section simplifies to the following pair of updates:

wj := wj + λ Fj(x, y)
wj := wj - λ Fj(x, ŷ).

Given a training example x, the label ŷ can be thought of as an impostor compared to the genuine label y. The concept to be learned is those vectors of feature-function values ⟨F1(x, y), . . .⟩ that correspond to correct ⟨x, y⟩ pairs. The vector ⟨F1(x, y), . . .⟩, where ⟨x, y⟩ is a training example, is a positive example of this concept. The vector ⟨F1(x, ŷ), . . .⟩ is a negative example of the same concept. Hence, the two updates above are perceptron updates: the first for a positive example and the second for a negative example. The perceptron method causes a net increase in wj for features Fj whose value is higher for y than for ŷ. It thus modifies the weights to directly increase the probability of y compared to the probability of ŷ.
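A sketch of one Collins-perceptron pass in Python (illustrative; it assumes helper functions feature_vector(x, y) returning the vector of all Fj(x, y) and predict(w, x) returning the highest-scoring label, for instance via the Viterbi sketch above for linear-chain models). A tiny multiclass demonstration follows.

    import numpy as np

    def perceptron_epoch(w, training_set, feature_vector, predict, lam=1.0):
        """One pass of the Collins perceptron: promote the features of the true label
        and demote the features of the predicted label."""
        for x, y in training_set:
            y_hat = predict(w, x)                       # most likely label under current w
            if y_hat != y:                              # updates cancel when the prediction is right
                w = w + lam * feature_vector(x, y)      # positive example update
                w = w - lam * feature_vector(x, y_hat)  # negative example update
        return w

    labels = [0, 1, 2]
    def feature_vector(x, y):
        return np.array([x[0] if y == 0 else 0.0,
                         x[1] if y == 1 else 0.0,
                         1.0 if y == 2 else 0.0])
    def predict(w, x):
        return max(labels, key=lambda y: w @ feature_vector(x, y))

    data = [(np.array([1.0, 0.0]), 0), (np.array([0.0, 1.0]), 1)]
    w = np.zeros(3)
    for _ in range(5):
        w = perceptron_epoch(w, data, feature_vector, predict)
    print(w, [predict(w, x) for x, y in data])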
5.2 Gibbs sampling
Computing the most likely label ŷ does not require computing the partition function Z(x, w). Nevertheless, sometimes identifying ŷ is still too difficult. In this case one option for training is to estimate E_{y∼p(y|x;w)}[Fj(x, y)] approximately by sampling y values from the distribution p(y|x; w). A method known as Gibbs sampling can be used to find the needed samples of y.

Gibbs sampling is the following algorithm. Suppose the entire label y can be written as a set of parts y = {y1, . . . , yn}. For example, if y is the part-of-speech sequence that is the label of an input sentence x, then each yi can be the tag of one word in the sentence. Suppose the marginal distribution p(yi | x, y1, . . . , y_{i-1}, y_{i+1}, . . . , yn; w) can be evaluated numerically in an efficient way for every i. Then we can get a stream of samples by the following process (a code sketch is given at the end of this section):

(1) Select an arbitrary initial guess y1, . . . , yn.
(2) Draw y1′ according to p(y1 | x, y2, . . . , yn; w); draw y2′ according to p(y2 | x, y1′, y3, . . . , yn; w); draw y3′ according to p(y3 | x, y1′, y2′, y4, . . . , yn; w); and so on until yn′.
(3) Set {y1, . . . , yn} := {y1′, . . . , yn′} and repeat from (2).

It can be proved that if Step (2) is repeated an infinite number of times, then the distribution of y = {y1, . . . , yn} converges to the true distribution p(y|x; w) regardless of the starting point. In practice, we do Step (2) some number of times (say 1000) to come close to convergence, and then take several samples y = {y1, . . . , yn}. Between each sample we repeat Step (2) a smaller number of times (say 100) to make the samples almost independent of each other.

Using Gibbs sampling to estimate the expectation E_{y∼p(y|x;w)}[Fj(x, y)] is computationally intensive because the accuracy of the estimate increases only very slowly as the number s of samples increases. Specifically, the variance decreases proportionally to 1/s.

Gibbs sampling relies on drawing samples efficiently from marginal distributions. Let y_{-i} be an abbreviation for the set {y1, . . . , y_{i-1}, y_{i+1}, . . . , yn}. We need to draw values according to the distribution p(yi | x, y_{-i}; w). The straightforward way to do this is to evaluate p(v | x, y_{-i}; w) numerically for each possible value v of yi. In typical applications the number of alternative values v is small, so this approach is feasible, provided that p(v | x, y_{-i}; w) can be computed. Suppose the entire conditional distribution is a Markov random field
p(y|x; w) ∝ ∏_{m=1}^{M} φm(y^m | x; w)     (5.1)
where each φm is a potential function that depends on just a subset y^m of the components of y. Linear-chain conditional random fields are a special case of Equation (5.1). In this case

p(yi | x, y_{-i}; w) ∝ ∏_{m∈C} φm(y^m | x; w)     (5.2)
where C indexes those potential functions whose subsets y^m include the part yi. To compute p(yi | x, y_{-i}; w) we evaluate the product (5.2) for all values of yi, with the given fixed values of y_{-i} = {y1, . . . , y_{i-1}, y_{i+1}, . . . , yn}. We then normalize using

Z(x, y_{-i}; w) = ∑_v ∏_{m∈C} φm(y^m | x; w)

where in the product yi is set to v.
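Here is the promised sketch of Gibbs sampling for a label made of n parts, each taking one of m values (illustrative; conditional_dist is a placeholder for code that evaluates p(yi = v | x, y_{-i}; w) for every v, for example by normalizing the product (5.2)).

    import numpy as np

    def gibbs_samples(conditional_dist, n, m, burn_in=1000, thin=100, num_samples=10, seed=0):
        """conditional_dist(i, y) returns the length-m vector of probabilities
        p(y_i = v | x, y_{-i}; w); x and w are assumed to be fixed inside it."""
        rng = np.random.default_rng(seed)
        y = list(rng.integers(0, m, size=n))           # (1) arbitrary initial guess
        samples = []
        for sweep in range(burn_in + thin * num_samples):
            for i in range(n):                         # (2) resample each part in turn
                y[i] = rng.choice(m, p=conditional_dist(i, y))
            if sweep >= burn_in and (sweep - burn_in) % thin == 0:
                samples.append(list(y))                # (3) keep occasional samples
        return samples

    # Toy placeholder: parts are independent and uniform, just to show the interface.
    samples = gibbs_samples(lambda i, y: np.ones(3) / 3, n=4, m=3, burn_in=10, thin=5)
    print(samples)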
5.3 Contrastive divergence
A third training option is to choose a single value y′ that is somehow similar to the training label y, but also has high probability according to p(y|x; w). Compared to the impostor ŷ, the "evil twin" y′ will have lower probability, but will be more similar to y. The idea of contrastive divergence is to obtain a single value y′ = {y1′, . . . , yn′} by doing only a few iterations of Gibbs sampling (often only one), but starting at the training label y instead of at a random guess.
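A brief sketch of the contrastive divergence update built on the Gibbs sweep above (illustrative; feature_vector(x, y) and conditional_dist are the same assumed helpers as before, and lam is the learning rate).

    import numpy as np

    def contrastive_divergence_update(w, x, y, feature_vector, conditional_dist, m,
                                      k=1, lam=0.1, seed=0):
        """Approximate E[F_j] with a single y' obtained by k Gibbs sweeps started at y."""
        rng = np.random.default_rng(seed)
        y_prime = list(y)                     # start at the training label, not at random
        for _ in range(k):
            for i in range(len(y_prime)):
                y_prime[i] = rng.choice(m, p=conditional_dist(i, y_prime))
        # Gradient step: true-label features up, "evil twin" features down.
        return w + lam * (feature_vector(x, y) - feature_vector(x, y_prime))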
6 Tutorials and selected papers

Bibliographies on CRFs have been compiled by Rahul Gupta and Hanna Wallach. The following papers may be particularly interesting or useful. They are listed in approximate chronological order. Note that several are on topics related to CRFs, not on CRFs directly.

1. Michael Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 1-8, 2002.

2. Sham Kakade, Yee Whye Teh, and Sam T. Roweis. An alternate objective function for Markovian fields. In Proceedings of the 19th International Conference on Machine Learning (ICML), 2002.

3. Andrew McCallum. Efficiently inducing features of conditional random fields. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI-2003), 2003.

4. Sanjiv Kumar and Martial Hebert. Discriminative random fields: A discriminative framework for contextual interaction in classification. In Proceedings of the Ninth IEEE International Conference on Computer Vision, 2003.

5. Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural Information Processing Systems 16 (NIPS), December 2003.

6. Thomas G. Dietterich, Adam Ashenfelter, and Yaroslav Bulatov. Training conditional random fields via gradient tree boosting. In Proceedings of the 21st International Conference on Machine Learning (ICML), 2004.

7. Vladimir Kolmogorov and Ramin Zabih. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, February 2004.

8. Charles Sutton and Andrew McCallum. Collective segmentation and labeling of distant entities in information extraction. ICML Workshop on Statistical Relational Learning, 2004.

9. Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, December 2005.
10. Hal Daumé III, John Langford, and Daniel Marcu. Search-based structured prediction. Submitted for publication, 2006.

11. Samuel Gross, Olga Russakovsky, Chuong Do, and Serafim Batzoglou. Training conditional random fields for maximum labelwise accuracy. In Advances in Neural Information Processing Systems 19 (NIPS), December 2006.