4.3 Iterative Minimization

Iterative minimization involves the notions of gradient and Hessian. A vector function $f(x)$ can be expressed around a vector $a$ as

$f(x) = f(a) + (x - a)^T \nabla f(a) + \tfrac{1}{2}(x - a)^T \nabla^2 f(a)(x - a) + \cdots$ (31)

A learning rule can be shown to converge to an optimum if it diminishes the value of the error function at each iteration. When the gradient of the error function can be evaluated, the gradient technique (or steepest descent) adjusts the weight vector by moving it in the direction opposite to the gradient of the error function. Formally, the correction for the (n+1)-th iteration is

$w_{[n+1]} = w_{[n]} + \Delta w = w_{[n]} - \eta \nabla f(w)$ (32)

(where $\nabla f(w)$ is computed for $w_{[n]}$ and $\eta$ is a small positive learning constant).

As an example, let us show that for a linear heteroassociator the Widrow-Hoff learning rule iteratively minimizes the squared error between target and output. The error function is

$e^2 = (t - o)^2 = t^2 + o^2 - 2to = t^2 + x^T w w^T x - 2t w^T x$ (33)

The gradient of the error function is

$\frac{\partial e^2}{\partial w} = 2(w^T x)x - 2tx = -2(t - w^T x)x$ (34)

The weight vector is corrected by moving it in the direction opposite to the gradient. This is obtained by adding a small vector, denoted $\Delta w$, opposite to the gradient. This gives the following correction for iteration n+1:

$w_{[n+1]} = w_{[n]} + \Delta w = w_{[n]} - \eta \frac{\partial e^2}{\partial w} = w_{[n]} + \eta (t - w^T x) x = w_{[n]} + \eta (t - o) x$ (35)

This gives the rule defined by Eqn. (9). The gradient method works because the gradient of $w_{[n]}$ is a first-order Taylor approximation of the gradient of the optimal weight vector $w$. It is a favorite technique in neural networks because the popular error backpropagation is a gradient technique. Newton's method is a second-order Taylor approximation: it uses the inverse of the Hessian of $w$ (supposing it exists). It gives a better numerical approximation but necessitates more computation. Here the correction for iteration n+1 is

$w_{[n+1]} = w_{[n]} + \Delta w = w_{[n]} - \eta H^{-1} \nabla f(w)$ (36)

(where $\nabla f(w)$ is computed for $w_{[n]}$).
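To make Eqns. (32)-(36) concrete, here is a minimal numerical sketch of the two update rules for a linear heteroassociator. The input pattern, target, learning constant, and the use of a pseudo-inverse for the (here singular) Hessian are illustrative choices of mine, not part of the original rules.

```python
import numpy as np

def widrow_hoff_step(w, x, t, eta=0.1):
    """Gradient step of Eqn. (35): w <- w + eta * (t - o) * x, with o = w'x."""
    o = w @ x                        # output of the linear heteroassociator
    return w + eta * (t - o) * x     # move opposite to the gradient of e^2

def newton_step(w, x, t):
    """Newton step of Eqn. (36) with eta = 1: w <- w - H^{-1} grad(e^2).

    For the squared error (33) the Hessian is H = 2 x x', which is singular
    for a single input pattern, so a pseudo-inverse stands in for H^{-1}
    (an illustrative choice, not part of the original rule)."""
    grad = -2 * (t - w @ x) * x      # gradient of e^2, Eqn. (34)
    H = 2 * np.outer(x, x)
    return w - np.linalg.pinv(H) @ grad

x = np.array([1.0, 0.5, -0.3, 0.8, 0.2])   # illustrative input pattern
t = 1.5                                    # illustrative target
w = np.zeros(5)
for _ in range(50):                        # Eqn. (35), repeated
    w = widrow_hoff_step(w, x, t)
print(abs(t - w @ x))                      # error driven near 0, gradually
print(abs(t - newton_step(np.zeros(5), x, t) @ x))  # one Newton step: ~0
```

As the last two lines show, the gradient rule shrinks the error geometrically over many cheap iterations, while a single (more expensive) Newton step solves this quadratic error exactly.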
See also: Artificial Neural Networks: Neurocomputation; Hebb, Donald Olding (1904-85); Neural Networks, Statistical Physics of; Perceptrons; Statistical Pattern Recognition
H. Abdi
Linear and nonlinear programming are the backbone of the theoretical side of the area. On the practical side, the ability of these methods to solve very large problems (i.e., with a large number of variables) has allowed the modeling of highly realistic and detailed real-life situations, so that nowadays these methods are routinely applied to the day-to-day execution of complex tasks in a wide range of activities, like oil refineries, power stations, airlines, and many others.

At that time, the mathematization of economics was at a rather initial stage, and most mathematical results offered only new qualitative insights, rather than effective procedures for finding numerical solutions of economic problems, because of the complicated, in general nonlinear, nature of the modeling tools. LP and SIMPLEX allowed actual computation of solutions for any problem where an economic agent (a firm, for instance) has to choose the most efficient among many possible courses of action, provided that all the relevant variables can be connected through linear relations, such as the input-output matrices proposed in the 1930s by V. Leontief, which enter the LP formulation as the matrix A of Eqn. (5) below. This situation gave rise to a new area of economic research, called activity analysis (Koopmans 1951, Gale 1960), which involved many of the most distinguished economists of the time, such as T. Koopmans, K. Arrow, D. Gale, and others.

The influence of LP on economics was reinforced by the introduction of the duality theory of LP. Shortly after Dantzig obtained his first results on LP, he showed them to one of the leading mathematicians of the time, von Neumann, who was then developing another new branch of applied mathematics, namely game theory (Von Neumann and Morgenstern 1944). Observing immediately that a certain class of games fitted the LP format, von Neumann suggested the extension to LP of a game-theoretic result, giving birth to LP duality, which, given an LP problem (e.g., Eqns. (2)-(3) below), associates with it another problem, called the dual, which shares exactly the same data c, b, and A, but arranged in a different way. Duality plays a very significant role in economic models, where the variables of the dual problem, called shadow prices, have an interesting economic interpretation. The Dantzig-Wolfe decomposition method (Dantzig and Wolfe 1960), a SIMPLEX-based method for solving certain special instances of LP problems which could also be seen as a model of a central planning procedure combined with autonomous decisions by lower-level agents, reinforced this impact of LP on economics, which produced perhaps, as a negative side effect, exaggerated attempts to force the modeling of nonlinear economic phenomena as linear ones.

The impact of LP on mathematics was also significant, and contributed to reinforcing the status of the new computational mathematics vis-a-vis the classical branches (e.g., analysis, geometry, algebra). Its success resulted from its ability to offer SIMPLEX as an efficient computational procedure and the LP duality theory as a theoretical counterpart. It should be mentioned that the mathematical tools required by LP theory are rather elementary, and the theory could easily have been developed more than a century before it appeared. In fact, for the cases in which no inequalities are present (e.g., Eqns. (4)-(5) without Eqn. (6), but with a nonlinear f instead of cx), the theory for the resulting equality-constrained
minimization problem had been started by Lagrange at the beginning of the nineteenth century, and thousands of papers had been written on it, while only a handful of them had been devoted to the subject of linear inequalities, like Eqn. (3) or Eqn. (6), before 1945 (Dantzig 1963). A very likely reason for this oversight lies in the fact that the motivation for the study of minimization problems had been, for centuries, mainly research in physics, which studies systems that move along deterministic trajectories, usually expressed by equalities. Inequalities, on the other hand, are typical of economic situations, where agents are constantly confronting options for possible action rather than following fixed trajectories determined by initial conditions. The low level of formalization of economics up to the 1940s, combined with the impossibility of developing numerical methods at that time, in the absence of computers, kept the field of linear inequalities outside mainstream mathematics for a century and a half.

The LP problem consists of finding an n-vector $x^* = (x_1^*, \ldots, x_n^*)$, belonging to $F$, which minimizes $f(x) = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$ in $F$, where $F$ is the set of n-vectors $x = (x_1, \ldots, x_n)$ satisfying the m linear inequalities

$a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{in} x_n \le b_i \quad (1 \le i \le m)$ (1)

This means that the m inequalities, Eqn. (1), are satisfied with $x_i^*$ in place of $x_i$ ($1 \le i \le n$), and that $c_1 x_1^* + c_2 x_2^* + \cdots + c_n x_n^* \le c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$ for every n-vector $x = (x_1, \ldots, x_n)$ which also satisfies the m inequalities in Eqn. (1). In short mathematical notation, this is written as

$\min\ cx$ (2)

$\text{s.t. } Ax \le b$ (3)
An LP problem presented as Eqns. (2)-(3) is said to be in canonical form. There are other forms which are equivalent to it, in the sense that a solution of the canonical form allows one to obtain, in an immediate way, a solution of the equivalent form, and conversely. One such form consists of inverting all the inequalities in Eqn. (1) ($\ge$ instead of $\le$). The most important equivalent form is the standard one, where the unknowns are required to be nonnegative ($x_1 \ge 0, \ldots, x_n \ge 0$) and the inequalities in Eqn. (1) are changed into equalities, which is denoted in short notation as

$\min\ cx$ (4)

$\text{s.t. } Ax = b$ (5)

$x \ge 0$ (6)
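The equivalence can be made concrete in a few lines. The sketch below (a minimal illustration, with made-up data and variable names of my own) converts a canonical-form problem into standard form: each unrestricted variable is split into a difference of two nonnegative ones, and each inequality row receives a nonnegative slack variable.

```python
import numpy as np

# Hypothetical canonical-form data: min cx s.t. Ax <= b (x unrestricted).
A = np.array([[1.0, 2.0],
              [3.0, 1.0]])
b = np.array([4.0, 5.0])
c = np.array([1.0, 1.0])

m, n = A.shape
# Write x = u - v with u, v >= 0 and add one slack s_i >= 0 per row, so
# that Ax <= b becomes [A  -A  I][u; v; s] = b with all variables >= 0.
A_std = np.hstack([A, -A, np.eye(m)])
c_std = np.concatenate([c, -c, np.zeros(m)])   # slacks have zero cost
# Standard form: min c_std @ z  s.t.  A_std @ z = b,  z >= 0.
# From any solution z = (u, v, s), x = u - v solves the canonical problem.
print(A_std.shape, c_std.shape)
```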
We present next as an example a particular instance of the so-called diet problem, which is a case of the LP in canonical form with reversed inequalities. The manager of a poultry ranch has three alternative chicken food products, with prices $c_1$, $c_2$, and $c_3$ per pound. Let $a_{11}$, $a_{12}$, and $a_{13}$ be the protein contents of a pound of products 1, 2, and 3, respectively, and $a_{21}$, $a_{22}$, and $a_{23}$ the corresponding mineral contents per pound. If $b_1$ and $b_2$ are the minimal amounts of protein and minerals required for every chicken in the ranch, then the inequalities

$a_{11} x_1 + a_{12} x_2 + a_{13} x_3 \ge b_1$

$a_{21} x_1 + a_{22} x_2 + a_{23} x_3 \ge b_2$

mean that an amount of $x_1$ pounds of product 1, $x_2$ pounds of product 2, and $x_3$ pounds of product 3 satisfies such minimal nutritional requirements, and a solution $x^* = (x_1^*, x_2^*, x_3^*)$ of the diet problem

$\min\ cx$

$\text{s.t. } Ax \ge b$
indicates the amounts of each product which satisfy the nutritional requirements at the lowest possible cost.
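As an illustration of how such a problem is solved in practice, the following sketch feeds a small diet problem to SciPy's linprog routine. The prices, nutrient contents, and requirements are made-up numbers (the article specifies none), and since linprog minimizes subject to $\le$ constraints, the reversed inequalities $Ax \ge b$ are passed in negated.

```python
from scipy.optimize import linprog

# Hypothetical data: prices c, nutrient contents A, requirements b.
c = [0.30, 0.25, 0.40]        # dollars per pound of products 1, 2, 3
A = [[2.0, 1.0, 3.0],         # protein per pound of each product
     [1.0, 2.0, 1.0]]         # minerals per pound of each product
b = [8.0, 6.0]                # minimal protein and mineral amounts

# A x >= b is equivalent to (-A) x <= (-b), the form linprog expects.
res = linprog(c,
              A_ub=[[-a for a in row] for row in A],
              b_ub=[-bi for bi in b],
              bounds=[(0, None)] * 3)   # amounts cannot be negative
print(res.x, res.fun)   # cheapest amounts x* and the minimal cost cx*
```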
Suppose, for instance, that the constraints of an LP problem are just

$x_i \ge l_i$ (7)

$x_i \le u_i \quad (1 \le i \le n)$ (8)

meaning just that each variable $x_i$ must be between a lower bound $l_i$ and an upper one $u_i$; the resulting polyhedron $F$ has $2^n$ vertices, resulting from all the possible choices of $x_i = l_i$ or $x_i = u_i$ for each of the n variables. Now, $2^n$ grows very fast: for n = 100, checking all the vertices exceeds the capabilities of any existing computer. The SIMPLEX method, instead of checking all the vertices, proceeds as follows. It constructs one vertex, say $\bar{x}$, and checks whether or not it is a solution (which is a rather easy task). If it is not, it does not proceed to just any other vertex, but to one which is adjacent to $\bar{x}$, which mathematically means that it shares with $\bar{x}$ all but one of its $n - m$ zero components. There are at most n adjacent vertices (corresponding to all possible choices of the nonshared column), and at least one of them, say $\hat{x}$, has an objective value $f(\hat{x}) \le f(\bar{x})$. Then $\hat{x}$ is taken as the new candidate and, if it is not a solution, the process is repeated with the vertices adjacent to $\hat{x}$, and so on, until a solution is
found, which is guaranteed to happen. How long does this procedure take? More than 50 years of practice show that, almost always, the number of visited vertices is not too high: generally less than $m \log n$, which is negligible as compared to, for example, $2^n$ (rigorous versions of this statement on the average performance of SIMPLEX can be found in Borgwardt 1987 and Smale 1983). However, Klee and Minty (1972) found a highly artificial example, with a carefully chosen objective function f and a feasible set F obtained as a slight variation of the one given by Eqns. (7)-(8), for which SIMPLEX goes through all the $2^n$ vertices before finding the solution, which is just the last one.

Despite Klee and Minty's example, SIMPLEX was absolutely superior to any other known procedure for solving LP problems in terms of computational performance. Thus, the following theoretical issue was posed: does there exist a method for LP which is never too slow, for instance one such that the number of arithmetic operations required to solve any particular LP problem does not exceed a fixed power of the size of the problem (understood as some measure of the size of c, b, and A)? SIMPLEX is not up to this requirement, because it needs more than $2^n$ operations to solve Klee and Minty's example, and $2^n$ exceeds any fixed power of n, which can be taken as a measure of the size of this specific LP problem. Methods with this property were called polynomial. The fact that no polynomial method for LP had been found in the first 30 years after the introduction of SIMPLEX made most mathematicians think that no such method could ever be devised, and therefore it was a surprise when Khachiyan presented such a procedure in 1979 (Khachiyan 1979). Unfortunately, Khachiyan's method, called the ellipsoid method, turned out to be far worse than SIMPLEX in almost all problems (though, of course, much better in the very rare cases in which SIMPLEX takes almost forever, as in Klee and Minty's example). Thus, the ellipsoid method aroused only theoretical interest, and did not replace SIMPLEX in any real-life application. Nevertheless, the existence of a polynomial method (albeit an inefficient one on average) encouraged research attempting to find a procedure which would be both polynomial and efficient in practice (i.e., good both in the average case and in the worst one). Such a method was discovered by Karmarkar in 1984 (Karmarkar 1984), and extensive progress has been achieved in this class of methods for LP (called interior point methods) since the mid-1980s (Roos et al. 1997). They differ from SIMPLEX in that, instead of considering only vertices, which lie on the boundary of the feasible polyhedron F, all the intermediate vectors tested as candidate solutions are taken in the interior of this polyhedron. Thus, these methods fully neglect the combinatorial nature of LP problems (i.e., the facts that for an LP problem with solutions there exists a vertex which is a solution, and that the number of vertices is finite).
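The combinatorial obstacle is easy to exhibit directly. The brute-force sketch below (the data and function name are my own illustration) checks every vertex of the box of Eqns. (7)-(8); it works for toy sizes, but the same loop would have to visit $2^{100}$ vertices for n = 100, which is exactly the enumeration that SIMPLEX's walk along adjacent vertices avoids.

```python
from itertools import product

def best_vertex(c, lower, upper):
    # Minimize c.x over the box of Eqns. (7)-(8) by checking all 2**n
    # vertices: exponential work, feasible only for very small n.
    best, best_val = None, float("inf")
    for v in product(*zip(lower, upper)):   # every choice x_i in {l_i, u_i}
        val = sum(ci * vi for ci, vi in zip(c, v))
        if val < best_val:
            best, best_val = v, val
    return best, best_val

# n = 3: only 8 vertices. The minimizer pushes each x_i to the bound
# favored by the sign of c_i, here (0, 1, 0) with value -2.0.
print(best_vertex([1.0, -2.0, 3.0], [0, 0, 0], [1, 1, 1]))
```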
Interior point methods are quite efficient, but none of the many which have been proposed has turned out to be universally superior to all the others (nor to SIMPLEX, for that matter). Currently, the situation in LP regarding computational methods is similar to that in most other areas of applied mathematics: instead of a method universally superior to all others, as was the case for LP between 1947 and 1984, a variety of methods is available, some of which are clearly superior, in terms of computational performance, for some specific classes of LP problems, but inferior to at least one other method for other types of LP problems.

In nonlinear programming (NLP) the linear data of LP are replaced by general functions: the problem, stated as Eqns. (9)-(10), consists of minimizing a possibly nonlinear function $f(x)$ subject to the m constraints $g_i(x) \ge 0$ ($1 \le i \le m$), which may be nonlinear as well. Nonlinearity has a price: even in quadratic programming, where f is quadratic, with a matrix D of second-order coefficients, and the constraints remain linear, no known method is able to provide exact solutions for an arbitrary quadratic programming problem.

Another very important theoretical issue refers to necessary and/or sufficient conditions for an n-vector x* to be a solution of Eqns. (9)-(10). For the case of equality constraints (that is, when all the inequalities in Eqn. (10) are replaced by equalities) such a condition was known in the early nineteenth century: if we denote by $\nabla f(x)$ the n-vector whose j-th component is the derivative of f with respect to its j-th variable at the n-vector x, and we use the same definition for $\nabla g_i(x)$ ($1 \le i \le m$), then the so-called Lagrangian condition states that

$\nabla f(x) = u_1 \nabla g_1(x) + \cdots + u_m \nabla g_m(x)$ (12)

where the real numbers $u_1, \ldots, u_m$ are called Lagrange multipliers. In fact, if x solves Eqns. (9)-(10), then Lagrange multipliers do exist. Conversely, the existence of Lagrange multipliers for a given n-vector x*, together with an additional condition involving the second-order derivatives of f and the $g_i$'s at x*, guarantees that x* is a local solution of the problem, meaning that $f(x^*) \le f(x)$ for all x satisfying Eqn. (10) and close enough to x* (i.e., such that all the differences $|x_i^* - x_i|$ are small enough).

The extension of the Lagrangian condition to the case of inequality constraints was the first theoretical achievement of NLP. The resulting conditions, called the Karush-Kuhn-Tucker conditions, were published by Kuhn and Tucker (1951), and had been anticipated in Karush's MSc dissertation. They state that the Lagrange multipliers, besides satisfying the Lagrangian condition, Eqn. (12), must be nonnegative, and also that $u_i = 0$ if $g_i(x^*) \ne 0$. Also, an additional condition, called a constraint qualification, must be imposed on the feasible region, that is, on the set of n-vectors which satisfy Eqn. (10). The 150 years between the establishment of the conditions for the equality- and the inequality-constrained cases are again just a consequence of classical mathematicians' lack of interest in inequalities, as discussed above.
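The Karush-Kuhn-Tucker conditions can be checked numerically on a toy instance (my own illustration, not an example from the article): at a solution where a constraint is active, the gradient of f is a nonnegative combination of the gradients of the active constraints.

```python
import numpy as np

# Toy NLP in the form (9)-(10):
#   min x1**2 + x2**2   s.t.   g(x) = x1 + x2 - 2 >= 0.
# The solution is x* = (1, 1), where the single constraint is active.
def grad_f(x): return np.array([2.0 * x[0], 2.0 * x[1]])
def grad_g(x): return np.array([1.0, 1.0])
def g(x):      return x[0] + x[1] - 2.0

x_star = np.array([1.0, 1.0])

# Lagrangian condition (12): grad f(x*) = u * grad g(x*); least squares
# recovers the multiplier u.
u, *_ = np.linalg.lstsq(grad_g(x_star).reshape(-1, 1),
                        grad_f(x_star), rcond=None)
print(u)          # [2.] -- nonnegative, as the KKT conditions require
print(g(x_star))  # 0.0 -- the constraint is active, so u need not vanish
```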
A third important theoretical issue was the extension to NLP of the duality theory of LP. This was achieved only in the convex case (Rockafellar 1970), that is, when both f and all the $g_i$'s are convex. We recall that a function h is convex when

$h(\lambda x + (1 - \lambda) y) \le \lambda h(x) + (1 - \lambda) h(y)$

for all n-vectors x and y and all real numbers $\lambda$ between 0 and 1.

Though NLP is used extensively in many real-life applications, it is much less widespread than LP. This is due to the facts that NLP lacks a universal method, like SIMPLEX for LP, so that the procedure to be used must be chosen according to the nature of the specific problem, and that NLP methods in general do not work as black boxes, fed with the problem data and handing over the solution, but rather must be
fine-tuned, choosing appropriate parameters, properly interpreting the output of the method, etc. Thus, an expert in Operations Research, unavailable in most moderate-sized institutions, is required for the successful implementation and solution of NLP problems. On the other hand, the impact of NLP on mathematics itself has been much deeper than that of LP. Important results in several classical areas, like convex analysis and variational analysis, have been drastically improved as a consequence of the development of NLP theory (see Rockafellar 1970, Rockafellar and Wets 1998).

Methods for NLP typically generate n-vectors which approximate a solution of the problem and approximate Lagrange multipliers, as defined by Eqn. (12). Efficient implementations of any of these methods usually require knowledge of the first and second derivatives of f and the $g_i$'s. Also, except in the convex case, the generated n-vectors in general approximate just local solutions, as discussed in Section 5. Recent research in NLP has attempted to overcome these two limitations. Nonsmooth optimization is devoted to devising methods which work for problems whose data functions do not have derivatives (Clarke 1983), while global optimization addresses the issue of finding global solutions, that is, n-vectors x* satisfying Eqn. (10) and such that $f(x^*) \le f(x)$ for all other n-vectors x which also satisfy Eqn. (10), and not only for n-vectors close to x* (Horst and Tuy 1993).
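The local/global distinction can be made concrete with a small sketch. The objective function and the multistart strategy below are my own illustrative choices; multistart is a simple heuristic, not one of the deterministic strategies surveyed by Horst and Tuy.

```python
import numpy as np
from scipy.optimize import minimize

# Toy multimodal objective: local minimum near x = 2.1, global near x = -2.35.
f = lambda x: 0.1 * x[0]**4 - x[0]**2 + 0.5 * x[0]

local = minimize(f, x0=[2.0])    # a local method stalls at the nearby minimum
best = min((minimize(f, x0=[s]) for s in np.linspace(-3.0, 3.0, 7)),
           key=lambda r: r.fun)  # crude multistart heuristic
print(local.x, local.fun)        # approx.  2.1, -1.4  (a local solution only)
print(best.x, best.fun)          # approx. -2.35, -3.65 (the global solution)
```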
See also: Linear Algebra for Neural Networks; Linear Hypothesis

Bibliography
Avriel M 1976 Nonlinear Programming: Analysis and Methods. Prentice Hall, Upper Saddle River, NJ
Bertsekas D P 1982 Constrained Optimization and Lagrange Multiplier Methods. Academic Press, New York
Boot J C G 1964 Quadratic Programming. North Holland, Amsterdam
Borgwardt K H 1987 The Simplex Method: A Probabilistic Analysis. Springer, Berlin
Clarke F H 1983 Optimization and Nonsmooth Analysis. Wiley, New York
Dantzig G B 1949 Programming of interdependent activities: II. Mathematical model. Econometrica 17: 200-211
Dantzig G B 1963 Linear Programming and Extensions. Princeton University Press, Princeton, NJ
Dantzig G B, Wolfe P 1960 Decomposition principle for linear programs. Operations Research 8: 101-111
Dennis J E, Moré J J 1977 Quasi-Newton methods: motivation and theory. SIAM Review 19: 46-89
Dennis J E, Schnabel R B 1983 Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, Upper Saddle River, NJ
Gale D 1960 The Theory of Linear Economic Models. McGraw-Hill, New York
Gomes F M, Maciel M C, Martínez J M 1999 Nonlinear programming algorithms using trust regions and augmented Lagrangians with nonmonotone penalty parameters. Mathematical Programming 84: 161-200
Hillier F S, Lieberman G J 1980 Introduction to Operations Research. Holden-Day, San Francisco
Horst R, Tuy H 1993 Global Optimization: Deterministic Approaches. Springer, Berlin
Kantorovich L V 1960 Mathematical methods in the organization and planning of production. Management Science 6: 366-422
Karmarkar N 1984 A new polynomial-time algorithm for linear programming. Combinatorica 4: 373-395
Khachiyan L G 1979 A polynomial algorithm in linear programming. Doklady Akademii Nauk SSSR 244: 1093-1096
Klee V, Minty G J 1972 How good is the simplex algorithm? In: Shisha O (ed.) Inequalities III. Academic Press, New York, pp. 159-175
A. N. Iusem
Linear Hypothesis
1. Introduction
The term linear hypothesis is often used interchangeably with the term linear model. Statistical methods using linear models are widely used in the behavioral and social sciences, e.g., regression analysis, analysis of variance, analysis of covariance, multivariate analysis, time series analysis, and spatial data analysis. Linear models provide a flexible tool for data analysis and useful approximations for more complex models. A common object of linear modeling is to find the most precise linear model that explains the data, to use that model to predict future observations, and to interpret that model in the context of the data collection. Traditionally, analysis of variance models have been used to analyze data from designed experiments, while regression analysis has been used to analyze data from observational studies, but the techniques of both analysis methods apply to both kinds of data. See also Experimental Design: Overview and Observational Studies: Overview.
2. Definition
The linear hypothesis is that the mean (average) of a random observation can be written as a linear combination of some observed predictor variables. For example, Coleman et al. (1966) provides observations on various schools. The dependent variable y consists of the average verbal test score for sixth-grade students. The report also presents predictor variables. A composite measure of socioeconomic status $x_1$ is based on fathers' and mothers' education, family size and intactness, home items, and percent of fathers who are white collar. Staff salaries per pupil is $x_2$. The average score on a verbal test given to the school's teachers is $x_3$. Denoting different schools using the subscript i and the mean of $y_i$ by $m_i$, a linear hypothesis states that for some unknown numbers (parameters) $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$,

$m_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$

Mosteller and Tukey (1977, pp. 326, 566) and Mosteller et al. (1983, pp. 408-20) give excerpts and analysis of the data.

In other applications, the predictors only identify whether an observation is in some group. For example, in 1978 observations $y_{ij}$ were collected on the age at which people in Albuquerque committed suicide; see Koopmans (1987, p. 409). Here i is used to identify the person's group membership (Hispanic, Native American, non-Hispanic Caucasian), and j identifies individuals within a group. The three categories are taken to be mutually exclusive for the present discussion (although the US government now allows individuals to identify themselves with multiple races in various surveys and the decennial census). We can define group-identifier predictor variables. Let $\delta_{i1}$ take the value 1 if an individual belongs to group 1 (Hispanic) and 0 otherwise, with similar predictors to identify the other groups, say $\delta_{i2}$ and $\delta_{i3}$ for Native Americans and non-Hispanic Caucasians. Note that the predictor variables do not depend on the value of j identifying individuals within a group. Denoting the mean of $y_{ij}$ by $m_{ij}$, a linear hypothesis states that for some unknown parameters $\mu$, $\alpha_1$, $\alpha_2$, and $\alpha_3$,

$m_{ij} = \mu + \alpha_1 \delta_{i1} + \alpha_2 \delta_{i2} + \alpha_3 \delta_{i3}$

Since two of the $\delta$'s are always zero, this model is often written more succinctly as

$m_{ij} = \mu + \alpha_i$

A linear hypothesis is usually combined with other assumptions about the observations y. Most commonly, the assumptions are that the observations are independent, have the same (unknown) variance $\sigma^2$, and have normal (Gaussian) distributions. For the two examples, these assumptions are written

$y_i$ indep. $N(m_i, \sigma^2)$ and $y_{ij}$ indep. $N(m_{ij}, \sigma^2)$

where, for example, $N(m_i, \sigma^2)$ indicates a normal distribution with mean $m_i$ and variance $\sigma^2$.
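Both hypotheses are least-squares problems once a design matrix is set up, as the sketch below shows. The numbers are fabricated stand-ins (the school and Albuquerque data are not reproduced here), and the variable names are mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Regression form: m_i = beta0 + beta1*x_i1 + beta2*x_i2 + beta3*x_i3.
X = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])  # 1, x1, x2, x3
y = X @ np.array([30.0, 5.0, 1.0, 2.0]) + rng.normal(scale=0.5, size=20)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # estimates of the betas
print(beta)

# Group form: m_ij = mu + alpha_i, written with 0/1 dummy predictors delta.
groups = np.array([0, 0, 1, 1, 1, 2, 2])       # group of each observation
ages = np.array([52.0, 57.0, 45.0, 41.0, 48.0, 60.0, 63.0])
D = np.column_stack([np.ones(7),
                     (groups[:, None] == np.arange(3)).astype(float)])
# D carries the intercept (mu) plus one delta column per group; the delta
# columns sum to the intercept column, so D is rank deficient and lstsq
# returns one of many equivalent parameterizations of the same fit.
coef, *_ = np.linalg.lstsq(D, ages, rcond=None)
print(D @ coef)   # fitted means m_ij: the group averages, repeated
```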
Incorporating these additional assumptions, the linear