L1 Norm Based Computational Algorithms

This paper gives a rather general review of the L1 norm algorithms. The chronology and historical development of the L1 norm estimation theory for the period of 1632-1928 will be surveyed and the algorithms belonging to the after 1928 period will be categorized into three main classes of direct descent, simplex type, and other algorithms.

Historical development  The origin of L1 norm estimation may be traced back to Galilei (1632). In determining the position of a newly discovered star, he proposed the least possible correction in order to obtain a reliable result (see, Ronchetti (1987) for some direct quotations). Boscovich (1757) for the first time, formulated and applied the minimum sum of absolute errors for obtaining the best fitting line given three or more pairs of observations for a simple two-variable regression model. He also restricts the line to pass through the means of the observation points. That is, n min : Σ │yi-ß0-ß1xi1│ ß0,ß1 i=1 n (1) s.to: Σ (yi-ß0-ß1xi1)=0 i=1 Boscovich (1760) gives a simple geometrical solution to his previous suggestion. This paper has been discussed by Eisenhart (1961) and Sheynin (1973). In a manuscript, Boscovich poses the problem to Simpson and Simpson gives an analytical solution to the problem (see, Stigler (1984)). Laplace (1793) provides an algebraic formulation of an algorithm for the L1 norm regression line, which passes through the centroid of observations. In Laplace (1799), an extension of L1 norm regression to observations with different weights has also been discussed. Prony (1804) gives a geometric interpretation of Laplace's (1799) method and compares it with other methods through an example. Svanberg (1805) applies Laplace's method in determining a meridian arc, and Von Lindenau (1806) uses this method in determination of the elliptic meridian. Gauss (1809) suggests the minimization of the sum of absolute errors without constraint. He concludes that this criterion necessarily sets m of the residuals equal to zero, where m is the number of parameters, and further, the solution obtained by this method is not changed if the value of the dependent variable is increased or decreased without changing the sign of the residual. This conclusion is recently discussed by Narula and Wellington (1985). He also noted that Boscovich or Laplace estimators which minimize the sum of absolute residuals with zero-sum of residuals constraint, necessarily set m-1 of the residuals equal to zero (see, Stigler (1981), Farebrother (1987b)). Mathieu (1816) used Laplace's method to compute the eccentricity of the earth. Van Beeck-Calkoen (1816) advocates the using of the least absolute values criterion in fitting curvilinear equation obtained by using powers of the independent variable. Laplace (1818) adapted Boscovich's criterion again and gave an algebraic procedure (see, Farebrother (1987b)). Let 1 x and y be the means of xi1 and yi then, ß0 = y -ß1 1 x (2) Value of ß1 is found by, n min: S = Σ │yi~ -ß1xi1~│ (3) ß1 i=1 where xi1~ and yi~ are deviations of xi1 and yi from their means respectively. By rearranging the observations in descending order of yi~/xi1~ values, Laplace notes that S is infinite when ß1 is infinite and decreases as ß1 is reduced. ß1 reaches the critical value yt~/xt1~ when it again begins to increase. This critical value of ß1 is determined when, t-1 n t Σ │xi1~│ < ½ Σ │xi1~│ ≤ Σ │xi1~│ This procedure to find ß1 is called weighted median and has been used in many other algorithms such as Rhodes (1930), Singleton (1940), Karst (1958), Bloomfield and Steiger (1980), Bidabad (1987a,b,88a,b,89a,b) later. Bidabad (1987a,88a,89a,b) derives the condition (4) via discrete differentiation method.
Fourier (1824) formulates least absolute residuals regression as what we would now call linear programming; that is the minimization of a linear objective function subject to linear inequality constraints.
Edgeworth (1883) presents a philosophical discussion on differences between minimizing mean square errors and mean absolute errors. Edgeworth (1887a,b) proposed a simple method for choosing the regression parameters. By fixing m-1 of the parameters, he used Laplace's procedure to determine the optimal value of the remaining parameter. Repeating this operation for a range of values for m-1 fixed parameters, he obtained a set of results for each of m possible choices of the free parameters. Edgeworth drops the restriction of passing through the centroid of data. Turner (1887) discusses the problem of nonunique solutions under the least absolute error criterion as a graphical variant of Edgeworth (1887a) as a possible drawback to the method. Edgeworth (1888) replies to Turner's criticism by proposing a second method for choosing the two parameters of least absolute error regression of a simple linear model which makes no use of the median loci of his first method. Edgeworth, in this paper, followed Turner's suggestion for graphical analysis of steps to reach the minimum solution.
Before referring to double median method of Edgeworth (1923), it should be noted that Bowley (1902) completes the Edgeworth's (1902) paper by a variant of double median method which presented after him by Edgeworth (1923). This variant ignores the weights attached to errors. Edgeworth (1923) discussed the more general problem of estimating the simple linear regression parameters by minimizing the weighted sum of the absolute residuals. He restates the rationale for the method and illustrates its usage through several examples. He also considers the nonunique solution problem. His contribution is called double median method.
Estienne  proposes replacing the classical theory of errors of data based on least squares with what he calls a rational theory based on the least absolute residual procedure. Bowley (1928) summarizes the Edgeworth's contributions to mathematical statistics, which includes his work on L1 norm regression. Dufton (1928) also gives a graphical method of fitting a regression line. Farebrother (1987b) summarizes the important contributions to L1 norm regression for the period of 1793-1930. For more references see also Crocker (1969), Harter (1974a,b,75a,b,c,76), Dielman (1984). Up to 1928, all algorithms had been proposed for simple linear regression. Though some of them use algebraic propositions, are not so organized to handle multiple L1 norm regression problem. In the next section, we will discuss the more elaborated computational methods for simple and multiple L1 norm regressions not in a chronological sense; because many digressions have occurred. We may denote the period of after 1928 the time of modern algorithms in the subject of L1 norm regression.

Computational algorithms
Although a closed form of the solution of L1 norm regression has not been derived yet, many algorithms have been proposed to minimize its objective function (see, Cheney (1966), Chambers (1977), Dielman and Pfaffenberger (1982,84)). Generally, we can classify all L1 norm algorithms in three major categories as, i) Direct descent algorithms ii) Simplex type algorithms iii) Other algorithms which will be discussed in the following sections sequentially.

Direct descent algorithms
The essence of the algorithms which fall within this category is finding a steep path to descend down the polyhedron of the L1 norm regression objective function. Although the Laplace's method (explained hereinbefore) is a special type of direct descent algorithms; the origin of this procedure in the area of L1 norm can be traced back to the algorithms of Edgeworth which were explained in the previous section.
Rhodes (1930) found Edgeworth's graphical solution laborious; therefore, he suggested an alternative method for a general linear model, which may be summarized as follows (see, Farebrother (1987b)). Suppose, we have n equations with m<n unknown parameters. To find L1 norm solution of this overdetermined system of equations; Step 1) Select m-1 equations arbitrarily.
Step 2) Solve these equations for m-1 parameters.
Step 3) Estimate the remaining m th parameter by Laplace's method.
Step 4) Recognize the resulting equation in step 3 and add it to the m-1 equations in step 1.
Step 5) If the set of m equations has recurred m times then stop; otherwise discard the oldest equation and go to step 2. Rhodes (1930) explained his algorithm by an example and did not give any proof for convergence. Bruen (1938) reviews the L1 norm regression methods presented by earlier authors. He also compares L1, L2, and L∞ norms regressions. Singleton (1940) applied Cauchy's steepest descent method (see, Panik (1976)) for the general linear L1 norm regression. In this paper, a geometrical interpretation of gradient on L1 norm polyhedron and some theorems about existence and uniqueness of solution and convexity property all were given. Although the paper has not been clearly written, the following steps summarize his algorithm.
i=1 The t value is the length of movement along the direction of the gradient.
Step 6) Test the optimality condition. Singleton gives a condition to stop, but it is not quite clear. In this step, any other criterion relevant to L1 norm function may be displaced.
Step 7) This step is not well defined by Singleton. In this phase, he tries to choose the best gradient among usable gradients.
Without this step, algorithm is still operational, because, the steps are all standards of Cauchy steepest descent method and instead of choosing the best gradient, we can proceed by going to step 2. Bejar (1956,57) focuses on consideration of residuals rather than on the vector of parameters. He puts forth a procedure with the essence of Rhodes (1930). However, he is concerned with two and three parameter linear models. Karst (1958) gives an expository paper for one and two parameter regression models. In his paper, Karst without referring to previous literature actually reaches to the Laplace proposition to solve the one parameter restricted linear model, and for the two-parameter model, he proposed an algorithm similar to that of Rhodes (1930). His viewpoint is both geometrical and algebraic, and no proof of convergence for his iterative method is offered. Sadovski (1974) uses a simple "bubble sort" procedure and implements Karst algorithm in Fortran. Sposito (1976) pointed out that the Sadovski's program may not converge in general. Sposito and Smith (1976) offered another algorithm to remove this problem. Farebrother (1987c) recodes Sadovski's implementation in Pascal language with some improvement such as applying "straight insert sort". Usow (1967b) presents an algorithm for L1 norm approximation for discrete data and proves that it converges in a finite number of steps. A similar algorithm on L1 norm approximation for continuous data is given by Usow (1967a). Given the function f(x) defined on a finite subset X={x1,...,xn} of an interval on the real line and also linearly independent continuous functions Φ1(x),...,Φm(x) where m<n; consider the "polynomial" m L(ß,x)=ΣßjΦj(x). In Usow (1967b) the following function is to be minimized: j=1 n n m min: where ß is a vector of size m. The above form is general; if we let f(xi)≡yi and Φj(xi)≡xij the function S becomes the standard linear regression objective function. Now let the set K be: K={(ß,d)│(ß,d)εE m+1 ,S≤d} (6) Then K is a convex polytope, the vertices of which occur only when L(ß,x)-f(x) is zero at m or more points of X. The Usow's algorithm is to descend on K from vertex to vertex along connecting edges of the polytope in such a way that certain intermediate vertices are by-passed. This descent continues until the lowest vertex (ß*,d*) is reached. To clarify the algorithm assume that we are at the vertex (ß k ,d k ) on K and the polynomial L(ß k ,x) interpolates m points of X denoted by U k ={u1 k ,...,um k }. Thus, m The m coefficients aj i are calculated as follows. Form the matrix (Φj(ui k )) for i,j=1,...,m. Let π(x) and Φ(x) be two mx1 vectors for any xεX whose elements are πi(x) and Φi(x) for i=1,...,m respectively. Hence, we can derive, π(x)=[(Φj(ui k )) T ]-1Φ(x) (10) aj i 's are the elements of the i th row of the matrix [(Φj(ui k )) T ]-1, where the superscript T denotes transposition. Let ei be a zero vector of size m where its i th element is equal to one. Then if for some δ, S(F k -δei)<S(Fk), there is a Tj such that Tjδ>0 and S(F k -Tj ei)<S(Fk). Also, S(Fk -Tjei) = min {S(Fk -tei)} (11) t and ((F k -Tjei),S(F k -Tjei)) is a vertex. On the other hand, a point ui k εU k may be replaced by a point ui k+1 ε{X-U k } such that the polynomial L(ß ki ,x) interpolating Ui k ={u1 k ,...,ui k+1 ,...,um k } and S(ß ki )<S(ß k ). S(ß Ki ) is the minimum of all norms obtained if ui k were replaced by the different points of the set {X-U k }, as indicated by the above relation.
In going from vertex (ß k ,S(ß k )) to (ß ki ,S(ß ki )), one or more vertices on K might have been by-passed. The nearest vertex to (F k ,S(F k )) and below it on the edge parallel to the i th parameter space coordinate axis, say the vertex ((F k -trei),S(F k -trei)), is obtained from: , then S(F k ) could not be reduced by moving on K along the edge parallel to the i th parameter space coordinate axis and ui k should not be replaced by another point from the set {X-U k }. This iteration is repeated m times, once for each point in U k in succession. The whole cycle is then repeated a finite number of times until the solution (ß*,d*) is reached (see also, Abdelmalek (1974)).
Relation of this algorithm with the simplex method has been discussed by Abdelmalek (1974). He shows that Usow's algorithm is completely equivalent to a dual simplex algorithm applied to a linear programming model with nonnegative bounded variables, and one iteration in the former is equivalent to one or more iterations in the latter. Bloomfield and Steiger (1980) devise an efficient algorithm based on the proposition of Usow explained above. Sharpe (1971) by applying the L1 norm regression to the portfolio and its rate of return, gives an algorithm for the two parameters linear regression model. He argues that for the simple model with the objective function, it must be possible to assign half of the points above and half below the regression line. With any given value of ß~1, we can derive ß0 as the median of ß0i=yi-ß~1xi1. Now segregate the points, such that: (15) iabove ibelow Rearranging the terms and note that nabove=nbelow gives, The overall solution strategy may now be formulated as follows. Any value of ß1 may be chosen at the outset. By computing ß0i, segregate the points above and below the line. By using equations (16) through (18), the corresponding segment of the S and ß1 relation is calculated. Sign of k2 indicates the appropriate direction for the next iteration. If k2 is positive, only smaller values of ß1 need be considered. If k2 is negative, only larger values of ß1 need be considered. If k2 is zero, the initial value of ß1 is a solution.
When a borderline (median of ß0i) and the direction of search is determined, the nearest intersection of the present borderline with another should be found. The calculations can be reduced by comparing slopes (xi1 values) to determine whether or not two lines intersect in the region of interest (thus avoiding needless division operations). The new values k1 and k2 must be computed. If this alteration causes k2 to change sign, the solution has been obtained, and the algorithm stops.
Rao and Srinivasan (1972) interpret Sharpe's procedure as the solution of parametric dual linear programming formulation of the problem. They give an alternate and about equally efficient procedure for solving the same problem. Brown (1980) gives a distinct but similar approach to those of Edgeworth (1923) and Sharpe (1971). He emphasizes on the median properties of the estimator. The similarity comes from the graphical approach of the three authors. Kawara (1979) also develops a graphical method for the simple regression model. Bartels and Conn and Sinclair (1978) apply the method of Conn (1976) to the L1 norm solution of an overdetermined linear system. Their approach is a minimization technique for piecewise differentiable functions. The algorithm may be reduced as follows.

m
Let the orthogonal projector onto N be denoted by PN.
τ This algorithm has also been modified for the case of degeneracy (see also, Bartels and Conn and Sinclair (1976)). Bartels and Conn (1977) showed that how L1 norm, restricted L1 norm, L∞ norm regressions, and general linear programming can all be easily expressed as a piecewise linear minimization problem. Let U and v be of sizes pxm and px1, respectively. Consider the function: Where yi-xi T ß represents the i th element of the residual vector y-Xß and vj-uj T ß represents the j th element of the residual vector v-Uß. To minimize Φ with respect to ß the following steps should be taken.
Step 2) Choose Θ (k) ≥0 to obtain the largest possible decrease in Φ.
Step 3) Let ß (k+1) =ß (k) +Θ (k) δ (k) . When h=0 and the sum on j is vacuous in the Φ(ß) function (19), the resulting simplification of the above steps corresponds precisely to the algorithm proposed by Bartels and Conn and Sinclair (1978). Other related problems can be solved by modification of Φ(ß) by a quantity µ to obtain a parameterized family of piecewise linear functions Φµ(ß), and take following steps to find the minimum solution.
Step 1) Minimize Φµ(ß) with respect to ß according to the above procedure.
Step 2) Stop if a prescribed terminating condition on ß is met; otherwise set µ=µ/10 and go to step 1. The contribution of this paper is putting a wide class of problems in the mould of two algorithms mentioned above. The techniques are easily extended to the models with norm restrictions. Bloomfield and Steiger (1980) proposed a descent method for the L1 norm multiple regression. Their algorithm is also explained in Bloomfield and Steiger (1983). In some steps, this algorithm is related to that of Singleton (1940) and Usow (1967b). The basis of this method is to search for a set of m observations which locate on the optimal L1 norm regression. This set is found iteratively by successive improvement. In each iteration, one point from the current set is identified as a good prospect for deletion. This point is then replaced by the best alternative. The novel features of this method are in an efficient procedure for finding the optimal replacement and a heuristic method for identifying the point to be deleted from the pivot.
Denote the x1 T ,...,xm T as rows of the independent variables design matrix which correspond to the current set of points 1,...,m; and xm T for replacement. Set, yi = xi T ß i=1,...,m-1 (20) Redefine ß as, ß = ß 0 + tδ (21) Where ß 0 is an arbitrary member of the set and δ vector obeys the following system, xi T δ = 0 i=1,...,m-1 (22) Given this set of points, the optimum value of S may be found by minimizing the following expression with respect to the scalar t.
n Σ │yi -xi T (ß 0 + tδ)│ (23) i=1 Rearranging the terms leads to: n Σ │wi││ri -t│ (24) i=1 where wi=xi T δ and ri=(yi-xi T ß 0 )/(xi T δ). Value of t may be found by Laplace's weighted median method. Bloomfield and Steiger propose a weighted modification of partial quicksort procedure of Chamber (1971) to find the t value efficiently. The data point corresponding to the weighted median replaces xk T . By using the yi from (20) and the additional m th equation, the value of ß 0 is computed. Vector δ is determined up to scalar multiples of (22). The new set of parameters values are computed by (21). Now a point should be deleted. Bloomfield and Steiger do not give an assured way to identify this point. They propose a heuristic method based on gradients and use the following quantity: Once ρ is calculated for each candidate point for deletion, and delete the point for which ρ is largest.
To start the algorithm, any set of m rows of X may be chosen, with the appropriate ß 0 . Add variables stepwise until a fit for ß 0 and the corresponding set of m points are derived. At each intermediate step, the fit involves k variables where 0≤k<m and the corresponding set of k data points with zero residuals. Improving the fit by entering a new variable, thus increasing k to k+1. At each stage, it is the measure ρ that determines whether we set up to a larger model or improve the current one without setting up. Suppose at the current stage we are dealing with k variables. If the value of ρ is largest at the variable p, which p is not in the set of k variables above, k is increased to k+1 by entering the variable p and improving the fit with k+1 variables. If p is in the current set of k variables and ρ is in the largest value, the corresponding point to p is deleted and replaced in a manner described before. In this paper relationship of this algorithm to linear programming is also discussed. Seneta and Steiger (1984) proposed an algorithm for the L1 norm solution of a slightly overdetermined system of equations. Their proposition is based on the above algorithm of Bloomfield and Steiger. It is more efficient than the former if m is near n. Given (xi,yi)εR m+1 , i=1,...,n and k=n-m, X=(x1,...,xn) T , their algorithm may be described as follows: Step 1) Renumber the rows of (X|y) such that XN the bottom m rows of X, is invertible. Solve the k linear equations system XN T N=-XT T for N, where XT denotes the top k rows of X.
Step 2) Let D=(A|c) where A=(I|N) is of size kxn and c=Ay.
Step 16) Divide row p of D by Dp σ c(q).
Step 20) Go to step 5. Seneta (1983) reviews the iterative use of weighted median to estimate the parameters vector in the classical linear model when the fitting criterion is L1 norm and also Cauchy criterion. Wesolowsky (1981) presents an algorithm for multiple L1 norm regression based on the notion of edge descent along the polyhedron of the objective function. This algorithm is closely related to those of Rhodes (1930) and Bartels and Conn and Sinclair (1978) which explained before. Consider the multiple linear regression as before. Select a set of m points (xj1 I ,...,xjm I ,yj I ).
The following system of equations can be solved for a unique set of coefficients. m yj I -Σ ßhxjh I = 0, j=1,...,m (26) h=1 An edge is formed of any subset J consisting of m-1 equations. To minimize along an edge, set: Let ß p be formed from ß by deleting ß p and let x p be the p th column in X J and let X pJ be formed by removing x p from X J . Then for a given ß p it can be shown that, ß p = X pJ -1 Y J -ß p X pJ -1 x p (28) Let the elements of ß p be ß q = r q -s q ß p for q ≠ m (29) Substitution in the L 1 norm objective function gives, Now the following steps should be taken.
Step 2) Set k=k+1. Obtain ß p for the smallest p from (30) by using the weighted median procedure. Set ß p (k) =ß p and let i be the index which defines the lower weighted median ß p in (30) for the lowest possible p.
In this paper, Wesolowsky also discusses the problem of multicolinearity and gives an appropriate solution.
Josvanger and Sposito (1983) modify Wesolowsky's algorithm for the two-parameter simple linear regression model. The modification is an alternative way to order observations instead of sorting all of them to find the necessary weighted median value. Suppose the problem has been reduced to a weighted median problem. They place smaller values of factors to be sorted with corresponding weights below ß 1 (k-1) and larger or equal values above it, then recheck the inequalities (4) of the weighted median. If the inequalities do not satisfy, then an appropriate adjustment is made. In particular, if the right-hand side is overly weighted, then the weight corresponding to the smallest sorting factor is transferred to the left-hand side, and the check is made again. A computer program for this algorithm is also given by the authors.
"Generalized gradient" method introduced by Clarke (see, Clarke (1983)) is a general procedure for nonsmooth optimization functions and problems (see, Osborne and Pruess and Womersley (1986)). A subclass of this method is called "reduced gradient" explained by Osborne (1985) is a general algorithm which contains linear programming, piecewise linear optimization problems, and polyhedral convex function optimization algorithms inside. The reduced gradient algorithm is a special case of descent method, which possesses two important characteristics. Identify direction and taking a step in this direction to reduce the function value (see also, Anderson and Osborne (1976), Osborne andWatson (1985) Osborne (1985,87)). The algorithms of Bartels and Conn and Sinclair (1978), Armstrong and Frome and Kung (1979), Bloomfield and Steiger (1980) are all special cases of reduced gradient method. Imai and Kato and Yamamoto (1987) present a linear time algorithm for computing the two-parameter L 1 norm linear regression by applying the pruning technique. Since the optimal solution in the ß 0 xß 1 plane lies at the intersection of data lines, so, at each step, a set of data lines which does not determine the optimum solution are discarded. In this paper algebraic explanation of the problem is also offered. Pilibossian (1987) also gives an algorithm similar to Karst (1958) for the simple two-parameter linear L 1 norm regression. Bidabad (1987a,b,88a,b) proposed descent methods for the simple and multiple L 1 norm regressions. These algorithms, with many improvements, will be discussed by Bidabad (1989a,b). Bidabad (1989a,b) introduces four descent algorithms which two of them are for simple, and two others are for multiple regression models.His first algorithm is crude and tries to check many points to find the optimal solution. The second algorithm is more efficient. Algorithm three is a partial descent procedure for the general linear model. Algorithm four is a full descent method which has many proved properties. He also proves the convergence of all the above four algorithms.

Simplex type algorithms
The essence of linear programming in solving L 1 norm problem may be found in the work of Edgeworth (1888). Harris (1950) suggested that the L 1 norm estimation problem is connected with linear programming. Charnes and Cooper and Ferguson (1955) formulated the problem as a linear programming model. This article is the first known to use linear programming for this case. Adaptation of linear programming to L 1 norm estimation problem is shown below: min: 1 n T (w+v) ß s.to: Xß+I n (w-v)=y (31) w,v≥0 ß unrestricted in sign where 1 n is a vector of size nx1 of 1's and I n is a n th order identity matrix. The vectors v and w are of size nx1 and their elements may be interpreted as vertical deviations above and below the fitted regression hyperplane respectively. This problem has n equality constraints in m+2n variables. When n is large, this formulation generally requires a large amount of storage and computation time. Wagner (1959) shows that the formulation of the L 1 norm regression may be reduced to m equality constraints linear programming problem. Thus, this dual formulation reduces n equations of primal form to m equations of dual form and considerably reduces the storage and computation time.
Fisher (1961) reviews the formulation of the L 1 norm estimation in relation to the primal form of linear programming. Barrodale and Young (1966) developed a modified simplex algorithm for determining the best fitting function to a set of discrete data under the L 1 norm criterion. The method is given as Algol codes (for critics see, ). Davies (1967) demonstrates the use of the L 1 norm regression estimates. Rabinowitz (1968) also discusses the application of linear programming in this field. Crocker (1969) cautions against using the L 1 norm criterion merely to restrain unwanted negative coefficient estimates which occur in the least squares regression. Multicolinearity is one of the cases which causes this result. Robers and Ben-Israel (1969) by using interval linear programming, proposed an algorithm to solve the L 1 norm estimation problem. Rabinowitz (1970), Shanno and Weil (1970) discuss some connections between linear programming and the approximation problem. Barrodale (1970) summarizes the linear and nonlinear L 1 norm curve fitting on both continuous and discrete data. Spyropoulos and Kiountouzis and Young (1973) suggest two algorithms for fitting general functions and particularly fast algorithm with minimum storage requirements for fitting polynomials based on the algebraic properties of linear programming formulation. Robers and Robers (1973) have supplied a special version of the general method of Robers and Ben-Israel (1969), which is designed specifically for the L 1 norm problem. A related Fortran code is also provided. Barrodale and Roberts (1973) present a modification of the simplex method, which needs a smaller amount of storage, and by skipping over simplex vertices is more efficient than the usual simplex procedure. Define the vector ß as a difference of two nonnegative vectors c and d; their formulation can be stated as follows, Because of the relationships among variables, the computation can be performed by using only (n+2)x(m+2) amount of array storage, including labels for the basic and non-basic vectors. An initial basis is given by w if all y i are nonnegative. If a y i is negative, the sign of the corresponding row is changed, and the unit column from the corresponding element of v is taken as part of the basis. The algorithm is implemented in two stages. The first stage restricts the choice of the pivotal column during the first m iterations to the elements of the vector c j and d j according to the associated maximum nonnegative marginal costs. The vector that leaves the basis causes the maximum decrease in the objective function. Thus the pivot element is not necessarily the same as in the usual simplex. The second stage involves interchanging nonbasic w i or v i with the basic w i or v i . The basic vectors corresponding to c j and d j are not allowed to leave the basis. The algorithm terminates when all marginal costs are nonpositive (see, Kennedy and Gentle (1980)). Fortran code for this procedure is given by Barrodale and Roberts (1974). Peters and Willms (1983) give algorithms accompanying with computer codes for up-and-down dating the solution of the problem when a column or row inserted to or deleted from X, or y is changed. These algorithms are all based on Barrodale and Roberts (1973,74) procedure.
Abdelmalek (1974) describes a dual simplex algorithm for the L 1 norm problem with no use of artificial variables. For this algorithm, the Haar condition (see, Osborne (1985), Moroney (1961)) need not be satisfied anymore. This algorithm seemed to be very efficient at the time of publication. An improved dual simplex algorithm for L 1 norm approximation is proposed by Abdelmalek (1975a). In this algorithm, certain intermediate iterations are skipped, and in the case of illconditioned problems, the basis matrix can lend itself to triangular factorization and thus ensure a stable solution. Abdelmalek (1980a) improves his previous algorithm by using triangular decomposition. A Fortran translation of the algorithm is given by Abdelmalek (1980b). Sposito and McCormick and Kennedy (1975) summarize much of the works on L 1 norm estimation including problem statement, linear programming formulation, efficient computational algorithms, and properties of the estimators. Armstrong and Kung (1978) propose an algorithm for a simple two-parameter L 1 norm regression. The method is a specification of linear programming of Barrodale and Roberts (1973) algorithm. A Fortran code is given too. Armstrong and Frome and Kung (1979) use LU (Lower-Upper triangular) decomposition of Bartels and Golub (1969) in maintaining the current basis on the revised simplex procedure. A Fortran translation is also enclosed. Armstrong and Godfrey (1979) show that the primal method of Barrodale and Roberts (1973) and the dual method of Abdelmalek (1975a) are essentially equivalent. With a given initial basis for the two methods, they show that both algorithms will generate corresponding bases at each iteration. The only difference is the choice of initial basis and heuristic rules for breaking ties. Armstrong and Kung (1982b) present a dual linear programming formulation for the problem. Various basis entry and initialization procedures are considered. It has been shown that the dual approach is superior to primal one if a good dual feasible solution is readily available (see also, Steiger (1980)). Banks and Taylor (1980) suggest a modification of Barrodale and Roberts (1973) algorithm. The objective function is altered to include magnitudes of the elements of both errors and solution vectors. For a general discussion on simplex for piecewise linear programming see Fourer (1985a,b) and for a survey of the corresponding problem on the L 1 norm see Fourer (1986). Narula and Wellington (1987) propose an efficient linear programming algorithm to solve both L 1 and L ¥ norms linear multiple regressions. The algorithm exploits the special structure and similarities between the two problems. Brennan and Seiford (1987) develop a geometrical interpretation of linear programming in L 1 norm regression. They give a geometric insight into the solving process in the space of observations. McConnell (1987) shows how the method of vanishing Jacobians which has been used to optimize quadratic programming problems can also be used to solve the special linear programming problem associated with computing linear discrete L 1 norm approximation. For the possibility of applying other types of linear programming solutions such as Karmarkar solution to L 1 norm problem see Meketon (1986).

Other algorithms
This category consists of algorithms which were not classified in the two last sections.
Rice (1964c) applies the bisection method to L 1 norm regression. In this method at each step, the domain of S is broken to two segments, and the appropriate segment is selected for the next iteration. The solution is reached when the last segment is less than a predetermined small value. Abdelmalek (1971) develops an algorithm for fitting functions to discrete data points and solving the overdetermined system of linear equations. The procedure is based on determining L 1 norm solution as the limiting case of L p norm approximation when p tends to one from right in the limit. This technique thus obtains a solution to a linear problem by solving a sequence of nonlinear problems. Schlossmacher (1973) computed the L 1 norm estimates of regression parameters by an iterative weighted least squares procedure. Instead of minimizing the sum of absolute deviations, he minimized he sum of weighted squared errors with 1/│u i │ as weights. Once the least squares is applied to the problem and residuals are computed. The absolute value of the inverse of the residuals are again used as corresponding weights in the next iteration for minimizing the sum of weighted squared errors (see also, Holland and Welsh (1977)). Fair (1974) observed that the estimated values of ß did not change after the second or third iterations. In cases where any residual is zero, the continuation of the procedure is impossible, because the corresponding weight to this residual is infinite. This problem is also discussed by Sposito and Kennedy and Gentle (1977), Soliman and Christensen and Rouhi (1988). Absolute convergence of this algorithm has not been proved, but the non-convergent experiment has not been reported. Soliman and Christensen and Rouhi (1988) used left pseudoinverse (see, Dhrymes (1978) for a description of this inverse) to solve the general linear L 1 norm regression. According to this procedure, one should calculate the least squares solution using the left pseudoinverse or least squares approximation as ß^=(X T X) -1 X T y. Calculate the residuals as u=│y-Xß^│; where u is nx1 column vector. Select the m observations with the smallest absolute values of the residuals and partition the matrices as the selected observations locate on the top, , , .
Solve y^=X^ß^ for the top partitions as ß^=X^-1 y. Although this procedure is operationally simple, its solution is not the same as other exact methods, and no proof is presented to show that the solution is in the neighborhood of the exact solution of the L 1 norm minimization problem.