Comparative Study of the L1 Norm Regression Algorithms

This paper compares the accuracy and efficiency of selected L1 norm regression algorithms. Other comparative studies are reviewed and their conclusions discussed. Extensive experiments were performed to evaluate the relative efficiency and accuracy of the selected algorithms.
 
 


Introduction
The objective of this paper is to compare some of the existing algorithms for L1 norm regression with those proposed by Bidabad (1989a,b). Our aim is to compare their accuracy and relative efficiency, and in this respect the accuracy of an algorithm's solution is more important than the other criteria. By accuracy we mean reaching the correct solution in a finite number of steps or iterations; by efficiency we mean that the algorithm requires less storage and execution time to reach the accurate optimal solution.
Generally, the comparison of algorithms is not a straightforward task. As indicated by Dutter (1977), factors such as the quality of the computer codes and the computing environment should be considered. In the case of L1 norm algorithms, three factors are of particular importance: the number of observations, the number of parameters, and the condition of the data. Kennedy, Gentle, and Sposito (1977a,b) and Hoffman and Shier (1980a,b) describe methods for generating random test data with known L1 norm solution vectors. Gilsinn et al. (1977) discuss a general methodology for comparing L1 norm algorithms. Kennedy and Gentle (1977) examine the rounding error of L1 norm regression and present two techniques for detecting inaccuracies in the computation (see also Larson and Sameh (1980)).
Many authors have compared their own algorithms with those already proposed. Table 1 summarizes the characteristics of the algorithms proposed by different authors. Since the computing environments and the data conditions (with respect to the distribution of the regression errors) underlying the algorithms of table 1 are not the same, definitive conclusions should not be drawn from this table. Armstrong and Frome (1976a) compare the iterative weighted least squares of Schlossmacher (1973) with the Barrodale and Roberts (1973) algorithm; the latter proved clearly superior. Anderson and Steiger (1980) compare the algorithms of Bloomfield and Steiger (1980) (BS), Bartels, Conn, and Sinclair (1978) (BCS), and Barrodale and Roberts (1973) (BR). They concluded that as the number of observations n increases, BR falls into a different complexity class than BCS and BS. All three algorithms are linear in the number of parameters m; BS is less complex than BCS, and the complexities of BS and BCS are linear in n. There is a slight tendency for all algorithms to work proportionately harder for even m than for odd m. BR and BS had the most difficulty with the normal error distribution and the least difficulty with the Pareto distribution with density parameter equal to 1.2. Gentle, Narula, and Sposito (1987) perform a full comparison among some of the L1 norm algorithms, limited to the codes that are openly available for unconstrained L1 norm linear regression. Table 2 shows the required array storage and stopping constants of the corresponding algorithms. In their study, the problem consists of uniform (0,1) random values for X and normal (0,3) variates for the random error term; the dependent variable y was computed as the sum of the independent variables and the error term. The results are summarized in tables 3 and 4 for simple and multiple regression, respectively.
Values in the cells are average CPU times over 100 replications, and the values in parentheses are the corresponding maximum CPU times over those replications. Gentle, Sposito, and Narula (1988) also compare algorithms for unconstrained simple linear L1 norm regression. That investigation is essentially an extract of Gentle, Narula, and Sposito (1987), and its results are essentially the same.
They concluded that the BS program performs quite well on smaller problems but, because of accumulated round-off error, fails to produce correct answers on larger ones. The Wesolowsky program was not usable and was dropped from their study. Because the superiority of AFK over BR, and of AK over S, had been indicated in previous studies, the BR and S algorithms were not included. Considering all aspects, they concluded that AFK seems to be the best.
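The iterative weighted least squares approach of Schlossmacher (1973) mentioned above can be sketched in a few lines. This is a generic illustration of the reweighting scheme, not Schlossmacher's original code; the stopping tolerance and the guard against near-zero residuals are assumptions of this sketch.

```python
import numpy as np

def irls_l1(X, y, n_iter=50, eps=1e-8):
    """Iteratively reweighted least squares for L1 regression
    (the scheme attributed to Schlossmacher (1973)).

    Each step solves a weighted least-squares problem with weights
    equal to the reciprocals of the current absolute residuals, so
    the iterates move toward the L1 (least absolute deviations) fit."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary LS start
    for _ in range(n_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), eps)  # guard near-zero residuals
        WX = X * w[:, None]
        # Solve the weighted normal equations X' W X beta = X' W y
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ y)
        if np.max(np.abs(beta_new - beta)) < 1e-10:
            return beta_new
        beta = beta_new
    return beta
```

For an intercept-only model this converges to the sample median, which is the L1 fit of a constant; the accumulated round-off and slow convergence near zero residuals are exactly the weaknesses the comparative studies above point out.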

Design of experiments
The performance of an algorithm differs across computing environments, which makes absolute comparison of algorithms very difficult, especially if the system uses virtual or real storage, a cache, an array processor, or a mathematical co-processor. As discussed by Bidabad (1989a,b), many algorithms with corresponding computer programs exist for L1 norm regression, and comparing all of them would be very costly. To reduce the number of experiments, we rely on the experience of the previous researchers discussed above. The experiments are divided into two general categories: simple and multiple linear L1 norm regression.
Apart from the computer codes themselves, the computing environment, the numbers of observations and parameters of the model, and the "condition" of the data are the major factors affecting the performance of the algorithms. Thus problems of different sizes are tested in this section.
There are many criteria for judging the superiority of algorithms; accuracy and efficiency are the basic ones. For the former, we are concerned with obtaining the true results in different samples; for the latter, the computation time and storage requirements of the algorithms are compared.
To perform the experiments, uniform random values were first selected for the β_j in the following model:

    y_i = Σ_{j=1}^{m} β_j x_{ij} + u_i,    i = 1,...,n    (1)

Random values were then generated for x_ij and u_i under five distributional specifications. The uniform and normal random generators of Mojarrad (1977) were used to generate three uniform and two normal sets of random data for each experiment. The uniform deviates belong to the intervals [-10,10], [-100,100], and [-1000,1000]; the normal deviates have zero mean and variances 100 and 1000. Values of y_i were computed from the β_j, x_ij, and u_i generated as explained above. The values 20, 50, 100, 500, 1000, 2000, 5000, and 10000 were used for the number of observations n, and the values 2, 3, 4, 5, 7, and 10 for the number of parameters m.
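The generation of one test problem under this design can be sketched as follows. NumPy's generator stands in for the Mojarrad (1977) routines used in the original study, and the function and specification names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(12345)

def make_problem(n, m, spec, beta=None):
    """Generate one test problem for model (1):
    y_i = sum_j beta_j * x_ij + u_i,  i = 1,...,n.

    `spec` selects one of the five distributions described in the text:
    uniform on [-10,10], [-100,100], or [-1000,1000], or normal with
    mean 0 and variance 100 or 1000."""
    if beta is None:
        beta = rng.uniform(-10, 10, size=m)  # uniform random coefficients
    samplers = {
        "u10":   lambda size: rng.uniform(-10, 10, size=size),
        "u100":  lambda size: rng.uniform(-100, 100, size=size),
        "u1000": lambda size: rng.uniform(-1000, 1000, size=size),
        "n100":  lambda size: rng.normal(0.0, np.sqrt(100.0), size=size),
        "n1000": lambda size: rng.normal(0.0, np.sqrt(1000.0), size=size),
    }
    draw = samplers[spec]
    X = draw((n, m))   # regressors x_ij
    u = draw(n)        # error term u_i
    y = X @ beta + u   # dependent variable per model (1)
    return X, y, beta
```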
Hence, for each of the five distributional specifications of u_i, and for each m and n, a replication is performed for each of the selected algorithms. The average and range of these five replications are reported for each m and n for each algorithm. In the case of simple regression, the number of replications is ten rather than five.
The programs were all compiled with the Fortran IV VS compiler, level 1.3.0 (May 1983), LANGLVL(77), with optimization level 3 to reduce coding inefficiencies, and were run on a BASF 7.68 computer under MVS. Since this machine is a multitasking system, the swapping process affects the execution time; when the system is running more than one job, this effect cannot be measured and removed completely. To filter out swapping time, the Service Request Block (SRB) time was subtracted from the total Central Processing Unit (CPU) time, although when the system is busy this may not account for all of the swapping time. All comparable algorithms were run simultaneously, in one input class, with enough initiators and the same priority level, to create similar conditions for all comparable submitted jobs. The pre-execution times of compilation and linkage editing are excluded for all tested programs.
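The measurement discipline described above — CPU time net of system overhead, averaged over replications with the range reported — can be mirrored on a modern system with process CPU time, which excludes time spent swapped out or waiting. The harness below is a sketch of that protocol, not the original MVS accounting; the function name and replication count are assumptions.

```python
import time

def cpu_time(fn, *args, reps=5):
    """Time `fn` using process CPU time rather than wall-clock time,
    so time spent waiting or swapped out does not accumulate
    (analogous to subtracting SRB time from total CPU time).

    Returns (average, minimum, maximum) over `reps` replications,
    matching the cell format of the paper's tables."""
    times = []
    for _ in range(reps):
        t0 = time.process_time()
        fn(*args)
        times.append(time.process_time() - t0)
    return sum(times) / reps, min(times), max(times)
```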

Comparison of the simple regression L1 norm algorithms
In this study, comparisons are limited to algorithm 2 of Bidabad (1989a,b) and the algorithm proposed by Josvanger and Sposito (1983) (JS). Gentle, Narula, and Sposito (1987) and Gentle, Sposito, and Narula (1988) identified the latter as the most efficient algorithm for simple linear L1 norm regression. The array storage requirements of the two programs are shown in table 5, which may be compared with table 2 for the other algorithms. Neither program destroys the input data, and both have been coded in single precision. Table 6 shows the results of the experiments for simple linear L1 norm regression. The values reported in the cells are average CPU times in seconds over ten replications with different random samples; the values in parentheses are the corresponding minimum and maximum CPU times of the ten runs. Both algorithms converged and gave accurate results for all of the experiments.
As is clear from table 6, in small samples the computation times are not very different, though algorithm 2 is faster. In medium samples the difference becomes significant, and in larger samples algorithm 2 becomes strongly superior to the algorithm of Josvanger and Sposito (1983). It can thus be concluded that algorithm 2 performs better and may be used in applied work to achieve greater efficiency.
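Neither of the compared codes is reproduced here, but the problem they solve rests on a classical property: a simple L1 regression line can always be chosen to pass through at least two data points. The sketch below uses that property to find an exact solution by brute force over all pairs of points; it is an O(n³) illustration of the optimality condition, not the fast descent methods actually compared above.

```python
import numpy as np
from itertools import combinations

def l1_line(x, y):
    """Exact simple L1 regression y ≈ a + b·x by brute force.

    An L1-optimal line interpolates at least two observations, so for
    small n it suffices to test every line through a pair of points
    and keep the one with the smallest sum of absolute deviations."""
    best_sad, best_a, best_b = np.inf, 0.0, 0.0
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line: cannot be written as a + b*x
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        sad = np.sum(np.abs(y - (a + b * x)))  # sum of absolute deviations
        if sad < best_sad:
            best_sad, best_a, best_b = sad, a, b
    return best_a, best_b
```

The practical algorithms (JS and algorithm 2) reach the same interpolating line without enumerating all pairs, which is what the timing comparison in table 6 measures.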

Comparison of the multiple regression L1 norm algorithms
To compare algorithm 4 of Bidabad (1989a,b) with other algorithms, the experiments have been limited to the three algorithms that are most accurate and efficient among the others: those of Barrodale and Roberts (1973,74) (BR), Bloomfield and Steiger (1980) (BS), and Armstrong, Frome, and Kung (1979) (AFK). Although BS and AFK are faster than BR, BR was selected because the other two algorithms fail to produce correct answers for larger samples (see Gentle, Narula, and Sposito (1987)).
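The BR algorithm is a modified simplex method applied to the linear-programming formulation of the L1 problem. The same formulation can be sketched with a general-purpose LP solver; the use of SciPy's `linprog` here is an assumption of availability for illustration, not the BR code itself, whose specialized pivoting is what makes it competitive.

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit_lp(X, y):
    """Multiple L1 regression via its linear-programming form,
    the formulation underlying Barrodale-Roberts-type algorithms:

        minimize   sum(e_plus + e_minus)
        subject to X·beta + e_plus - e_minus = y,
                   e_plus >= 0, e_minus >= 0, beta free.

    The positive/negative residual parts e_plus, e_minus split each
    absolute deviation, so the objective equals the sum of |residuals|."""
    n, m = X.shape
    # Variable order: [beta (m, free), e_plus (n), e_minus (n)]
    c = np.concatenate([np.zeros(m), np.ones(2 * n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])
    bounds = [(None, None)] * m + [(0, None)] * (2 * n)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:m]
```

A generic solver carries the full 2n + m variables explicitly, which is exactly the overhead that BR's condensed tableau and multiple-pivot rules avoid.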
The array storage requirements of these programs are indicated in table 7, which may be compared with table 2 for the other algorithms. All programs have been coded in single precision, and none destroys the input data. Tables 8 through 12 report average CPU times over five runs for different sample sizes and numbers of parameters; the values in parentheses are the minimum and maximum CPU times of the replications.

For the three-parameter model (table 8), algorithm 4 is superior to the other algorithms, followed by BS, AFK, and BR in decreasing order of efficiency. When the sample size is small, the differences are not large; in medium sample sizes they increase. In the largest experiments, algorithm 4 and BS differ only slightly, while BR and AFK are far behind. In all cases algorithm 4 is the fastest. For the four-parameter model (table 9), although BS competes with algorithm 4, the ordering remains unchanged and algorithm 4 is again the most efficient; the ranking of the algorithms is the same as in the three-parameter experiments for small, medium, and large sample sizes alike.

When the number of parameters is increased to five, BS fails to produce correct answers for sample sizes of 2000 and more. Gentle, Narula, and Sposito (1987) also reported failures of BS for sample sizes of 1000 and greater with five or more parameters, and for a sample size of 5000 with two parameters. Table 10 shows the efficiency of algorithm 4 relative to the others, given the failure of BS; AFK and BR occupy the next positions, respectively. For smaller sample sizes, BR, BS, and AFK are competitive and the differences are very small. In the larger sample sizes, algorithm 4 becomes strictly superior to the other algorithms.
In table 11, with seven parameters, BS fails to compute correct answers for sample sizes of 1000 and more. AFK is the best for smaller samples, but for large samples algorithm 4 is again superior, with BR in third position.
In table 12, with ten parameters, BS and AFK fail to compute correct answers for the larger sample sizes. Among the algorithms that remain accurate, BR is the most efficient. Algorithm 4 is in second position in both computing time and accuracy, except for the sample size of 10000, where algorithm 4 is the most efficient.

Conclusions
Since in computational algorithms accuracy is more important than efficiency, one should select those L1 norm algorithms that produce correct solutions and, among them, the fastest one. Algorithm 2 and the algorithm of Josvanger and Sposito (1983) both computed correct answers for the two-parameter linear L1 norm regression model; algorithm 2, which is faster than JS, is recommended for applied work.
For multiple regression, the BS and AFK algorithms failed to compute correct answers for the larger models. As stated by Gentle, Narula, and Sposito (1987), because of accumulated round-off error, the algorithm of Bloomfield and Steiger (1980) was not usable for larger problems. Coding to avoid rounding problems often increases execution time, so it is not clear what would happen to the relative efficiencies if the BS code were modified. The same holds for the algorithm of Armstrong, Frome, and Kung (1979), though it is less sensitive to rounding error than BS. From the preceding tables it may be concluded that algorithm 4 is more appropriate for models with fewer than ten parameters, and the algorithm of Barrodale and Roberts (1973,74) for the ten-parameter model. This last conclusion should be qualified, since for the ten-parameter model with 10000 observations algorithm 4 is highly superior to BR; however, because applied work does not always involve very large numbers of observations and parameters, the qualification is of limited operational importance.