COMPARATIVE STUDY OF SOME ESTIMATORS OF LINEAR REGRESSION MODELS IN THE PRESENCE OF OUTLIERS

The paper examined the performance of five estimation methods using six outlier percentages (0%, 5%, 10%, 20%, 30% and 40%) and five sample sizes (20, 40, 60, 100 and 200); the different sample sizes were used to investigate the effect of sample size on the performance of each estimation method. The study adopted absolute bias, variance, relative efficiency and root mean square error as comparison criteria in a Monte-Carlo experiment, and real-life data were used to validate the simulation results. The study found that under the 5%, 10%, 20% and 30% outlier conditions, Robust-MM is the most preferred estimator across all criteria and sample sizes, except under the relative efficiency criterion and, under absolute bias, at sample sizes 40, 200 and 200 for the 5%, 20% and 30% conditions respectively, while Robust-LTS is the least preferred estimator, except at sample sizes 40 and 20, 40 and 20, and 20 and 200 under the 5%, 20% and 30% conditions using absolute bias, variance and root mean square error respectively. Under the 40% outlier condition, Robust-MM is the most preferred estimator across all criteria and sample sizes, except under relative efficiency and at sample size 20. Furthermore, Robust-MM is the most consistent estimator across the comparison criteria except under relative efficiency, and sample size has little or no effect on the performance of the estimators at all outlier levels. The R statistical package was used for the data analysis. The study therefore recommends the use of the Robust-MM estimator.


INTRODUCTION
Regression analysis is defined as a statistical technique for analyzing and modeling the relationship between a dependent variable and one or more independent variables. The technique uses a mathematical equation to establish the relationship between variables. It is a predictive modeling technique used for forecasting and for finding cause-and-effect relationships between variables. An outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter are sometimes excluded from the dataset. An outlier can cause serious problems in statistical analysis (Zimek and Filzmoser, 2018). Outlier detection has many applications, such as data cleaning, fraud detection and network intrusion. The existence of outliers can indicate individuals or groups whose behavior is very different from most of the dataset. Frequently, outliers are removed to improve the accuracy of the estimators, but sometimes the presence of an outlier carries a certain meaning, whose explanation may be lost if the outlier is deleted (Hawkins, 1980).

The Classical Linear Regression Model
The classical linear regression model is a statistical model that describes a data-generation process. It takes the form

Y = Xβ + ε (1)

where Y is an n × 1 vector of observed response values, X is the n × p matrix of predictor variables, β is the p × 1 vector containing the unknown parameters to be estimated, and ε is the n × 1 vector of random error terms.

Assumptions of classical linear regression model
(1) The dependent variable is linearly related to the coefficients of the model and the model is correctly specified, i.e. Y = Xβ + ε.
(2) The independent variable(s) are uncorrelated with the equation error term, i.e. Cov(X, ε) = 0.
(3) The mean of the error term is zero, i.e. E(ε) = 0.
(4) The error term has a constant variance (homoscedastic errors), i.e. Var(ε) = σ²I.
(5) The error terms are uncorrelated with each other (no autocorrelation or serial correlation), i.e. Cov(εi, εj) = 0 for all i ≠ j.
(6) There is no perfect linear relationship between the independent variables.
(7) The error term is normally distributed.
(8) There are no outliers in the dataset.

Applying the Ordinary Least Squares Estimator (OLSE) in simple or multiple linear regression always calls for some assumptions: normality of the error terms, equal variance of the error terms, and absence of outliers, leverage points and multicollinearity. According to Hampel (2001) and Huber (1972), the normality of the error distribution finds its basis in the central limit theorem, which is a limit theorem based on approximations. Additionally, outliers in the dependent variable lead to large residual values, which in turn cause the normality assumption of the error terms to fail. Ordinary least squares is therefore the best estimation method in regression analysis if the assumptions are met; however, if these assumptions are not satisfied, the results can easily be affected (Alma, 2011). One of the first steps towards obtaining a coherent analysis is the detection of outlying observations. Although outliers are often considered as error or noise, they may carry important information. Detected outliers are candidates for aberrant data that may otherwise lead to model misspecification, biased parameter estimation and incorrect results. It is therefore important to identify them prior to modeling and analysis (Williams et al., 2002; Liu et al., 2004).
Among the methods used to detect the presence of outliers are graphical methods such as scatter plots. When the assumptions of linear regression are not met, robust regression is an important estimation technique for analyzing data that are contaminated with outliers or that have non-normal error terms. Many estimation methods, such as the Least Trimmed Squares Estimator (LTSE), M-Estimator (ME), S-Estimator (SE) and Modified Maximum Likelihood Estimator (MMLE), have been proposed that are more efficient than Ordinary Least Squares (OLS) in the presence of outliers. In regression analysis, the ordinary least squares method works well if the assumptions about the regression model, the variables and the error terms are met. However, the presence of outliers or failure of the assumptions renders ordinary least squares unreliable, because bad leverage points, vertical outliers and good leverage points can influence the coefficients of the model, the residuals, and the standard errors of the model and the coefficients (David, 2014). Several estimation procedures have been proposed in the literature to handle the problem of outliers during parameter estimation. This paper therefore examined the performance of five robust estimators under different percentages of outliers with varying sample sizes (20, 40, 60, 100 and 200) and identified the most preferred estimator under each of the selected comparison criteria.
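The sensitivity of OLS to even a single vertical outlier, which motivates the robust estimators studied here, can be illustrated with a short sketch. This is a hedged illustration in Python (the paper's own analysis used R); the data and coefficient values are invented for the demonstration:

```python
# Illustration (not from the paper): a single vertical outlier in y can
# pull the OLS slope far from the true value, motivating robust estimators.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = np.linspace(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1, n)   # true model: y = 2 + 3x + e
X = np.column_stack([np.ones(n), x])

def ols(X, y):
    """Ordinary least squares via the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_clean = ols(X, y)        # slope estimate close to the true value 3

y_out = y.copy()
y_out[-1] += 200.0            # contaminate a single response value
beta_contam = ols(X, y_out)   # slope is pulled far away from 3
```

On this data the contaminated slope estimate is several units away from the true slope, even though only one of thirty observations was altered.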

Least Absolute Deviation (LAD)
Instead of minimizing the sum of squared errors, the LAD estimator minimizes the sum of the absolute values of the errors. The LAD method is less sensitive to outliers in the response and produces robust estimates (DasGupta and Mishra, 2004):

min_β Σᵢ |eᵢ| (2)

β̂_LAD = arg min_β Σᵢ |yᵢ − xᵢ'β| (3)

M-Estimator (ME)
One of the robust regression estimation methods is M-estimation. M-estimation is an estimation of the maximum likelihood type: it extends the maximum likelihood method by replacing the squared-error loss with a less rapidly increasing function of the residuals, yielding a robust estimator.
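In practice the M-estimate is usually computed by iteratively reweighted least squares (IRLS). The following is a minimal sketch in Python with Huber's weight function (the tuning constant c = 1.345 is a standard default, not a value from the paper, whose analysis used R):

```python
# Minimal IRLS sketch for Huber M-estimation (illustrative, not the
# paper's R code). X is a design matrix that includes an intercept column.
import numpy as np

def huber_weights(r, c=1.345):
    """Huber weights: 1 for small standardized residuals, c/|r| beyond c."""
    a = np.abs(r)
    w = np.ones_like(a)
    big = a > c
    w[big] = c / a[big]
    return w

def m_estimate(X, y, c=1.345, tol=1e-8, max_iter=100):
    beta = np.linalg.solve(X.T @ X, X.T @ y)          # OLS starting value
    for _ in range(max_iter):
        resid = y - X @ beta
        # robust scale estimate via the median absolute deviation (MAD)
        scale = max(1.4826 * np.median(np.abs(resid - np.median(resid))), 1e-12)
        w = huber_weights(resid / scale, c)
        Xw = X * w[:, None]                            # weighted design
        beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y) # weighted LS step
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On data with a moderate fraction of vertical outliers, the Huber fit stays close to the true coefficients while the OLS fit does not.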
β̂_M = arg min_β Σᵢ ρ(eᵢ/σ̂) (4)

where ρ is a robust loss function and σ̂ is a robust estimate of the error scale. Rousseeuw (1984) developed the least trimmed squares estimator (LTSE), given by

β̂_LTS = arg min_β Σᵢ₌₁ʰ e²₍ᵢ₎(β) (5)

where e²₍₁₎(β) ≤ … ≤ e²₍ₙ₎(β) are the ordered squared residuals, h = [n(1 − α)] + 1 is the number of observations included in the calculation of the estimator, and α is the proportion of trimming that is performed. Using α = 0.5 ensures that the estimator has a breakdown point of 50%.
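The LTS objective can be approximated with random elemental starts followed by concentration ("C-") steps. The sketch below is a deliberate simplification of Rousseeuw and Van Driessen's FAST-LTS algorithm, written in Python for illustration (the paper used R):

```python
# Illustrative LTS sketch: random elemental starts plus concentration steps
# (a simplification of FAST-LTS, not a production implementation).
import numpy as np

def lts_estimate(X, y, alpha=0.5, n_starts=50, n_csteps=20, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    h = int(np.floor(n * (1 - alpha))) + 1       # h = [n(1 - alpha)] + 1
    best_beta, best_obj = None, np.inf
    for _ in range(n_starts):
        subset = rng.choice(n, size=p, replace=False)   # p-point start
        try:
            beta = np.linalg.solve(X[subset], y[subset])
        except np.linalg.LinAlgError:
            continue                                     # degenerate start
        for _ in range(n_csteps):
            r2 = (y - X @ beta) ** 2
            keep = np.argsort(r2)[:h]            # h smallest squared residuals
            beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        obj = np.sort((y - X @ beta) ** 2)[:h].sum()
        if obj < best_obj:
            best_obj, best_beta = obj, beta
    return best_beta
```

Each C-step refits on the h observations with the smallest squared residuals, which can only decrease the trimmed objective; taking the best of many random starts makes finding a clean subset very likely.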

S -Estimation (SE)
S-estimation is a high-breakdown method introduced by Rousseeuw and Yohai (1984) that minimizes the dispersion of the residuals. The S-estimator was introduced to address the low breakdown point of the M-estimators.
β̂_S = arg min_β ŝ(e₁(β), …, eₙ(β)) (6)

where the robust M-scale ŝ solves (1/n) Σᵢ ρ(eᵢ/ŝ) = K for a suitable constant K.

MM-Estimation (MME)
MM estimation is a special type of M-estimation developed by Yohai (1987). MM-estimation combines high-breakdown-value estimation with efficient estimation, and the MM estimator was the first estimator with both a high breakdown point and high efficiency under normal errors:

β̂_MM = arg min_β Σᵢ ρ₁(eᵢ/σ̂_S) (7)

where σ̂_S is the robust scale obtained from an initial S-estimate and ρ₁ is chosen to give high efficiency at the normal distribution.
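The MM idea, an efficient bisquare M-step run with the scale held fixed at a robust value from an initial high-breakdown fit, can be sketched as follows. This is a hedged Python illustration, not the paper's R code: the initial fit is passed in as beta0 (e.g. from an S or LTS fit), and the MAD stands in for a proper S-scale, which is a simplification:

```python
# Illustrative MM-estimation sketch: Tukey bisquare M-step with the error
# scale fixed at a robust value from an initial fit beta0.
import numpy as np

def bisquare_weights(r, c=4.685):
    """Tukey bisquare weights: smooth downweighting, exactly zero beyond c."""
    w = np.zeros_like(r)
    inside = np.abs(r) < c
    w[inside] = (1 - (r[inside] / c) ** 2) ** 2
    return w

def mm_estimate(X, y, beta0, c=4.685, tol=1e-8, max_iter=200):
    beta = np.asarray(beta0, dtype=float)
    resid0 = y - X @ beta
    # robust scale from the initial fit, held fixed during the M-step
    # (MAD used here as a stand-in for an S-scale)
    scale = 1.4826 * np.median(np.abs(resid0 - np.median(resid0)))
    for _ in range(max_iter):
        w = bisquare_weights((y - X @ beta) / scale, c)
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ Xw, X.T @ (w * y))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

Because the bisquare weight is exactly zero for very large standardized residuals, gross vertical outliers are excluded entirely from the final weighted fit, which is what gives the MM estimator its resistance while retaining high efficiency on the clean observations.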

MATERIALS AND METHOD

Data Generation and Model Formulation Procedure
The dataset used for this study was simulated by Monte-Carlo methods in the environment of the R statistical package (www.cran.org).

Mechanism for Generating the Independent Variables
In this study, three regressors were simulated: two were generated as standard normal variates, xₜᵢ ~ N(0, 1) for t = 1, 2, 3, …, n and i = 1, 2, while the third was simulated with different percentages of outliers.

Mechanism for Generating the Dependent Variable
The error term was simulated from a Gaussian mixture, i.e. ε ~ (1 − p) N(0, 1) + p N(0, 500), where p is the proportion of outliers. The response variable was then obtained from the linear relation yₜ = β₀ + β₁xₜ₁ + β₂xₜ₂ + β₃xₜ₃ + εₜ.
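This data-generating process can be sketched as follows. The sketch is in Python rather than the paper's R, and the true coefficient values, the exact contamination mechanism, and the reading of N(0, 500) as a variance of 500 are illustrative assumptions:

```python
# Hedged sketch of the data-generating process: three standard-normal
# regressors and a Gaussian-mixture error term with outlier proportion p_out.
import numpy as np

def simulate(n, p_out, beta=(1.0, 2.0, 3.0, 4.0), seed=None):
    """Simulate y = b0 + b1*x1 + b2*x2 + b3*x3 + e, where
    e ~ (1 - p_out) * N(0, 1) + p_out * N(0, 500) (500 read as a variance)."""
    rng = np.random.default_rng(seed)
    x1, x2, x3 = (rng.normal(0, 1, n) for _ in range(3))
    is_out = rng.random(n) < p_out              # which errors are outlying
    e = np.where(is_out,
                 rng.normal(0, np.sqrt(500), n),  # outlier component
                 rng.normal(0, 1, n))             # clean component
    b0, b1, b2, b3 = beta
    y = b0 + b1 * x1 + b2 * x2 + b3 * x3 + e
    X = np.column_stack([np.ones(n), x1, x2, x3])
    return X, y
```

A call such as simulate(200, 0.10, seed=0) then produces one replicate of the 10%-outlier condition at sample size 200.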

Data Simulation
A Monte-Carlo experiment of 1000 trials was carried out for five sample sizes (20, 40, 60, 100 and 200), each with different percentages of outliers (0%, 5%, 10%, 20%, 30% and 40%). Five robust estimators were used to estimate the parameters of models fitted to the simulated data. Real-life data were used to validate the findings from the simulation study.

Criteria for Evaluating the Estimators
The assessment of the estimators considered in this work was based on the following criteria.

Root Mean Square Error (RMSE)
Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). The formula is given by

RMSE = √[(1/n) Σᵢ (yᵢ − ŷᵢ)²] (8)

Absolute bias
The bias of an estimator β̂ of β is measured by

Bias(β̂) = E(β̂) − β (9)

The smaller the absolute bias of an estimator, the better.

Variance
The variance of an estimator β̂ of β is defined as

Var(β̂) = E[(β̂ − E(β̂))²] (10)

Relative efficiency of an estimator
The efficiency of an estimator is the ratio of the minimum possible variance to its actual variance, and the estimator is fully efficient when this ratio equals 1.
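These criteria can be computed from replicated Monte-Carlo estimates, as in the following hedged Python sketch (the OLS estimator and the clean data-generating process used here are simplified stand-ins for illustration, not the paper's setup):

```python
# Hedged sketch: computing absolute bias, variance and RMSE of a coefficient
# estimate over Monte-Carlo replicates (illustrative stand-in setup).
import numpy as np

def criteria(beta_hats, beta_true):
    """Absolute bias, variance and RMSE of replicated estimates of one
    coefficient. Relative efficiency is a ratio computed across estimators."""
    beta_hats = np.asarray(beta_hats)
    abs_bias = abs(beta_hats.mean() - beta_true)
    variance = beta_hats.var()
    rmse = np.sqrt(np.mean((beta_hats - beta_true) ** 2))
    return abs_bias, variance, rmse

# Monte-Carlo loop: replicate OLS slope estimates on clean simulated data
rng = np.random.default_rng(0)
slopes = []
for _ in range(1000):
    x = rng.normal(0, 1, 50)
    y = 1.0 + 2.0 * x + rng.normal(0, 1, 50)
    X = np.column_stack([np.ones(50), x])
    slopes.append(np.linalg.solve(X.T @ X, X.T @ y)[1])

abs_bias, variance, rmse = criteria(slopes, 2.0)

# Relative efficiency of estimator A against a benchmark B would then be
# Var(B) / Var(A); a value of 1 means A is as efficient as the benchmark.
```

For an unbiased estimator the RMSE is essentially the square root of the variance, which is why the two criteria often rank estimators similarly.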

In the real-life data ranking in Table 11, Robust-LTS ranked 4th and Robust-LAD ranked 2nd. Going by the variance criterion, the estimator that best fits the dataset, as evident from Table 11, is the Robust-LTS estimator. Similarly, Table 11 shows that the Robust-M estimator is the best estimator in terms of prediction.
Based on the results presented in Tables 1 through 8, the study found the following. Under the 0% outlier condition, using the absolute bias, variance and root mean square error criteria, Robust-M is the most preferred estimator while Robust-LTS is the least preferred at all sample sizes (20, 40, 60, 100 and 200). However, under the same condition, using the relative efficiency criterion, Robust-LTS is the most preferred estimator while Robust-M is the least preferred at all sample sizes. Under the 5% outlier condition, using the absolute bias, variance and root mean square error criteria, Robust-MM is the most preferred estimator except at sample size 40 under absolute bias, while Robust-LTS is the least preferred except at sample size 40 under absolute bias, sample sizes 20 and 40 under variance, and sample size 20 under root mean square error. Under the same condition, using relative efficiency, Robust-LTS is the most preferred estimator across all sample sizes, while Robust-MM is the least preferred except at sample size 200. Under the 10% outlier condition, using the absolute bias, variance and root mean square error criteria, Robust-MM is the most preferred estimator while Robust-LTS is the least preferred across all sample sizes. Using relative efficiency under this condition, Robust-LTS is the most preferred estimator while Robust-MM is the least preferred across all sample sizes except at one sample size. Under the 20% outlier condition, using the absolute bias, variance and root mean square error criteria, Robust-MM is the most preferred estimator across all sample sizes except at sample size 200 under absolute bias, while Robust-LTS is the least preferred across all sample sizes except at sample size 20.
Consequently, under the 20% outlier condition, using relative efficiency, Robust-LTS is the most preferred estimator except at sample size 20, while Robust-MM is the least preferred across all sample sizes. Under the 30% outlier condition, using the absolute bias, variance and root mean square error criteria, Robust-MM is the most preferred estimator across all sample sizes except at sample size 200 under absolute bias, while Robust-LTS is the least preferred across all sample sizes except at sample size 200 under root mean square error. Using relative efficiency under this condition, Robust-M is the most preferred estimator at sample sizes 20, 40 and 60, Robust-LTS is the most preferred at sample sizes 100 and 200, and Robust-MM is the least preferred across all sample sizes. Under the 40% outlier condition, using the absolute bias, variance and root mean square error criteria, Robust-MM is the most preferred estimator across all sample sizes except at sample size 20 under root mean square error, while Robust-M is the least preferred across all sample sizes. Using relative efficiency under this condition, Robust-M is the most preferred estimator except at sample size 100, while Robust-LTS is the least preferred at sample sizes 20 and 40 and Robust-MM at sample sizes 60 and 200. Overall, sample size has little or no effect on the performance of the estimators across all the different outlier levels.

CONCLUSION
In conclusion, the study finds that Robust-M is the most efficient estimator across all the comparison criteria in the absence of outliers, except under the relative efficiency criterion, and that Robust-MM is the most consistent estimator across the comparison criteria, again except under relative efficiency. Also, sample size has little or no effect on the performance of the estimators across all the different outlier levels.

©2022 This is an Open Access article distributed under the terms of the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is cited appropriately.