Robust Regression with Continuous and Categorical Variables Having Heteroscedastic Non-Normal Errors

The performance of the classical Ordinary Least Squares (OLS) method can be very poor when a data set that is assumed to be normally distributed in fact has a heavy-tailed distribution, which may arise as a result of outliers. The problem is further complicated when the variances of the error terms a...

Bibliographic Details
Main Author: Majeed Al-Talib, Bashar Abdul Aziz
Format: Thesis
Language: English
Published: 2006
Subjects: Regression analysis; Heteroscedasticity; Analysis of variance
Online Access: http://psasir.upm.edu.my/id/eprint/546/1/600391_fs_2006_52_abstrak_je__dh_pdf_.pdf
id my-upm-ir.546
record_format uketd_dc
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
language English
topic Regression analysis
Heteroscedasticity
Analysis of variance
description The performance of the classical Ordinary Least Squares (OLS) method can be very poor when a data set that is assumed to be normally distributed in fact has a heavy-tailed distribution, which may arise as a result of outliers. The problem is further complicated when the variances of the error terms are not constant. In this thesis, a Reweighted Least Squares method based on Robust Distance and a combination of the S- and M-estimates (called S/M estimates) is proposed to overcome the problem of outliers; the resulting estimates are called RLSRDSM estimates. The RLSRDSM estimates are proposed for estimating the parameters of a regression model with both continuous and categorical variables. The Robust Distance S/M estimates are computed in three stages: in the first stage the Minimum Volume Ellipsoid (MVE) estimator is computed to identify leverage points, and in the second and third stages the weighted S/M weights and scale are calculated, respectively. In many applications, one may encounter errors that are both heteroscedastic and not normally distributed. Therefore, a weighted RLSRDSM (WRLSRDSM) is also proposed to remedy these two problems simultaneously. This method first computes residual scale estimates for each level of the categorical variables, based on the RDSM residuals. A weighting scheme is then developed and incorporated into the model. In addition to RLSRDSM and WRLSRDSM, another estimator, referred to as the 2D-RDLS procedure, which uses a two-dimensional weighting scheme, is also proposed. However, the 2D-RDLS estimates do not perform as well as the RLSRDSM estimates and are therefore seldom referred to in the discussion.
A number of numerical examples and simulation studies were performed to compare the robustness of RLSRDSM and WRLSRDSM with some existing methods in regression models with both continuous and categorical regressors. Data with various levels of outlier contamination were simulated and analyzed. The design parameters varied were the sample size (n = 20, 50, 100, 300, and 500), the number of continuous regressors (p = 1, 3, and 5) and categorical regressors (q = 1, 4), the outlier density (0%, 5%, 10%, 20%, 30%, 40%, and 50%), and the error distribution (N(0,0.25), N(0,0.5), N(0,1), N(0,2), N(0,3), N(0,4), t(3), and EXP(1)). The criteria used to measure the performance of the regression methods are p-values, residual scale, %, and % for the real data analysis, and the Root Mean Square Error (RMSE) over all simulation replications, which summarizes the variance and bias, for the simulated data. The results indicate that the OLS estimators are very sensitive to the presence of outliers and heteroscedastic errors. In the presence of outliers, RLSRDSM and RLSRDL1 are better than OLS, producing estimates that are robust to such data points. RLSRDSM is slightly better than RLSRDL1, and sometimes their performances are indistinguishable. Nonetheless, RLSRDL1 posed certain computational problems, such as producing degenerate solutions or singular matrices. An advantage of RLSRDSM is that it has no such computational problems. WRLSRDSM performs better than WLSRDL1 when outliers and heteroscedasticity occur together.
To support the numerical findings, bootstrap simulation procedures and visual analysis were also carried out to justify that RLSRDSM is the most robust estimator compared with OLS and RLSRDL1, on the grounds that this estimator yields robust and stable results in the presence of outliers, in combined models with continuous and categorical variables, and even under heteroscedasticity. The results are in close agreement with the earlier conclusion.
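The RMSE criterion described above can be illustrated with a minimal Python sketch. It is not the thesis's procedure: RLSRDSM is not implemented here, and the Theil-Sen median-of-slopes estimator is swapped in as a simple robust stand-in, with a single continuous regressor and no categorical variables. The function names (`ols_slope`, `theil_sen_slope`, `simulate`, `rmse_summary`, `run_study`), the true model y = 1 + 2x, the outlier shift of 30, and the seed are all assumptions made for illustration.

```python
import math
import random
import statistics


def ols_slope(x, y):
    """Closed-form OLS slope for simple linear regression."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


def theil_sen_slope(x, y):
    """Median of all pairwise slopes (robust stand-in, NOT the thesis's RLSRDSM)."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i in range(len(x))
        for j in range(i + 1, len(x))
        if x[j] != x[i]
    ]
    return statistics.median(slopes)


def simulate(n, outlier_fraction, rng, true_slope=2.0, shift=30.0):
    """Draw y = 1 + true_slope * x + N(0, 1) noise, then shift a fraction of the y values."""
    x = [rng.uniform(0.0, 10.0) for _ in range(n)]
    y = [1.0 + true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]
    n_outliers = int(round(outlier_fraction * n))
    for i in rng.sample(range(n), n_outliers):
        y[i] += shift  # hypothetical contamination: vertical outliers
    return x, y


def rmse_summary(estimates, truth):
    """Bias, variance, and RMSE of a list of estimates around the true value."""
    estimates = list(estimates)
    bias = statistics.fmean(estimates) - truth
    variance = statistics.pvariance(estimates)
    rmse = math.sqrt(statistics.fmean([(e - truth) ** 2 for e in estimates]))
    return {"bias": bias, "variance": variance, "rmse": rmse}


def run_study(replications=100, n=30, outlier_fraction=0.1, seed=1, true_slope=2.0):
    """Repeat the simulation and summarize each estimator over all replications."""
    rng = random.Random(seed)
    ols, robust = [], []
    for _ in range(replications):
        x, y = simulate(n, outlier_fraction, rng, true_slope)
        ols.append(ols_slope(x, y))
        robust.append(theil_sen_slope(x, y))
    return {
        "ols": rmse_summary(ols, true_slope),
        "theil_sen": rmse_summary(robust, true_slope),
    }


if __name__ == "__main__":
    for method, summary in run_study().items():
        print(method, summary)
```

Calling `run_study(replications=100, n=30, outlier_fraction=0.1)` returns the bias, variance, and RMSE of each slope estimator. For each method the squared RMSE equals the squared bias plus the variance across replications, which is the sense in which RMSE summarizes both quantities.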
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Majeed Al-Talib, Bashar Abdul Aziz
title Robust Regression with Continuous and Categorical Variables Having Heteroscedastic Non-Normal Errors
granting_institution Universiti Putra Malaysia
granting_department Faculty of Science
publishDate 2006
url http://psasir.upm.edu.my/id/eprint/546/1/600391_fs_2006_52_abstrak_je__dh_pdf_.pdf
spelling my-upm-ir.546 2013-05-27T06:49:10Z 2006-08 Thesis http://psasir.upm.edu.my/id/eprint/546/ application/pdf en public phd doctoral