Robust techniques for linear regression with multicollinearity and outliers
Main Author: | |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2016 |
Subjects: | |
Online Access: | http://psasir.upm.edu.my/id/eprint/58669/1/IPM%202016%201IR%20D.pdf |
Summary:

The ordinary least squares (OLS) method is the most commonly used method in the multiple linear regression model because of its optimal properties and ease of computation. Unfortunately, in the presence of multicollinearity and outlying observations in a data set, the OLS estimate is inefficient, with inflated standard errors. Outlying observations can be classified into different types, such as vertical outliers, high leverage points (HLPs) and influential observations (IOs). It is crucial to identify HLPs and IOs because they have a large effect on various estimators and cause masking and swamping of outliers in multiple linear regression. The commonly used diagnostic measures fail to correctly identify these observations. Hence, a new improvised diagnostic robust generalized potential (IDRGP) is proposed. The proposed IDRGP is very successful in detecting multiple HLPs with smaller masking and swamping rates.

This thesis is also concerned with diagnostic measures for the identification of bad influential observations (BIOs). The detection of BIOs is very important because they are responsible for inaccurate predictions and invalid inferential statements, having a large impact on the computed values of various estimates. The generalized version of DFFITS (GDFF) was developed only to identify IOs, without taking into consideration whether they are good or bad influential observations. In addition, although GDFF can detect multiple IOs, it tends to detect fewer IOs than it should because of swamping and masking effects. A new method, called the modified generalized DFFITS (MGDFF), is developed in this regard, whereby the suspected HLPs in the initial subset are identified using our proposed IDRGP diagnostic method. To the best of our knowledge, no research has been done on the classification of observations into regular, good and bad IOs. Thus, the IDRGP-MGDFF plot is formulated to close this gap in the literature.

This thesis also addresses the problem of multicollinearity in multiple linear regression models with regard to two sources. The first source is HLPs; the second is the data collection method employed, constraints on the model or in the population, model specification and an over-defined model. However, no research has focused on a parameter estimation method that remedies the multicollinearity problem caused by multiple HLPs. Hence, we propose a new estimation method, namely the modified GM-estimator (MGM) based on MGDFF. The results of the study indicate that the MGM estimator is the most efficient method for rectifying the multicollinearity problem caused by HLPs. When multicollinearity is due to other sources (not HLPs), several classical methods are available; among them, Ridge Regression (RR), Jackknife Ridge Regression (JRR) and Latent Root Regression (LRR) are put forward to remedy this problem. Nevertheless, it is now evident that these classical estimation methods perform poorly when outliers exist in the data. In this regard, we propose two types of robust estimation methods. The first type is an improved version of LRR that rectifies the simultaneous problems of multicollinearity and outliers. The proposed methods are formulated by incorporating the robust MM-estimator and the modified GM-estimator (MGM) into the LRR algorithm; we call them the Latent Root MM-based (LRMMB) and Latent Root MGM-based (LRMGMB) methods.
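The IDRGP and MGDFF diagnostics described above are the thesis's own contributions and are not reproduced in this record. As a point of reference only, the sketch below applies the classical leverage (hat value) and DFFITS diagnostics that such methods generalize to a small simulated data set with one planted bad high leverage point and one vertical outlier; the data, cut-offs and variable names are illustrative assumptions, not material from the thesis.

```python
# Minimal sketch (assumed simulated data): classical leverage and DFFITS
# diagnostics, the baseline that IDRGP/MGDFF-type methods aim to improve on.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(scale=0.5, size=n)

# Plant one bad high leverage point (extreme X, response left unchanged)
# and one vertical outlier (extreme residual at an ordinary X position).
X[0] = [8.0, 8.0, 8.0]
y[1] += 10.0

results = sm.OLS(y, sm.add_constant(X)).fit()
influence = results.get_influence()

hat = influence.hat_matrix_diag              # leverage values h_ii
dffits, dffits_cutoff = influence.dffits     # classical DFFITS and its cut-off

k = X.shape[1] + 1                           # number of parameters incl. intercept
print("suspected high leverage points:", np.where(hat > 2 * k / n)[0])
print("suspected influential observations:",
      np.where(np.abs(dffits) > dffits_cutoff)[0])
```

Observations flagged by both criteria are the kind of bad influential observations the thesis targets; its argument is that these classical cut-offs themselves suffer from masking and swamping when several such points occur together, which is what motivates IDRGP and MGDFF.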
Similar to the first type, the second type of robust multicollinearity estimation method aims to improve the performance of robust jackknife ridge regression. The MM-estimator and the MGM-estimator are integrated into the JRR algorithm to establish improved versions of JRR. The suggested methods are called the jackknife ridge MM-based (JRMMB) and the jackknife ridge MGM-based (JRMGMB) estimators. All the proposed methods outperform the commonly used methods when multicollinearity comes together with multiple HLPs. The classical multicollinearity diagnostic measure is not suitable for correctly diagnosing the existence of multicollinearity in the presence of multiple HLPs: when the classical variance inflation factor (VIF) is employed, HLPs can artificially increase or decrease the apparent multicollinearity pattern. This gives misleading conclusions and an incorrect indication of how to solve the multicollinearity problem. In this respect, we propose a robust VIF, denoted RVIF(JACK-MGM), which serves as a good indicator that can help statistics practitioners choose an appropriate estimator to solve the multicollinearity problem.
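The proposed RVIF(JACK-MGM) is not specified in this record, so the sketch below only illustrates the behaviour the abstract criticizes: the classical VIF, computed here with statsmodels, collapses once a single high leverage point disturbs an otherwise nearly collinear pair of predictors. The simulated predictors and the coordinates of the planted HLP are assumptions made for illustration.

```python
# Minimal sketch (assumed simulated data): how one high leverage point can
# mask the multicollinearity that the classical VIF is supposed to flag.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)     # nearly collinear with x1
x3 = rng.normal(size=n)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

vif_clean = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]

X_hlp = X.copy()
X_hlp[0, 1:] = [10.0, -10.0, 0.0]           # one HLP breaking the x1-x2 relation
vif_hlp = [variance_inflation_factor(X_hlp, j) for j in range(1, X.shape[1])]

print("classical VIF (clean data):  ", np.round(vif_clean, 2))
print("classical VIF (with one HLP):", np.round(vif_hlp, 2))
```

On the clean data the VIFs of the collinear pair are large, signalling the multicollinearity correctly; with the single HLP they fall towards 1, which is exactly the kind of misleading indication a robust VIF such as RVIF(JACK-MGM) is meant to avoid.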