An efficient clustering algorithm in the presence on outlier and doubtful data

The presence of outlying observations is a common problem in most statistical analyses. This is also true when using cluster analysis techniques. Cluster analysis basically detects homogeneous clusters with large heterogeneity among them. To deal with outliers, a correct procedure in...

Bibliographic Details
Main Author: Md. Jedi, Muhamad Alias
Format: Thesis
Language: English
Published: 2015
Subjects: QA Mathematics
Online Access: http://eprints.utm.my/id/eprint/79401/1/MuhamadAliasPFS2015.pdf
id my-utm-ep.79401
record_format uketd_dc
spelling my-utm-ep.79401 2018-10-16T07:30:33Z 2015 Thesis http://eprints.utm.my/id/eprint/79401/ application/pdf en public phd doctoral
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA Mathematics
description The presence of outlying observations is a common problem in most statistical analyses, and cluster analysis is no exception. Cluster analysis seeks homogeneous clusters that are strongly heterogeneous from one another. Outliers need a correct procedure in cluster analysis because they often appear joined together, which can lead to a wrong cluster structure. This research proposes a new method based on trimmed clustering (TCLUST), known as RTCLUST, which uses information from TCLUST, partitioning around medoids (PAM), the doubtful cluster method and the local outlier factor (LOF). TCLUST is a clustering method with a constraint on the covariance matrices; in this case the constraint is placed on the eigenvalues. The spurious outlier model explains how to use the eigenvalue ratio, c, to obtain a good clustering, and the quality of the clustering is assessed using the mean of discriminant. The value c = 50 is found to be better than c = 1 from the previous study. The trimmed likelihood is then used to determine the trimming proportion, α, and the number of clusters, k. The next procedure combines TCLUST and PAM and is known as MPAM. PAM is used because the mean of silhouette explains the clustering much better. The information obtained from MPAM is c = 50, α and k. Different sample sizes are also used to test the suitability of MPAM. The mean of discriminant and the mean of silhouette are then used to measure the strength of the clustering, and the trimmed likelihood curve is used again to check the values of α and k. In the next step, the doubtful cluster method with c = 50 reveals the overlapping outliers that exist between clusters. The data in the overlapping area are classified as doubtful outliers, and the best threshold is found to be 0.1. Lastly, LOF is used to differentiate between doubtful outliers and real outliers in the overlapping areas. Since LOF can detect real outliers, these outliers must be deleted. The mean of discriminant and the mean of silhouette are obtained again after the real outliers are deleted, and a trimmed likelihood curve is then used to obtain the final values of α and k. The new RTCLUST procedure uses c = 50 and a threshold value of 0.1 to obtain the mean of discriminant and the mean of silhouette. To justify RTCLUST, a Monte Carlo simulation with a medium sample size is carried out to check whether combining the methods is appropriate, which also allows the normality of RTCLUST to be checked. The results show that the normality assumption for RTCLUST is fulfilled and that a Bayesian test can be used to decide the value of k. RTCLUST has the lowest RMSE value, which shows that it performs better than MPAM and TCLUST on both simulated and real data.
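The doubtful-cluster step in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis. It assumes the convention used for discriminant factors in trimmed clustering, where each observation's factor is the logarithm of its second-best assignment score divided by its best score, and an assignment is treated as doubtful when that factor exceeds log(threshold). The function names, the `score_rows` input and the example scores are hypothetical; only the threshold 0.1 comes from the abstract.

```python
import math


def discriminant_factor(scores):
    """Return log(second-best score / best score) for one observation.

    `scores` holds one positive assignment score per candidate cluster
    (hypothetical values, e.g. weighted densities). The result is <= 0;
    values close to 0 mean the two best clusters are hard to tell apart.
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    best, second = sorted(scores, reverse=True)[:2]
    if best <= 0:
        raise ValueError("the best score must be positive")
    if second <= 0:
        # No competing cluster at all: the assignment is as clear as possible.
        return float("-inf")
    return math.log(second / best)


def flag_doubtful(score_rows, threshold=0.1):
    """Flag observations whose discriminant factor exceeds log(threshold).

    threshold=0.1 is the value reported in the abstract; the decision rule
    itself is an assumption of this sketch.
    """
    cutoff = math.log(threshold)
    return [discriminant_factor(row) > cutoff for row in score_rows]


# Hypothetical scores for three observations and two clusters.
rows = [
    [0.90, 0.05],   # second/best is about 0.056, below 0.1: clear
    [0.50, 0.40],   # second/best is 0.8, above 0.1: doubtful (overlap)
    [0.30, 0.001],  # second/best is about 0.003, below 0.1: clear
]
flags = flag_doubtful(rows)
```

With a threshold of 0.1, an assignment counts as doubtful when the runner-up cluster scores more than one tenth of the winning cluster. In the procedure the abstract describes, the flagged observations would then be passed to LOF to separate doubtful outliers from real ones; that step is not shown here.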
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Md. Jedi, Muhamad Alias
title An efficient clustering algorithm in the presence on outlier and doubtful data
granting_institution Universiti Teknologi Malaysia, Faculty of Science
granting_department Faculty of Science
publishDate 2015
url http://eprints.utm.my/id/eprint/79401/1/MuhamadAliasPFS2015.pdf
_version_ 1747818219119312896