An efficient clustering algorithm in the presence on outlier and doubtful data

The presence of outlying observations is a common problem in most statistical analyses. This is also true of cluster analysis techniques. Cluster analysis basically detects homogeneous clusters with large heterogeneity among them. To deal with outliers, a correct procedure in...

Bibliographic Details
Main Author:	Md. Jedi, Muhamad Alias
Format:	Thesis
Language:	English
Published:	2015
Subjects:	QA Mathematics
Online Access:	http://eprints.utm.my/id/eprint/79401/1/MuhamadAliasPFS2015.pdf

id	my-utm-ep.79401
record_format	uketd_dc
spelling	my-utm-ep.79401 2018-10-16T07:30:33Z An efficient clustering algorithm in the presence on outlier and doubtful data 2015 Md. Jedi, Muhamad Alias QA Mathematics 2015 Thesis http://eprints.utm.my/id/eprint/79401/ http://eprints.utm.my/id/eprint/79401/1/MuhamadAliasPFS2015.pdf application/pdf en public phd doctoral Universiti Teknologi Malaysia, Faculty of Science Faculty of Science
institution	Universiti Teknologi Malaysia
collection	UTM Institutional Repository
language	English
topic	QA Mathematics
spellingShingle	QA Mathematics Md. Jedi, Muhamad Alias An efficient clustering algorithm in the presence on outlier and doubtful data
description	The presence of outlying observations is a common problem in most statistical analyses, and cluster analysis is no exception. Cluster analysis seeks homogeneous clusters with large heterogeneity among them. Outliers need a correct procedure in cluster analysis because they often appear joined together, which may lead to a wrong cluster structure. This research proposes RTCLUST, a new method based on trimming in clustering (TCLUST) that uses information from TCLUST, partitioning around medoids (PAM), the doubtful cluster method and the local outlier factor (LOF). TCLUST is a clustering method with a constraint on the covariance matrices; here the constraint is placed on the eigenvalues. The spurious outlier model explains how to use the eigenvalue ratio, c, to obtain a good clustering, and the quality of the clustering is assessed using the mean of discriminant. The value c = 50 is found to be better than c = 1, the value used in a previous study. The trimmed likelihood is then used to determine the trimming proportion, α, and the number of clusters, k. The next procedure, known as MPAM, combines TCLUST and PAM. PAM is used because the mean silhouette explains the clustering much better. The information obtained from MPAM is c = 50, α and k. Different sample sizes are also used to test the suitability of MPAM. The mean of discriminant and the mean silhouette are then used to measure the strength of the clustering, and the trimmed likelihood curve is used again to check the values of α and k. In the next step, the doubtful cluster method with c = 50 reveals the overlapping outliers that exist between clusters. The data in the overlapping area are classified as doubtful outliers, and the best threshold is found to be 0.1. Lastly, LOF is used to differentiate between doubtful outliers and real outliers in the overlapping areas. Since LOF can detect real outliers, these outliers must be deleted. The mean of discriminant and the mean silhouette are obtained again after the real outliers are deleted, and a trimmed likelihood curve is then used to obtain the final values of α and k. The new RTCLUST procedure thus uses c = 50 and a threshold of 0.1 to obtain the mean of discriminant and the mean silhouette. To justify RTCLUST, a Monte Carlo simulation with a medium sample size is carried out to check whether combining the methods is appropriate, which also allows the normality of RTCLUST to be checked. The results show that the normality assumption for RTCLUST is fulfilled and that a Bayesian test can be used to make a significant decision on the value of k. RTCLUST attains the lowest RMSE, which shows that it performs better than MPAM and TCLUST on both simulated and real data.
format	Thesis
qualification_name	Doctor of Philosophy (PhD.)
qualification_level	Doctorate
author	Md. Jedi, Muhamad Alias
author_facet	Md. Jedi, Muhamad Alias
author_sort	Md. Jedi, Muhamad Alias
title	An efficient clustering algorithm in the presence on outlier and doubtful data
title_sort	efficient clustering algorithm in the presence on outlier and doubtful data
granting_institution	Universiti Teknologi Malaysia, Faculty of Science
granting_department	Faculty of Science
publishDate	2015
url	http://eprints.utm.my/id/eprint/79401/1/MuhamadAliasPFS2015.pdf
_version_	1747818219119312896
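
The description above names two standard diagnostics, the mean silhouette and the local outlier factor (LOF), without giving formulas. The sketch below is not the thesis's code. It is a minimal pure-Python illustration of the textbook definitions, run on made-up toy points. The function names, the neighbourhood size `n_neighbors = 3` and the data are all assumptions for illustration; TCLUST, PAM and the doubtful cluster step are not reproduced here.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width over all points (needs at least two clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def lof_scores(points, n_neighbors):
    """Local outlier factor of every point; assumes no duplicate points."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neigh = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: d[i][j])
        kd = d[i][order[n_neighbors - 1]]  # distance to the k-th nearest neighbour
        k_dist.append(kd)
        neigh.append([j for j in order if d[i][j] <= kd])  # ties are included
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d[i][j]) for j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i]) for i in range(n)
    ]


# Toy data (hypothetical): two tight groups plus one stray point between them.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(10, 10), (10, 11), (11, 10), (11, 11)]
stray = (5, 5)

scores = lof_scores(group_a + group_b + [stray], n_neighbors=3)
sil = mean_silhouette(group_a + group_b, [0] * 4 + [1] * 4)

print("LOF of stray point:", round(scores[-1], 2))
print("mean silhouette without the stray point:", round(sil, 2))
```

A LOF close to 1 means a point is about as densely surrounded as its neighbours, while a value well above 1 flags a candidate outlier. Any cutoff applied to these scores would be a further assumption; the threshold of 0.1 in the description belongs to the doubtful cluster step, not to LOF.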
