K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets

Data clustering is an unsupervised classification method aimed at creating groups of objects, or clusters, that are distinct. Among the clustering techniques, K-means is the most widely used. Two issues are prominent in creating a K-means clustering algorithm: the optimal number of clusters a...

Full description

Bibliographic Details
Main Author:	Usman, Dauda
Format:	Thesis
Language:	English
Published:	2014
Subjects:	QA Mathematics
Online Access:	http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf

id	my-utm-ep.77643
record_format	uketd_dc
spelling	my-utm-ep.776432018-06-26T07:37:23Z K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets 2014-12 Usman, Dauda QA Mathematics Data clustering is an unsupervised classification method aimed at creating groups of objects, or clusters, that are distinct. Among the clustering techniques, K-means is the most widely used. Two issues are prominent in creating a K-means clustering algorithm: the optimal number of clusters and the centers of the clusters. In most cases, the number of clusters is predetermined by the researcher, leaving the challenge of determining the cluster centers so that scattered points can be grouped properly. However, if the cluster centers are not chosen correctly, computational complexity is expected to increase, especially for high-dimensional data sets. In order to obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed. This is achieved either by data standardization or by applying principal component analysis to rescaled data to reduce its dimensionality. Based on the outcomes of the preprocessing, a hybrid K-means clustering method for center initialization is developed to produce clusters of optimum quality, which makes the algorithm more efficient. This research investigates and analyzes the performance of the basic K-means clustering algorithm when three different standardization methods are used, namely decimal scaling, z-score and min-max. The results show that z-score performs best, judging by the sum of squared errors. Further experiments on the hybrid algorithm are conducted using uncorrelated and correlated simulated data sets of low, moderate and high dimension, and it is observed that the method presented in this thesis gives good and promising performance. It is also observed that the sum of the total clustering errors is reduced significantly, whereas the distances between clusters are kept as large as possible for better cluster identification. The results and findings are validated using real-life data on infectious diseases. 2014-12 Thesis http://eprints.utm.my/id/eprint/77643/ http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98594 phd doctoral Universiti Teknologi Malaysia, Faculty of Science Faculty of Science
institution	Universiti Teknologi Malaysia
collection	UTM Institutional Repository
language	English
topic	QA Mathematics
spellingShingle	QA Mathematics Usman, Dauda K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
description	Data clustering is an unsupervised classification method aimed at creating groups of objects, or clusters, that are distinct. Among the clustering techniques, K-means is the most widely used. Two issues are prominent in creating a K-means clustering algorithm: the optimal number of clusters and the centers of the clusters. In most cases, the number of clusters is predetermined by the researcher, leaving the challenge of determining the cluster centers so that scattered points can be grouped properly. However, if the cluster centers are not chosen correctly, computational complexity is expected to increase, especially for high-dimensional data sets. In order to obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed. This is achieved either by data standardization or by applying principal component analysis to rescaled data to reduce its dimensionality. Based on the outcomes of the preprocessing, a hybrid K-means clustering method for center initialization is developed to produce clusters of optimum quality, which makes the algorithm more efficient. This research investigates and analyzes the performance of the basic K-means clustering algorithm when three different standardization methods are used, namely decimal scaling, z-score and min-max. The results show that z-score performs best, judging by the sum of squared errors. Further experiments on the hybrid algorithm are conducted using uncorrelated and correlated simulated data sets of low, moderate and high dimension, and it is observed that the method presented in this thesis gives good and promising performance. It is also observed that the sum of the total clustering errors is reduced significantly, whereas the distances between clusters are kept as large as possible for better cluster identification. The results and findings are validated using real-life data on infectious diseases.
format	Thesis
qualification_name	Doctor of Philosophy (PhD.)
qualification_level	Doctorate
author	Usman, Dauda
author_facet	Usman, Dauda
author_sort	Usman, Dauda
title	K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
title_short	K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
title_full	K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
title_fullStr	K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
title_full_unstemmed	K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
title_sort	k-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
granting_institution	Universiti Teknologi Malaysia, Faculty of Science
granting_department	Faculty of Science
publishDate	2014
url	http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf
_version_	1747817797405114368
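
Illustrative sketch (not part of the catalog record or the thesis): the description above compares basic K-means under three standardization methods (decimal scaling, z-score and min-max), judged by the sum of squared errors. The Python below shows roughly what such a comparison could look like. All function names, the toy data and the random center initialization are assumptions made for illustration; the thesis's hybrid center initialization and its PCA/SVD step are not reproduced here. Note also that the sum of squared errors depends on the scale of the data, and the record does not say how the thesis made the values comparable across methods.

```python
import math
import random


def min_max(column):
    """Rescale a column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def z_score(column):
    """Centre a column on its mean and divide by its (population) standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std if std else 0.0 for v in column]


def decimal_scaling(column):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    peak = max(abs(v) for v in column)
    j = 0
    while peak / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in column]


def standardize(rows, scaler):
    """Apply a column-wise scaler to a list of rows."""
    columns = [scaler(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, seed=0, max_iter=100):
    """Basic Lloyd-style K-means with random initial centers (an assumption,
    not the hybrid initialization developed in the thesis)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(squared_distance(row, centers[lab]) for row, lab in zip(rows, labels))
    return labels, centers, sse


# Toy data (hypothetical): two features on very different scales, two groups.
data = [[1.0, 200.0], [1.2, 220.0], [0.9, 210.0],
        [5.0, 900.0], [5.2, 880.0], [4.8, 910.0]]

for name, scaler in [("decimal scaling", decimal_scaling),
                     ("z-score", z_score),
                     ("min-max", min_max)]:
    labels, centers, sse = k_means(standardize(data, scaler), k=2)
    print(name, labels, round(sse, 4))
```

Each scaler works on one column at a time, so a feature with a large numeric range (the second column in the toy data) does not dominate the Euclidean distances used by K-means.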

Similar Items