K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets

Bibliographic Details
Main Author: Usman, Dauda
Format: Thesis
Language: English
Published: 2014
Subjects: QA Mathematics
Online Access: http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf
id my-utm-ep.77643
record_format uketd_dc
spelling my-utm-ep.77643 2018-06-26T07:37:23Z 2014-12 Thesis http://eprints.utm.my/id/eprint/77643/ application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98594 phd doctoral
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA Mathematics
description Data clustering is an unsupervised classification method that aims to create groups of objects, or clusters, that are distinct. Among clustering techniques, K-means is the most widely used. Two issues are prominent in constructing a K-means clustering algorithm: the optimal number of clusters and the centers of the clusters. In most cases the number of clusters is predetermined by the researcher, which leaves the challenge of determining the cluster centers so that scattered points can be grouped properly. If the cluster centers are not chosen correctly, computational complexity is expected to increase, especially for high-dimensional data sets. To obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed. This is achieved either by data standardization or by applying principal component analysis to rescaled data to reduce its dimensionality. Based on the outcomes of this preprocessing, a hybrid K-means clustering method of center initialization is developed to produce optimum-quality clusters, which makes the algorithm more efficient. This research investigates and analyzes the performance of the basic K-means clustering algorithm under three different standardization methods, namely decimal scaling, z-score and min-max. The results show that z-score performs best, judged by the sum of squared errors. Further experiments on the hybrid algorithm are conducted using uncorrelated and correlated simulated data sets of low, moderate and high dimension, and the method presented in this thesis is observed to give good and promising performance. It is also observed that the sum of the total clustering errors is reduced significantly, while the distances between clusters are kept as large as possible for better cluster identification. The results and findings are validated using real-life data on infectious diseases.
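The three standardization methods and the sum-of-squared-errors criterion named in the description can be illustrated with a minimal Python sketch. The record does not give the thesis's exact formulas, so the code below uses the common textbook definitions as an assumption; the function names are hypothetical, and the use of the population standard deviation in the z-score is likewise an assumption.

```python
from statistics import mean, pstdev


def z_score(column):
    """Rescale a column to zero mean and unit standard deviation.

    Assumption: population standard deviation; the thesis may use the sample form.
    """
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # constant column: nothing to rescale
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Rescale a column linearly onto the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column: avoid division by zero
    return [(x - lo) / (hi - lo) for x in column]


def decimal_scaling(column):
    """Divide by the smallest power of ten that brings every |x| below 1."""
    largest = max(abs(x) for x in column)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in column]


def sum_of_squared_errors(points, labels, centers):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )
```

Each scaler works on one feature (column) at a time; a comparison in the spirit of the description would rescale every column with one method, run K-means, and compare the resulting sum of squared errors across methods.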
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Usman, Dauda
title K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
granting_institution Universiti Teknologi Malaysia, Faculty of Science
granting_department Faculty of Science
publishDate 2014
url http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf
_version_ 1747817797405114368