K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets
Data clustering is an unsupervised classification method aimed at creating groups of objects, or clusters, that are distinct. Among the clustering techniques, K-means is the most widely used. Two issues are prominent in creating a K-means clustering algorithm: the optimal number of clusters a...
Main Author: | Usman, Dauda |
---|---|
Format: | Thesis |
Language: | English |
Published: | 2014 |
Subjects: | QA Mathematics |
Online Access: | http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf |
id | my-utm-ep.77643 |
---|---|
record_format | uketd_dc |
spelling | my-utm-ep.77643 2018-06-26T07:37:23Z K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets 2014-12 Usman, Dauda QA Mathematics 2014-12 Thesis http://eprints.utm.my/id/eprint/77643/ http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:98594 phd doctoral Universiti Teknologi Malaysia, Faculty of Science Faculty of Science |
institution | Universiti Teknologi Malaysia |
collection | UTM Institutional Repository |
language | English |
topic | QA Mathematics |
spellingShingle | QA Mathematics Usman, Dauda K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets |
description | Data clustering is an unsupervised classification method aimed at creating groups of objects, or clusters, that are distinct. Among the clustering techniques, K-means is the most widely used. Two issues are prominent in creating a K-means clustering algorithm: the optimal number of clusters and the centers of the clusters. In most cases, the number of clusters is predetermined by the researcher, leaving the challenge of determining the cluster centers so that scattered points can be grouped properly. However, if the cluster centers are not chosen correctly, computational complexity is expected to increase, especially for high-dimensional data sets. To obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed. This is achieved either by data standardization or by applying principal component analysis to rescaled data to reduce its dimensionality. Based on the outcomes of the preprocessing, a hybrid K-means clustering method of center initialization is developed to produce optimum-quality clusters, which makes the algorithm more efficient. This research investigates and analyzes the performance of the basic K-means clustering algorithm when three different standardization methods are used, namely decimal scaling, z-score and min-max. The results show that z-score performs best, judging from the sum of squared errors. Further experiments on the hybrid algorithm are conducted using uncorrelated and correlated simulated data sets of low, moderate and high dimension, and it is observed that the method presented in this thesis gives good and promising performance. It is also observed that the sum of the total clustering errors is reduced significantly, whereas the distances between clusters are kept as large as possible for better cluster identification. The results and findings are validated using real-life data on infectious diseases. |
format | Thesis |
qualification_name | Doctor of Philosophy (PhD.) |
qualification_level | Doctorate |
author | Usman, Dauda |
author_facet | Usman, Dauda |
author_sort | Usman, Dauda |
title | K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets |
granting_institution | Universiti Teknologi Malaysia, Faculty of Science |
granting_department | Faculty of Science |
publishDate | 2014 |
url | http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf |
_version_ | 1747817797405114368 |
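
The description above names three standardization methods (decimal scaling, z-score and min-max), dimensionality reduction by principal component analysis, and the sum of squared errors as the quality measure. The following is a minimal NumPy sketch of those pieces, not the thesis's own code. All function names, the synthetic data and the parameter values are assumptions made for illustration. The `kmeans` function here is the basic algorithm with random initial centers; the hybrid center-initialization method developed in the thesis is not described in this record and is not reproduced.

```python
import numpy as np


def z_score(X):
    """Centre each column and scale it to unit standard deviation."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std


def min_max(X):
    """Rescale each column linearly onto [0, 1]."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0
    return (X - lo) / span


def decimal_scaling(X):
    """Divide each column by the smallest power of 10 that brings max |x| below 1."""
    max_abs = np.abs(X).max(axis=0)
    max_abs[max_abs == 0] = 1.0
    digits = np.floor(np.log10(max_abs)) + 1
    return X / (10.0 ** digits)


def pca_reduce(X, n_components):
    """Project centred data onto its leading principal axes, obtained by SVD."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def squared_distances(X, centers):
    """Squared Euclidean distance from every point to every center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)


def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means with random initial centers (assumed; not the hybrid method)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = squared_distances(X, centers).argmin(axis=1)
        # Keep the old center if a cluster ends up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = squared_distances(X, centers)
    labels = dists.argmin(axis=1)
    sse = dists.min(axis=1).sum()  # sum of squared errors to the nearest center
    return labels, centers, sse


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Hypothetical data: 3 groups of 50 points, 4 features on very different scales.
    group_means = rng.normal(0.0, 5.0, size=(3, 4))
    X = np.vstack([m + rng.normal(0.0, 1.0, size=(50, 4)) for m in group_means])
    X = X * np.array([1.0, 10.0, 100.0, 1000.0])

    methods = [("z-score", z_score), ("min-max", min_max),
               ("decimal scaling", decimal_scaling)]
    for name, scale in methods:
        _, _, sse = kmeans(scale(X), k=3)
        print(f"{name:>15}: SSE = {sse:.4f}")

    _, _, sse = kmeans(pca_reduce(z_score(X), 2), k=3)
    print(f"{'z-score + PCA':>15}: SSE = {sse:.4f}")
```

Each standardization puts the data on a different scale, so the printed SSE values depend on the scaling as well as on cluster quality. The numbers from this sketch come from synthetic data and say nothing about the results reported in the thesis.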