K-means algorithm via preprocessing technique and singular value decomposition for high dimension datasets

Bibliographic Details
Main Author: Usman, Dauda
Format: Thesis
Language: English
Published: 2014
Online Access: http://eprints.utm.my/id/eprint/77643/1/DaudaUsmanPFS2014.pdf
Description
Summary: Data clustering is an unsupervised classification method that partitions objects into distinct groups, or clusters. Among clustering techniques, K-means is the most widely used. Two issues are prominent in constructing a K-means clustering algorithm: the optimal number of clusters and the cluster centers. In most cases the number of clusters is predetermined by the researcher, which leaves the challenge of determining the cluster centers so that scattered points are grouped properly. If the cluster centers are not chosen well, computational complexity is expected to increase, especially for high-dimensional data sets. To obtain an optimum solution for K-means cluster analysis, the data need to be preprocessed, either by standardization or by applying principal component analysis to the rescaled data to reduce its dimensionality. Based on the outcomes of this preprocessing, a hybrid K-means clustering method of center initialization is developed to produce optimum-quality clusters and make the algorithm more efficient. This research investigates and analyzes the performance of the basic K-means clustering algorithm under three standardization methods: decimal scaling, z-score, and min-max. The results show that z-score performs best, judging by the sum of squared errors. Further experiments on the hybrid algorithm are conducted using uncorrelated and correlated simulated data sets of low, moderate, and high dimension, and the method presented in this thesis is observed to perform well. It is also observed that the total clustering error is reduced significantly, while the distances between clusters are kept as large as possible for better cluster identification. The results and findings are validated using real-life data on infectious diseases.
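
The three standardization methods named in the summary, and the sum of squared errors used to compare them, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names are hypothetical, the z-score uses the population standard deviation, and min-max is assumed to map onto [0, 1]. Each function rescales a single feature column.

```python
from statistics import mean, pstdev


def z_score(column):
    """Rescale a column to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population (not sample) std dev
    if sigma == 0:
        return [0.0 for _ in column]  # constant column carries no spread
    return [(x - mu) / sigma for x in column]


def min_max(column, new_min=0.0, new_max=1.0):
    """Linearly map a column onto [new_min, new_max] (assumed [0, 1])."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [new_min for _ in column]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (x - lo) * scale for x in column]


def decimal_scaling(column):
    """Divide by the smallest power of 10 that makes every |value| < 1."""
    largest = max(abs(x) for x in column)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [x / 10 ** j for x in column]


def sum_of_squared_error(points, centers, labels):
    """Sum of squared Euclidean distances from points to assigned centers."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )
```

In a comparison like the one the summary describes, each method would be applied column by column before clustering, and the resulting sum of squared errors would be computed on the clustered output.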