Optimized clustering with modified K-means algorithm

huge data is a big challenge. Clustering techniques can find hidden patterns and extract useful information from huge data. Among these techniques, the k-means algorithm is the most commonly used technique for determining the optimal number of clusters (k). However, the choice of k is a prominent...

Bibliographic Details
Main Author: Alibuhtto, Mohamed Cassim
Format: Thesis
Language: eng
Published: 2021
Subjects: QA Mathematics
Online Access: https://etd.uum.edu.my/9556/1/depositpermission-not%20allow_s902303.pdf
https://etd.uum.edu.my/9556/2/s902303_01.pdf
https://etd.uum.edu.my/9556/3/s902303_02.pdf
id my-uum-etd.9556
record_format uketd_dc
spelling my-uum-etd.9556 2022-06-27T07:05:03Z Optimized clustering with modified K-means algorithm 2021 Alibuhtto, Mohamed Cassim Mahat, Nor Idayu Awang Had Salleh Graduate School of Arts & Sciences QA Mathematics 2021 Thesis https://etd.uum.edu.my/9556/ https://etd.uum.edu.my/9556/1/depositpermission-not%20allow_s902303.pdf text eng staffonly https://etd.uum.edu.my/9556/2/s902303_01.pdf text eng staffonly https://etd.uum.edu.my/9556/3/s902303_02.pdf text eng staffonly other doctoral Universiti Utara Malaysia
institution Universiti Utara Malaysia
collection UUM ETD
language eng
advisor Mahat, Nor Idayu
topic QA Mathematics
spellingShingle QA Mathematics
Alibuhtto, Mohamed Cassim
Optimized clustering with modified K-means algorithm
description huge data is a big challenge. Clustering techniques can find hidden patterns and extract useful information from huge data. Among these techniques, the k-means algorithm is the most commonly used technique for determining the optimal number of clusters (k). However, the choice of k is a prominent problem in the k-means algorithm. In most cases, when clustering huge data, k is predetermined by the researcher, and an incorrectly chosen k can lead to wrong interpretation of the clusters and increased computational cost. Moreover, huge data often contain correlated variables, which can lead to incorrect clustering. To obtain the optimum number of clusters while also dealing with correlated variables in huge data, a modified k-means algorithm was proposed. The proposed algorithm uses a distance measure to compute the between-group separation, which accelerates the process of identifying an optimal number of clusters using k-means. Two distance measures were considered, namely the Euclidean and Manhattan distances. To deal with correlated variables, principal component analysis (PCA) was embedded in the proposed algorithm. The developed algorithms were tested on uncorrelated and correlated simulated data sets generated under various conditions. In addition, some real data sets were examined to validate the proposed algorithm. Empirical evidence from the simulated data sets indicated that the proposed modified k-means algorithm is able to recognise the optimum number of clusters for uncorrelated data sets, while the PCA-based modified k-means identified the optimum number of clusters for correlated data sets. The results also revealed that the modified k-means algorithm performs better at identifying the optimum number of clusters with the Euclidean distance than with the Manhattan distance. Testing on real data sets gave results consistent with those from the simulated ones.
Overall, the proposed modified k-means algorithm is able to determine the optimum number of clusters for huge data.
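The record does not give the details of the modified algorithm, so the following Python sketch only illustrates the general idea in the description: run k-means for several candidate values of k and use a distance measure between cluster centroids to judge the separation. Everything specific here is an assumption made for illustration and is not taken from the thesis: the function names (kmeans, separation_score, choose_k), the farthest-point initialisation, the scoring rule (smallest between-centroid distance divided by mean within-cluster distance), the search range for k, and the toy data. The PCA step for correlated variables is not shown.

```python
from itertools import combinations


def euclidean(a, b):
    # Straight-line distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))


def kmeans(points, k, dist, iters=50):
    # Assumption: deterministic farthest-point initialisation, so runs are repeatable.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(iters):
        # Assign every point to its nearest centroid under the chosen distance.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update each centroid to the coordinate-wise mean of its members
        # (the mean is kept even for Manhattan distance, as in plain k-means).
        new = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters


def separation_score(centroids, clusters, dist):
    # Hypothetical criterion: smallest between-centroid distance
    # relative to the mean distance of points to their own centroid.
    between = min(dist(a, b) for a, b in combinations(centroids, 2))
    within = [dist(p, c) for c, members in zip(centroids, clusters) for p in members]
    return between / (sum(within) / len(within) + 1e-12)


def choose_k(points, dist=euclidean, k_max=6):
    # Try k = 2..k_max and keep the k with the best separation score.
    scores = {k: separation_score(*kmeans(points, k, dist), dist) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get)


# Toy data (made up for this sketch): three well-separated groups of four points.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
print(choose_k(points, euclidean), choose_k(points, manhattan))
```

Passing the distance function as a parameter keeps the Euclidean and Manhattan variants on the same code path, which makes it easy to compare the value of k each one selects on the same data.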
format Thesis
qualification_name other
qualification_level Doctorate
author Alibuhtto, Mohamed Cassim
author_facet Alibuhtto, Mohamed Cassim
author_sort Alibuhtto, Mohamed Cassim
title Optimized clustering with modified K-means algorithm
title_short Optimized clustering with modified K-means algorithm
title_full Optimized clustering with modified K-means algorithm
title_fullStr Optimized clustering with modified K-means algorithm
title_full_unstemmed Optimized clustering with modified K-means algorithm
title_sort optimized clustering with modified k-means algorithm
granting_institution Universiti Utara Malaysia
granting_department Awang Had Salleh Graduate School of Arts & Sciences
publishDate 2021
url https://etd.uum.edu.my/9556/1/depositpermission-not%20allow_s902303.pdf
https://etd.uum.edu.my/9556/2/s902303_01.pdf
https://etd.uum.edu.my/9556/3/s902303_02.pdf
_version_ 1747828621590921216