A multi-objectives genetic algorithm clustering ensembles based approach to summarize relational data


Bibliographic Details
Main Author: Gabriel, Jong Chiye
Format: Thesis
Language: English
Published: 2015
Online Access: https://eprints.ums.edu.my/id/eprint/12105/1/mt0000000678.pdf
Description
Summary: K-means is one of the best-known clustering algorithms and promises to converge to a local optimum in a few iterations. However, the traditional k-means algorithm is designed to cluster data stored in a single target table. In real-life applications, much of the data collected is stored in relational databases, and traditional clustering and classification algorithms cannot be applied directly to multi-relational databases. Several approaches have been proposed for learning from relational data, including Inductive Logic Programming based approaches, graph-based approaches, multi-view approaches, and the Dynamic Aggregation of Relational Attributes approach. Dynamic Aggregation of Relational Attributes is very effective in learning from relational data sets: it summarizes relational data by clustering the records that exist in non-target tables. However, the quality of the summarization depends highly on the positions of the initial centroids selected, which may in turn affect the overall classification task. This project therefore proposes Genetic Algorithm-based Clustering Ensembles for learning from relational data sets, combining the results of several k-means clustering runs with different numbers of clusters, in which the centroid locations are optimal for every set of clusters. The effects of using different similarity measures and of applying different fitness functions for the genetic algorithm on the predictive accuracy of the classifiers are also studied. The results show that using the consensus of several clustering results can increase the predictive accuracy of the classification task.
Euclidean distance performs better on the mutagenesis data sets and cosine similarity performs better on the hepatitis data sets when evaluated with the Weka C4.5 classifier, whereas the reverse holds when the Naïve Bayes classifier is used for evaluation.
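
The idea of combining several k-means runs with different numbers of clusters can be illustrated with a small sketch. This is not the thesis's method: it uses no genetic algorithm and no fitness function, and it swaps in a plain co-association (evidence accumulation) consensus instead. All function names, the toy points, and the 0.5 threshold are assumptions made for illustration only. Both distance measures named in the summary are included so that either can be passed to the clustering routine.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; a zero vector is treated as maximally distant."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


def kmeans(points, k, dist=euclidean, iters=50, seed=0):
    """Plain k-means with randomly chosen data points as initial centroids."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid under the chosen measure.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels


def co_association(labelings):
    """Fraction of runs in which each pair of records shares a cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(labelings) for c in row] for row in counts]


def consensus(matrix, threshold=0.5):
    """Group records linked by co-association >= threshold (connected components)."""
    n = len(matrix)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and matrix[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Toy data (hypothetical): two well-separated groups of three records each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
# Ensemble members: several seeds for each number of clusters.
runs = [kmeans(points, k, seed=s) for k in (2, 3) for s in range(5)]
final = consensus(co_association(runs))
print(final)
```

Passing dist=cosine_distance to kmeans switches the similarity measure. Note that averaging members is only an approximation of a centroid under cosine distance, so this sketch is not a faithful spherical k-means.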