Taxonomy learning from Malay texts using artificial immune system based clustering

Bibliographic Details
Main Author: Ahmad Nazri, Mohd. Zakree
Format: Thesis
Language: English
Published: 2011
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.utm.my/id/eprint/36947/1/MohdZakreeAhmadNazriPFSKSM2011.pdf
id my-utm-ep.36947
record_format uketd_dc
spelling my-utm-ep.36947 2018-05-27T08:15:39Z 2011-04 Thesis http://eprints.utm.my/id/eprint/36947/ application/pdf en public phd doctoral
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA75 Electronic computers. Computer science
description In taxonomy learning from texts, the extracted features used to describe the context of a term are usually erroneous and sparse. Various attempts to overcome data sparseness and noise have been made using clustering algorithms such as Hierarchical Agglomerative Clustering (HAC), Bisecting K-means and Guided Agglomerative Hierarchical Clustering (GAHC). However, these methods suffer from low recall. Therefore, the purpose of this study is to investigate the application of two hybridized artificial immune system (AIS) algorithms to taxonomy learning from Malay text, and to develop a Google-based Text Miner (GTM) for feature selection to reduce data sparseness. Two novel taxonomy learning algorithms are proposed and compared with the benchmark methods (i.e., HAC, GAHC and Bisecting K-means). The first algorithm, GCAINT (Guided Clustering and aiNet for Taxonomy Learning), is designed through the hybridization of GAHC and Artificial Immune Network (aiNet). The GCAINT algorithm exploits a Hypernym Oracle (HO) to guide the hierarchical clustering process and produces better results than the benchmark methods. However, the Malay HO introduces erroneous hypernym-hyponym pairs, which affects the results. Therefore, the second novel algorithm, CLOSAT (Clonal Selection Algorithm for Taxonomy Learning), is proposed by hybridizing the Clonal Selection Algorithm (CLONALG) and Bisecting K-means. CLOSAT produces the best results compared with the benchmark methods and GCAINT. The GTM is proposed to reduce sparseness in the obtained dataset. However, the experimental results reveal that GTM introduces too much noise into the dataset, which leads to many false-positive hypernym-hyponym pairs. The effect of different combinations of affinity measures (i.e., Hamming, Jaccard and Rand) on the performance of the developed methods was also studied. Jaccard is found to be better than Hamming and Rand in measuring the similarity between terms. In addition, the use of Particle Swarm Optimization (PSO) for automatic parameter tuning of GCAINT and CLOSAT is also proposed. Experimental results demonstrate that, in most cases, PSO-tuned CLOSAT and GCAINT produce better results than the benchmark methods and are able to reduce data sparseness and noise in the dataset.
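
The abstract compares Hamming and Jaccard as affinity measures between terms. The record does not give the thesis's exact definitions, so the following is only a minimal sketch using the textbook forms of both measures on binary term-context vectors. The function names, the three Malay example terms and their feature vectors are all hypothetical and chosen for illustration. The sketch shows one property of the standard definitions: on sparse vectors, Hamming similarity counts shared zeros as agreement, while Jaccard ignores them.

```python
def hamming_similarity(a, b):
    """Fraction of positions at which two equal-length binary vectors agree."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def jaccard_similarity(a, b):
    """Shared 1-features divided by features set in either vector."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hypothetical sparse context vectors (1 = term occurs with that feature).
kucing = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # "cat"
anjing = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # "dog": shares one feature with "cat"
kereta = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]  # "car": shares no feature with "cat"

for name, other in (("anjing", anjing), ("kereta", kereta)):
    print(name,
          "hamming:", hamming_similarity(kucing, other),
          "jaccard:", jaccard_similarity(kucing, other))
```

With these made-up vectors, the Hamming scores for the related and the unrelated pair differ only slightly because most positions are zero in every vector, whereas the Jaccard score drops to zero when no feature is shared. This illustrates the general behaviour of the two measures on sparse data; it is not a reproduction of the thesis's experiments, and the Rand measure is not covered here.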
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Ahmad Nazri, Mohd. Zakree
title Taxonomy learning from Malay texts using artificial immune system based clustering
granting_institution Universiti Teknologi Malaysia, Faculty of Computer Science and Information System
granting_department Faculty of Computer Science and Information System
publishDate 2011
url http://eprints.utm.my/id/eprint/36947/1/MohdZakreeAhmadNazriPFSKSM2011.pdf