Integrated framework with association analysis for gene selection in microarray data classification

Microarray data classification is one of the major interests in health informatics that aims at discovering hidden patterns in gene expression profiles. The main challenge in building this classification system is the curse of dimensionality problem. Therefore, gene selection is an indispensable tas...

Bibliographic Details
Main Author:	Ong, Huey Fang
Format:	Thesis
Language:	English
Published:	2011
Subjects:	DNA microarrays - Classification; Gene expression; Data mining
Online Access:	http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf

id	my-upm-ir.27711
record_format	uketd_dc
spelling	my-upm-ir.27711 2014-04-10T04:22:58Z 2011-04 Thesis http://psasir.upm.edu.my/id/eprint/27711/ application/pdf en public masters
institution	Universiti Putra Malaysia
collection	PSAS Institutional Repository
language	English
topic	DNA microarrays - Classification; Gene expression; Data mining
description	Microarray data classification is a major interest in health informatics; it aims to discover hidden patterns in gene expression profiles. The main challenge in building such a classification system is the curse of dimensionality. Gene selection is therefore an indispensable task in microarray data classification, used to identify smaller sets of relevant genes. However, most existing gene selection methods are statistical analyses that rely purely on gene expression values to identify differentially expressed genes. As a result, the selected genes may be false positives and may not be biologically meaningful. The purpose of this study was to integrate microarray datasets with additional biological information in order to select genes that are not only differentially expressed but also informative for classifiers. To achieve this, an integrated framework with a new gene selection method was developed to improve classification performance in terms of accuracy and number of selected genes. The proposed gene selection method combines the strengths of the filter method and association analysis to identify a set of discriminative and informative genes. Association analysis was employed to integrate more than one type of biological information in the same transaction database, and to identify groups of genes that frequently co-occur in target samples. The existing association algorithm for mining frequent itemsets was modified so that genes in each itemset are sorted by their discriminative scores rather than in lexicographic order. In addition, discriminative scores were used to compute the interestingness of frequent itemsets before ranking them.
The proposed integrated framework was tested on colon cancer, leukemia, breast cancer and lung cancer microarray datasets. Two types of biological information were incorporated in the selection process: the Gene Ontology (GO) and KEGG Pathways (KEGG). The experimental results showed that the recommended GO-based, KEGG-based, and GO-KEGG-based models outperformed the expression-only models, attaining better classification accuracy with fewer genes. In the experiments, the leukemia and lung cancer datasets achieved 100% accuracy with all classifiers, using as few as three selected genes. The colon cancer and breast cancer datasets achieved better classification accuracies than the previous integrated method, at 95.16% and 95.88% respectively. Moreover, the proposed integrated framework proved able to build informative and interpretable microarray classification models. The selected genes can be traced back to their functional annotations and association groups for reasoning and for generating new hypotheses for future investigation.
format	Thesis
qualification_level	Master's degree
author	Ong, Huey Fang
title	Integrated framework with association analysis for gene selection in microarray data classification
granting_institution	Universiti Putra Malaysia
granting_department	Faculty of Computer Science and Information Technology
publishDate	2011
url	http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf
_version_	1747811596562857984
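
The description mentions mining frequent itemsets in which genes are ordered by discriminative score and itemsets are ranked by an interestingness measure derived from those scores. The record does not give the thesis's actual algorithm or formula, so the following is only a minimal sketch under stated assumptions: the function name `mine_ranked_itemsets`, the gene names, the score values, and the interestingness definition (support multiplied by the mean discriminative score) are all hypothetical, and the search is a brute-force enumeration rather than a pruned Apriori-style miner.

```python
from itertools import combinations


def mine_ranked_itemsets(transactions, scores, min_support=0.5, max_size=3):
    """Return frequent gene itemsets ranked by an assumed interestingness.

    transactions: list of sets of gene names, one set per target sample.
    scores: dict mapping gene name -> discriminative score (hypothetical).
    """
    n = len(transactions)
    genes = sorted({g for t in transactions for g in t})
    results = []
    for size in range(1, max_size + 1):
        # Brute-force enumeration; a real miner would prune candidates.
        for combo in combinations(genes, size):
            items = set(combo)
            # Support: fraction of samples containing every gene in the set.
            support = sum(items <= t for t in transactions) / n
            if support < min_support:
                continue
            # Order genes by discriminative score (highest first),
            # not lexicographically; ties fall back to the gene name.
            ordered = tuple(sorted(combo, key=lambda g: (-scores[g], g)))
            # ASSUMPTION: interestingness = support * mean score.
            interest = support * sum(scores[g] for g in combo) / size
            results.append((ordered, support, interest))
    # Rank itemsets by interestingness, highest first.
    results.sort(key=lambda r: (-r[2], r[0]))
    return results


# Toy data: four target samples and three made-up genes.
transactions = [
    {"G1", "G2", "G3"},
    {"G1", "G3"},
    {"G2", "G3"},
    {"G1", "G2", "G3"},
]
scores = {"G1": 0.4, "G2": 0.9, "G3": 0.7}

for itemset, support, interest in mine_ranked_itemsets(transactions, scores):
    print(itemset, round(support, 2), round(interest, 3))
```

In this sketch a gene with a high discriminative score appears first within its itemset, and itemsets made of strongly discriminative genes rise in the ranking even when their support is similar to that of weaker sets.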


Similar Items