Integrated framework with association analysis for gene selection in microarray data classification

Microarray data classification is one of the major interests in health informatics that aims at discovering hidden patterns in gene expression profiles. The main challenge in building this classification system is the curse of dimensionality problem. Therefore, gene selection is an indispensable tas...

Bibliographic Details
Main Author: Ong, Huey Fang
Format: Thesis
Language: English
Published: 2011
Subjects: DNA microarrays - Classification; Gene expression; Data mining
Online Access: http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf
id my-upm-ir.27711
record_format uketd_dc
spelling my-upm-ir.27711 2014-04-10T04:22:58Z Integrated framework with association analysis for gene selection in microarray data classification 2011-04 Ong, Huey Fang DNA microarrays - Classification Gene expression Data mining 2011-04 Thesis http://psasir.upm.edu.my/id/eprint/27711/ http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf application/pdf en public masters Universiti Putra Malaysia Faculty of Computer Science and Information Technology English
institution Universiti Putra Malaysia
collection PSAS Institutional Repository
language English
topic DNA microarrays - Classification
Gene expression
Data mining
description Microarray data classification is a major interest in health informatics that aims to discover hidden patterns in gene expression profiles. The main challenge in building such a classification system is the curse of dimensionality. Gene selection is therefore an indispensable task in microarray data classification, used to identify smaller sets of relevant genes. However, most existing gene selection methods are statistical analyses that rely purely on gene expression values to identify differentially expressed genes. As a result, the selected genes may be false positives and may not be biologically meaningful. The purpose of this study was to integrate microarray datasets with additional biological information in order to select genes that are not only differentially expressed but also informative for classifiers. To that end, an integrated framework with a new gene selection method was developed to improve classification performance in terms of accuracy and number of selected genes. The proposed gene selection method combines the strengths of a filter method and association analysis to identify a set of discriminative and informative genes. Association analysis was employed to integrate more than one type of biological information in the same transaction database, and to identify groups of genes that frequently co-occur in target samples. The existing association algorithm for mining frequent itemsets was modified so that genes in each itemset are sorted by their discriminative scores rather than in lexicographic order. In addition, discriminative scores were used to compute the interestingness of frequent itemsets before ranking them. The proposed integrated framework was tested on colon cancer, leukemia, breast cancer and lung cancer microarray datasets.
Two types of biological information were incorporated in the selection process: the Gene Ontology (GO) and KEGG Pathways (KEGG). The experimental results showed that the recommended GO-based, KEGG-based, and GO-KEGG-based models outperformed the expression-only models, attaining better classification accuracies with fewer genes. In the experiments, the leukemia and lung cancer datasets achieved 100% accuracy with all classifiers, using as few as three selected genes. The colon cancer and breast cancer datasets achieved classification accuracies of 95.16% and 95.88% respectively, better than those of the previous integrated method. Moreover, the proposed integrated framework was shown to build informative and interpretable microarray classification models. The selected genes can be traced back to their functional annotations and association groups for reasoning and for creating new hypotheses for future investigation.
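The itemset-mining idea in the description can be illustrated with a minimal sketch. It is not the thesis's implementation: the record does not give the algorithm's details, so everything here is an assumption made for illustration. Each transaction is taken to be the set of genes flagged in one target sample, the search is a plain Apriori-style level-wise search, and interestingness is assumed to be the mean discriminative score of the genes in an itemset. The names mine_frequent_itemsets and rank_itemsets, the gene labels, and the scores are all hypothetical. GO or KEGG terms could be added to the transactions as extra items, which is omitted here.

```python
from itertools import combinations


def mine_frequent_itemsets(transactions, scores, min_support, max_size=3):
    """Apriori-style search with items ordered by discriminative score.

    Assumption: `scores` maps every gene to a precomputed filter score
    (higher = more discriminative). Itemsets are stored as tuples whose
    genes appear in descending score order, not lexicographic order.
    """
    n = len(transactions)
    # Descending score; gene name only breaks ties.
    ordered = sorted(scores, key=lambda g: (-scores[g], g))

    def support(itemset):
        items = set(itemset)
        return sum(1 for t in transactions if items <= t) / n

    found = {}
    level = []
    for gene in ordered:
        s = support((gene,))
        if s >= min_support:
            found[(gene,)] = s
            level.append((gene,))

    size = 1
    while level and size < max_size:
        next_level = []
        # Join two itemsets that share all but their last gene.
        # `level` is kept in score order, so `a` always precedes `b`.
        for a, b in combinations(level, 2):
            if a[:-1] == b[:-1]:
                candidate = a + (b[-1],)
                s = support(candidate)
                if s >= min_support:
                    found[candidate] = s
                    next_level.append(candidate)
        level = next_level
        size += 1
    return found


def rank_itemsets(found, scores):
    """Rank itemsets by an assumed interestingness: mean gene score."""
    def interestingness(itemset):
        return sum(scores[g] for g in itemset) / len(itemset)
    return sorted(found, key=lambda s: (-interestingness(s), -found[s]))


# Hypothetical toy data: four target samples and four genes.
transactions = [
    {"g1", "g2", "g3"},
    {"g1", "g2"},
    {"g1", "g3"},
    {"g2", "g3", "g4"},
]
scores = {"g1": 0.9, "g2": 0.5, "g3": 0.7, "g4": 0.1}

itemsets = mine_frequent_itemsets(transactions, scores, min_support=0.5)
ranked = rank_itemsets(itemsets, scores)
for itemset in ranked:
    print(itemset, itemsets[itemset])
```

Ordering items by score rather than by name means the most discriminative gene always leads an itemset, so top-ranked itemsets can be read directly as candidate gene groups. How the thesis actually weights scores against support is not stated in this record.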
format Thesis
qualification_level Master's degree
author Ong, Huey Fang
title Integrated framework with association analysis for gene selection in microarray data classification
granting_institution Universiti Putra Malaysia
granting_department Faculty of Computer Science and Information Technology
publishDate 2011
url http://psasir.upm.edu.my/id/eprint/27711/1/FSKTM%202011%2029R.pdf
_version_ 1747811596562857984