Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine

Cancer subtype information is important for understanding tumour heterogeneity. Existing methods for finding cancer subtypes have relied on traditional clustering algorithms such as hierarchical clustering. Since most of these methods depend on high-dimensional data, the drawback is to divide th...

Bibliographic Details
Main Author: Machap, Logenthiran
Format: Thesis
Language: English
Published: 2021
Subjects:
Online Access: http://eprints.utm.my/id/eprint/96282/1/LogenthiranMachapPSC2021.pdf.pdf
id my-utm-ep.96282
record_format uketd_dc
spelling my-utm-ep.96282 2022-07-05T08:07:14Z Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine 2021 Machap, Logenthiran QA75 Electronic computers. Computer science 2021 Thesis http://eprints.utm.my/id/eprint/96282/ http://eprints.utm.my/id/eprint/96282/1/LogenthiranMachapPSC2021.pdf.pdf application/pdf en public http://dms.library.utm.my:8080/vital/access/manager/Repository/vital:143093 phd doctoral Universiti Teknologi Malaysia Faculty of Engineering - School of Computing
institution Universiti Teknologi Malaysia
collection UTM Institutional Repository
language English
topic QA75 Electronic computers. Computer science
spellingShingle QA75 Electronic computers. Computer science
Machap, Logenthiran
Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine
description Cancer subtype information is important for understanding tumour heterogeneity. Existing methods for finding cancer subtypes have relied on traditional clustering algorithms such as hierarchical clustering. Applied to high-dimensional data, these methods have the drawback of partitioning genes into disjoint clusters, so that a gene or a condition belongs to only one cluster. Because a gene may contribute to more than one biological process, it may belong to multiple clusters. In addition, the centroid in the objective function of network-assisted co-clustering for the identification of cancer subtypes (NCIS) is dragged by outliers, so these outliers get their own cluster instead of being ignored. This research therefore focuses on improving the NCIS method. Enhanced NCIS (iNCIS) assigns weights to genes based on a gene interaction network and iteratively optimizes the sum-squared residue to obtain co-clusters. Next, supervised infinite feature selection with multiple support vector machine (SinfFS-mSVM) is proposed to select significant genes from high-dimensional data, using the classes obtained from iNCIS, and to improve classification accuracy. The effectiveness of iNCIS and SinfFS-mSVM is evaluated on large-scale Breast Cancer (BRCA) and Glioblastoma Multiforme (GBM) datasets from The Cancer Genome Atlas (TCGA) project. Five breast cancer gene subtypes and four glioblastoma multiforme gene subtypes were successfully identified. The weighted co-clustering approach in iNCIS provides a unique way of integrating gene network interaction into the clustering process. The co-clustering Rand Index and F1-measure improved by 54.5% and 33.9% for BRCA and by 34.2% and 31.5% for GBM. Meanwhile, SinfFS-mSVM selected a significant gene subset with higher classification accuracy: accuracy for the selected subset improved by 3.00% for BRCA and 2.99% for GBM. Furthermore, biological validation of the selected genes from each subtype was conducted to support the validity of the results. In conclusion, the empirical study on large-scale cancer datasets shows that iNCIS and SinfFS-mSVM identify cancer gene subtypes and genes with higher clustering and classification accuracy. Future work is needed to integrate more comprehensive gene network information and to select optimal parameters.
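The abstract names the sum-squared residue as the quantity optimized to obtain co-clusters but does not state its formula. The sketch below illustrates one common definition from the co-clustering literature, in which the residue of an entry is its value minus its row mean and column mean within the block, plus the block mean. This is an assumption about the general technique, not the thesis's exact objective: the gene weights and interaction network used by iNCIS are not modelled, and the function names `sum_squared_residue` and `total_residue` are hypothetical.

```python
import numpy as np


def sum_squared_residue(block):
    """Sum of squared residues of one co-cluster (a submatrix).

    Assumed residue: h_ij = a_ij - rowmean_i - colmean_j + blockmean.
    A perfectly additive block (row effect + column effect) scores 0.
    """
    block = np.asarray(block, dtype=float)
    row_means = block.mean(axis=1, keepdims=True)
    col_means = block.mean(axis=0, keepdims=True)
    residue = block - row_means - col_means + block.mean()
    return float((residue ** 2).sum())


def total_residue(matrix, row_labels, col_labels):
    """Add up the residue over every (row cluster, column cluster) block."""
    matrix = np.asarray(matrix, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    total = 0.0
    for r in np.unique(row_labels):
        for c in np.unique(col_labels):
            # Select the rows in cluster r and the columns in cluster c.
            block = matrix[np.ix_(row_labels == r, col_labels == c)]
            total += sum_squared_residue(block)
    return total


# Toy expression matrix: rows are genes, columns are samples (made-up values).
toy = [[1, 2, 9, 9],
       [3, 4, 9, 9],
       [0, 0, 1, 0],
       [0, 0, 0, 1]]
score = total_residue(toy, row_labels=[0, 0, 1, 1], col_labels=[0, 0, 1, 1])
```

A co-clustering procedure of this kind would alternate between reassigning row labels and column labels so that this total decreases. Lower values mean each block is closer to a simple additive pattern.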
format Thesis
qualification_name Doctor of Philosophy (PhD.)
qualification_level Doctorate
author Machap, Logenthiran
author_facet Machap, Logenthiran
author_sort Machap, Logenthiran
title Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine
title_sort identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine
granting_institution Universiti Teknologi Malaysia
granting_department Faculty of Engineering - School of Computing
publishDate 2021
url http://eprints.utm.my/id/eprint/96282/1/LogenthiranMachapPSC2021.pdf.pdf
_version_ 1747818655365726208