Identifying cancer gene subtypes from gene expression by co-clustering algorithm and support vector machine


Bibliographic Details
Main Author: Machap, Logenthiran
Format: Thesis
Language: English
Published: 2021
Online Access: http://eprints.utm.my/id/eprint/96282/1/LogenthiranMachapPSC2021.pdf.pdf
Description
Summary: Cancer subtype information is important for understanding tumour heterogeneity. Existing methods for finding cancer subtypes have relied mainly on traditional clustering algorithms such as hierarchical clustering. Because most of these methods operate on high-dimensional data, their drawback is that they partition genes into disjoint clusters, so that a gene or a condition belongs to only one cluster. However, a gene may contribute to more than one biological process and may therefore belong to multiple clusters. In addition, the centroid in the objective function of network-assisted co-clustering for the identification of cancer subtypes (NCIS) is dragged by outliers, so these outliers receive their own cluster instead of being ignored. This research therefore focuses on improving the NCIS method. Enhanced NCIS (iNCIS) assigns weights to genes based on a gene interaction network and iteratively optimizes the sum-squared residue to obtain co-clusters. Next, supervised infinite feature selection with multiple support vector machine (SinfFS-mSVM) is proposed to select significant genes from high-dimensional data using the classes obtained from iNCIS, and to improve classification accuracy. The effectiveness of iNCIS and SinfFS-mSVM is evaluated on large-scale Breast Cancer (BRCA) and Glioblastoma Multiforme (GBM) datasets from The Cancer Genome Atlas (TCGA) project. Five breast cancer gene subtypes and four glioblastoma multiforme gene subtypes were successfully identified. The weighted co-clustering approach in iNCIS provides a unique way to integrate gene interaction network information into the clustering process. The co-clustering Rand Index and F1-measure improved by 54.5% and 33.9% for BRCA and by 34.2% and 31.5% for GBM, respectively. Meanwhile, SinfFS-mSVM selected a significant gene subset with higher classification accuracy.
The classification accuracy of the selected gene subset improved by 3.00% and 2.99% for BRCA and GBM, respectively. Furthermore, biological validation was conducted on the selected genes from each subtype to justify the validity of the results. In conclusion, the empirical study on large-scale cancer datasets shows that iNCIS and SinfFS-mSVM comprehensively identify cancer gene subtypes and genes while achieving higher clustering and classification accuracy. Future work is needed to integrate more comprehensive gene network information and to select optimal parameters.
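
The summary reports clustering quality as a Rand Index and an F1-measure. As a minimal illustration of how such scores can be computed, the sketch below counts sample pairs that two labellings agree or disagree on. It is not taken from the thesis: the function names (`pair_counts`, `rand_index`, `pairwise_f1`) and the toy labels are hypothetical, and the pair-counting form of F1 is an assumption, since the record does not say which F1 definition the author used.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Classify every pair of samples by whether each labelling groups it together.

    Returns (tp, fp, fn, tn), where a "positive" pair is one that the
    predicted clustering places in the same cluster.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labellings agree."""
    tp, fp, fn, tn = pair_counts(labels_true, labels_pred)
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 1.0


def pairwise_f1(labels_true, labels_pred):
    """Pair-counting F1 (assumed definition, see the note above)."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


if __name__ == "__main__":
    # Toy subtype labels for six samples (made up for illustration only).
    reference = [0, 0, 1, 1, 2, 2]
    predicted = [0, 0, 1, 2, 2, 2]
    print(rand_index(reference, predicted))
    print(pairwise_f1(reference, predicted))
```

Both scores depend only on which samples are grouped together, so renaming the cluster labels leaves them unchanged.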