Modelling the classification of twilight zone proteins using structure-based phylogenetic inferences / Siti Fatimah Mohd Taha

Bibliographic Details
Main Author: Mohd Taha, Siti Fatimah
Format: Thesis
Language: English
Published: 2018
Online Access: https://ir.uitm.edu.my/id/eprint/79360/1/79360.pdf
Description
Summary: Structural studies of proteins have become a focal point for researchers as a result of the rapid growth in the number of novel proteins and their substantial contribution to drug discovery. Because structural features are highly conserved, modelling and predicting the function of proteins commonly rely on them, with reference to protein classification. Proteins with a close evolutionary relationship usually share significant sequence similarity and are mostly studied using sequence-based approaches. However, evolutionary changes such as mutations can substantially alter sequences and thus lead to unreliable classification when dealing with highly dissimilar sequences of homologous proteins. As structures are highly conserved during evolution, a structure-based approach is the most suitable way to infer homology between distantly related proteins. Previous studies have primarily focussed on finding protein homology rather than on classifying proteins into families that represent evolutionary relationships. So far, there has been little discussion of the use of structural similarity for protein classification, and no study has yet examined the accuracy of structural alignment tools in supporting an accurate phylogenetic classification of proteins. This thesis presents a study of structure-based methods for aligning twilight zone proteins to provide an accurate model for protein classification. A total of 716 proteins were chosen randomly from four major classes defined in the SCOPe database: all-α proteins (A), all-β proteins (B), α/β proteins (C) and α+β proteins (D). Structural alignment was conducted using six methods provided by five structural alignment tools, namely CE, FATCAT, GANGSTA+, Matras and TM-Align. A sequence-based alignment was also performed using T-COFFEE to provide a comparison with the accuracy and reliability of the structural methods.
A distance-based phylogenetic approach, UPGMA, was then applied, using RMSD values as input distances, to produce classification trees. The trees were evaluated by manually comparing the arrangement of clusters against the SCOPe v2.5 classification. External clustering metrics such as the Adjusted Rand Index (ARI) were also used to validate the clusters. The results showed that the structure-based approaches were more reliable than the sequence-based approach for classifying the twilight zone proteins. ARI scores obtained from the structural trees outperformed the sequence-based approach for all folds at the superfamily level and for 91.67% of folds at the family level. CE performed best for classes A and C, whereas proteins from classes B and D were best aligned using TM-Align. Based on these findings, a pipeline was developed to automate the classification analysis and was tested in two case studies, involving Alzheimer's disease proteins and substrate-binding proteins (SBPs) respectively. Both case studies demonstrated the feasibility of the proposed pipeline for providing a reliable classification of twilight zone proteins and for serving as a guideline for future studies.
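
The clustering and validation steps described in the summary can be illustrated with a minimal sketch. It is not the thesis pipeline: the function names `upgma_clusters` and `adjusted_rand_index`, the five-protein RMSD matrix and the family labels are all hypothetical and were invented for this example. The sketch assumes that pairwise RMSD values are used directly as distances, merges clusters by average linkage (the UPGMA criterion) until a chosen number of clusters remains, and then scores the result against reference labels with the Adjusted Rand Index.

```python
from itertools import combinations
from math import comb


def upgma_clusters(dist, n_clusters):
    """Average-linkage (UPGMA) agglomeration on a symmetric distance matrix.

    Repeatedly merges the two closest clusters until n_clusters remain,
    then returns one integer cluster label per item.
    """
    clusters = [[i] for i in range(len(dist))]

    def avg_dist(a, b):
        # UPGMA distance: mean of all pairwise distances between members.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings (needs at least 2 items)."""
    n = len(truth)
    contingency, rows, cols = {}, {}, {}
    for t, p in zip(truth, pred):
        contingency[(t, p)] = contingency.get((t, p), 0) + 1
        rows[t] = rows.get(t, 0) + 1
        cols[p] = cols.get(p, 0) + 1
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings place everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical pairwise RMSD values (in angstroms) for five toy proteins.
rmsd = [
    [0.0, 1.2, 1.5, 6.0, 6.4],
    [1.2, 0.0, 1.1, 5.8, 6.1],
    [1.5, 1.1, 0.0, 6.2, 5.9],
    [6.0, 5.8, 6.2, 0.0, 1.3],
    [6.4, 6.1, 5.9, 1.3, 0.0],
]
# Hypothetical reference families standing in for a SCOPe classification.
reference = ["fam1", "fam1", "fam1", "fam2", "fam2"]

predicted = upgma_clusters(rmsd, n_clusters=2)
ari = adjusted_rand_index(reference, predicted)
print(predicted, ari)
```

In this toy case the two low-RMSD groups are recovered exactly, so the ARI is 1.0; an ARI near 0 would indicate agreement no better than chance. A full analysis would more likely build the complete tree with an established library and cut it at the family or superfamily level rather than use this quadratic toy loop.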