Alignment-free distance measures for clustering Expressed Sequence Tags

Bibliographic Details
Main Author: Ngo, Keng Hoong
Format: Thesis
Published: 2013
Description
Summary: Clustering of expressed sequence tags (ESTs) is a vital step in the EST analysis pipeline. The main goal of clustering is to gather overlapping ESTs from the same transcript of a single gene into a distinct cluster. A simple way to cluster ESTs is to compare them pair-wise for similarity. Earlier EST clustering tools used alignment-based distance measures such as BLAST, FASTA and the Smith-Waterman algorithm. However, the main shortcoming of the alignment-based approach is the high computational cost of pair-wise alignment, which makes it impractical for very large EST datasets. This has motivated the introduction of alignment-free distance measures for EST clustering. Established EST clustering methods such as d2_cluster, wcd and PEACE apply alignment-free distance measures. They run faster than alignment-based methods while retaining acceptable clustering accuracy. In EST clustering, it is common to combine a windowing strategy with the alignment-free distance measure, and some distance measures also use heuristics to speed up the comparisons. Consequently, their clustering results can vary significantly from one dataset to another. Clustering performance is excellent when the distance measure can efficiently detect and quantify the features present in the dataset, but it can be poor on a dataset with different characteristics that the distance measure fails to capture and quantify correctly.
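
Illustration: The summary does not say which distance measures the thesis evaluates, so the following is only a minimal sketch of one well-known alignment-free measure, a d2-style word-count distance combined with a windowing strategy. The function names (`kmer_counts`, `d2_distance`, `windows`, `windowed_d2`) and the parameter values are hypothetical, chosen for a toy input, and are not taken from the thesis or from d2_cluster, wcd or PEACE.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count the overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_distance(counts_a, counts_b):
    """Squared Euclidean distance between two word-count vectors."""
    words = counts_a.keys() | counts_b.keys()
    # Counter returns 0 for a word it has not seen.
    return sum((counts_a[w] - counts_b[w]) ** 2 for w in words)


def windows(seq, size, step):
    """Return the sliding windows of seq; a short seq is its own window."""
    if len(seq) <= size:
        return [seq]
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, step)]


def windowed_d2(a, b, k=3, window=7, step=1):
    """Smallest d2 distance over all pairs of windows from a and b.

    Assumption for this sketch: every window pair is compared
    exhaustively, with no speed-up heuristics.
    """
    profiles_a = [kmer_counts(w, k) for w in windows(a, window, step)]
    profiles_b = [kmer_counts(w, k) for w in windows(b, window, step)]
    return min(d2_distance(pa, pb) for pa in profiles_a for pb in profiles_b)


if __name__ == "__main__":
    # Toy sequences: the end of est_1 overlaps the start of est_2.
    est_1 = "ACGTACGTTTGACCA"
    est_2 = "TTGACCAGGCTAGCT"
    print(windowed_d2(est_1, est_2))
    print(windowed_d2("AAAAAAAAAA", "CCCCCCCCCC"))
```

Two ESTs would then be placed in the same cluster when the windowed distance falls below a chosen threshold. No alignment is computed: only word counts are compared, which is where the speed advantage over alignment-based methods comes from. The window size and word length determine which overlaps the measure can detect, which is one reason results can differ between datasets.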