Analyzing DNA Sequences Using Clustering Algorithm

Data mining offers promising concepts and techniques for DNA sequence analysis. This study applies an exploratory data analysis method to cluster DNA sequences. Feature vectors were developed that map each DNA sequence to a point in a twelve-dimensional space. Lysozyme, M...

Bibliographic Details
Main Author: Alhersh, Taha Talib Ragheb
Format: Thesis
Language: eng
Published: 2009
Subjects: QA76 Computer software
Online Access: https://etd.uum.edu.my/1913/1/Taha_Taleb_Ragheb_Alhersh.pdf
https://etd.uum.edu.my/1913/2/1.Taha_Taleb_Ragheb_Alhersh.pdf
id my-uum-etd.1913
record_format uketd_dc
institution Universiti Utara Malaysia
collection UUM ETD
language eng
topic QA76 Computer software
description Data mining offers promising concepts and techniques for DNA sequence analysis. This study applies an exploratory data analysis method to cluster DNA sequences. Feature vectors were developed that map each DNA sequence to a point in a twelve-dimensional space. The lysozyme, myoglobin and rhodopsin protein families were tested in this space. The comparison shows that homologous sequences have characterization vectors that lie close together and are easily distinguished from those of non-homologous sequences, provided the experiment uses a fixed sequence size that does not exceed the length of the shortest DNA sequence. The main advantage of this work is global comparison: multiple DNA sequences are represented simultaneously in the genomic space and compared directly through the distances between their characteristic vectors. The novelty is that a new DNA sequence does not have to be compared against the full length of every other sequence; the comparison uses a fixed number of positions from each sequence, chosen so that it does not exceed the length of the new sequence. In other words, part of a DNA sequence can identify its function and cluster it with its family members.
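The mapping and distance comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which twelve features the thesis uses, so the choice below is an assumption: for each of the four nucleotides, the count, the mean position, and the variance of positions within a fixed-length prefix. The function names (feature_vector, euclidean, nearest_family) and the toy sequences are hypothetical and are not taken from the thesis.

```python
import math

NUCLEOTIDES = "ACGT"


def feature_vector(seq, length):
    """Map the first `length` bases of `seq` to a 12-dimensional vector.

    ASSUMPTION (not confirmed by the abstract): the features are, for each
    of A, C, G and T, the count, the mean 1-based position, and the
    variance of the positions inside the prefix.
    """
    prefix = seq.upper()[:length]
    vec = []
    for base in NUCLEOTIDES:
        positions = [i + 1 for i, b in enumerate(prefix) if b == base]
        n = len(positions)
        mean = sum(positions) / n if n else 0.0
        var = sum((p - mean) ** 2 for p in positions) / n if n else 0.0
        vec.extend([float(n), mean, var])
    return vec


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_family(new_seq, labelled, length=None):
    """Return the family label of the closest labelled sequence.

    `labelled` is a list of (sequence, family) pairs. When `length` is
    omitted, the shortest sequence involved sets the prefix length, so the
    comparison never exceeds the length of any sequence.
    """
    if length is None:
        length = min([len(new_seq)] + [len(s) for s, _ in labelled])
    target = feature_vector(new_seq, length)
    best = min(labelled,
               key=lambda item: euclidean(target, feature_vector(item[0], length)))
    return best[1]


if __name__ == "__main__":
    # Toy sequences for illustration only; these are not real gene data.
    known = [("AAAACCCC", "family_1"), ("GGGGTTTT", "family_2")]
    print(nearest_family("AAAACCCT", known))
```

Because counts, mean positions and variances are on different scales, a practical version would probably normalize each feature before computing distances. A clustering step such as hierarchical clustering could then be run on the resulting distance matrix.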
format Thesis
qualification_name masters
qualification_level Master's degree
author Alhersh, Taha Talib Ragheb
title Analyzing DNA Sequences Using Clustering Algorithm
granting_institution Universiti Utara Malaysia
granting_department College of Arts and Sciences (CAS)
publishDate 2009
url https://etd.uum.edu.my/1913/1/Taha_Taleb_Ragheb_Alhersh.pdf
https://etd.uum.edu.my/1913/2/1.Taha_Taleb_Ragheb_Alhersh.pdf
_version_ 1747827231075336192