The Graphical Representation Of DNA Sequences And The Application Research Of Clustering Analysis

Posted on: 2008-03-26, Degree: Master, Type: Thesis
Country: China, Candidate: Y C Zhou
GTID: 2120360242965291, Subject: Computer application technology
Abstract/Summary:
With the rapid development of biology and of research on protein sequences, more and more molecular sequence data are being generated. Analyzing these data yields information about biological structure and function. Bioinformatics mainly deals with complex computations on gene and protein sequences, using methods from mathematics and computer science. Data mining techniques, especially clustering, are an important means of analyzing gene sequences. This thesis focuses on the graphical representation of gene sequences and on the application of clustering techniques based on that representation.
In this thesis, a novel non-degenerate 3-D graphical representation is presented. The new 3-D graph avoids overlapping and crossing without losing biological information, and it retains the main biological characteristics of the original sequence. To construct the sequence matrix, the geometric center is introduced. Each gene sequence is then characterized by the maximum eigenvalue of its sequence matrix.
Clustering analysis of the graphical representation data is the primary content of this thesis. We introduce the pseudo F-statistic and propose a dynamic fuzzy K-means clustering technique. This technique ensures that the final clustering result has the smallest trace of the within-cluster scatter matrix, partitions points in multi-dimensional space into a specified number of clusters, and determines the best number of clusters. We construct graphical representation data for H5N1 gene sequences to test the clustering method; the results show that it is reasonable to cluster gene sequences on features extracted from their graphical representation.
BIRCH is a new clustering algorithm for large datasets, but it has some defects. To address them, we improve the threshold in the CF-tree, basing it on the sum of squared deviations, to improve the pertinence between clusters. The split factor is defined by the maximum diameter, which overcomes the drawback of choosing the factor from experience. Finally, we apply the improved BIRCH clustering algorithm to a preliminary analysis of the gene graphical representation data.
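The role of the pseudo F-statistic in choosing a cluster number can be illustrated with a minimal sketch. The abstract does not give the thesis's exact formula, so the code below assumes the common Calinski-Harabasz form, (trace of between-cluster scatter / (k - 1)) divided by (trace of within-cluster scatter / (n - k)), and it works on hard labels (for a fuzzy partition, one would first assign each point to its highest-membership cluster). The names `pseudo_f` and `best_labeling` are hypothetical and do not come from the thesis.

```python
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def centroid(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pseudo_f(points: Sequence[Point], labels: Sequence[Hashable]) -> float:
    """Pseudo F-statistic (assumed Calinski-Harabasz form) of a hard partition."""
    n = len(points)
    clusters: Dict[Hashable, List[Point]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("pseudo F needs 2 <= k < n")

    overall = centroid(points)
    within = 0.0   # trace of the within-cluster scatter matrix
    between = 0.0  # trace of the between-cluster scatter matrix
    for members in clusters.values():
        c = centroid(members)
        within += sum(sq_dist(p, c) for p in members)
        between += len(members) * sq_dist(c, overall)

    if within == 0.0:
        return float("inf")  # every point coincides with its cluster center
    return (between / (k - 1)) / (within / (n - k))


def best_labeling(points: Sequence[Point],
                  candidates: Sequence[Sequence[Hashable]]) -> Sequence[Hashable]:
    """Pick the candidate partition with the largest pseudo F-statistic."""
    return max(candidates, key=lambda labels: pseudo_f(points, labels))


# Illustrative data only: two tight groups far apart along the first axis.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes the two groups
chosen = best_labeling(pts, [good, bad])
```

In this sketch a larger value means tighter, better-separated clusters, so running a clustering method for several candidate cluster numbers and keeping the partition with the largest value is one way to select the cluster number.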
Keywords/Search Tags:gene sequence, graphical representation, Pseudo F-Statistics, Fuzzy clustering, BIRCH algorithm