
Research And Application Of DNA Clustering Algorithm Based On Intelligent Algorithm

Posted on: 2011-06-20; Degree: Master; Type: Thesis
Country: China; Candidate: L Zhang; Full Text: PDF
GTID: 2120360308465015; Subject: Management Science and Engineering
Abstract/Summary:
With the continuous development of modern biotechnology, and especially the implementation of the Human Genome Project, researchers have accumulated large quantities of gene sequence data. The functions of only a small fraction of these sequences are known; the functions of most genes remain unknown. Clustering, a data mining technique, is capable of analyzing gene data at this scale. The gene sequences are clustered into classes, and because sequences in the same class have similar functions, the functions of unknown sequences can be inferred from the known ones. Clustering analysis is now widely used in bioinformatics research. The key question in clustering biological sequences is how to characterize the similarity between sequences. The linear arrangement of the sequence data itself sometimes fails to reflect the degree of similarity, so in some cases similarity measures fail, which degrades the quality of the clustering results. A similarity measure designed entirely from the sequence itself therefore will not yield clustering results that agree with biological observations, which complicates the study of the evolution of DNA sequences. As research on graphical representations of DNA sequences deepened, Randic first proposed using such representations to study the clustering of gene sequences. Following this idea, sequences can be clustered using mathematical features extracted from their graphical representations. Building on an existing two-dimensional graphical representation based on base symmetry, I made some improvements and propose a new graphical representation method for DNA sequences.
The improved method saves space and reflects some biological features of DNA sequences more clearly. According to the mapping rules, each DNA sequence is translated into three two-dimensional curves; feature matrices are then extracted from the curves, and the DNA sequences are clustered using the matrix invariants. In this way, each DNA sequence is transformed into a multi-dimensional data point, and the clustering of DNA sequences becomes the clustering of multi-dimensional data. Common clustering algorithms for multi-dimensional data usually require the number of clusters k to be given in advance. In most cases, however, k cannot be determined in advance, so the best number of clusters must be found by optimization. In this paper, I use a clustering algorithm based on particle swarm optimization (PSO). Because the PSO-based clustering algorithm cannot determine the number of clusters k by itself, I use the k-means algorithm together with a constructed cluster validity function to obtain the best number of clusters. Tests demonstrate the effectiveness of the cluster validity function in determining the best number of clusters, and the introduction of class weights allows the function to be better applied to real data analysis.
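The pipeline described in this abstract (sequence to curves, curves to a feature vector, then choosing k with k-means and a validity function) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, since the abstract does not give the details: the three base splits in `SPLITS` (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding) stand in for the thesis's mapping rules, `features` is a toy descriptor rather than the matrix invariants, `validity` is a simple scatter-to-separation ratio without class weights, and the PSO step is omitted entirely. All function names are hypothetical.

```python
import math

# Assumed mapping (not taken from the thesis): three two-class splits of the bases.
SPLITS = {
    "purine_pyrimidine": {"A": 1, "G": 1, "C": -1, "T": -1},
    "amino_keto": {"A": 1, "C": 1, "G": -1, "T": -1},
    "weak_strong": {"A": 1, "T": 1, "C": -1, "G": -1},
}

def curves(seq):
    """Map a non-empty DNA string (A, C, G, T only) to three 2D curves."""
    result = {}
    for name, step in SPLITS.items():
        y, points = 0, []
        for x, base in enumerate(seq.upper(), start=1):
            y += step[base]  # walk up or down by one unit per base
            points.append((x, y))
        result[name] = points
    return result

def features(seq):
    """Toy 6-number descriptor; a stand-in for the thesis's matrix invariants."""
    vec = []
    for points in curves(seq).values():
        ys = [y for _, y in points]
        vec += [sum(ys) / len(ys), ys[-1] / len(ys)]
    return vec

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest(point, centers):
    """Index of the closest center (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: dist2(point, centers[j]))

def kmeans(points, k, iters=100):
    """Plain k-means with deterministic farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    labels = [nearest(p, centers) for p in points]
    return labels, centers

def validity(points, labels, centers):
    """Mean within-cluster scatter over smallest center separation (lower is better)."""
    within = sum(dist2(p, centers[lab]) for p, lab in zip(points, labels)) / len(points)
    sep = min(dist2(centers[i], centers[j])
              for i in range(len(centers)) for j in range(i + 1, len(centers)))
    return within / sep if sep > 0 else math.inf

def best_k(points, k_max):
    """Run k-means for k = 2..k_max and return the k with the lowest validity score."""
    scores = {}
    for k in range(2, k_max + 1):
        labels, centers = kmeans(points, k)
        scores[k] = validity(points, labels, centers)
    return min(scores, key=scores.get), scores
```

In use, one would call `features(seq)` on each sequence to build the list of points and then `best_k(points, k_max)` to pick the number of clusters. Farthest-first initialization is chosen here only to keep the sketch deterministic.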
Keywords/Search Tags:DNA sequence, graphical representation, Particle Swarm Algorithm, Clustering Optimization