The Research On Gene Sequences Clustering And Classification

Posted on: 2007-05-15
Degree: Master
Type: Thesis
Country: China
Candidate: J H Wu
Full Text: PDF
GTID: 2178360185965735
Subject: Computer software and theory
Abstract/Summary:
With the continuing development of modern biotechnology, and especially the implementation of the Human Genome Project, large quantities of gene sequence data have been acquired. It is therefore necessary to analyze these data accurately and efficiently and to mine potentially useful information from them. Clustering and classification are two of the main methods for analyzing large volumes of gene data. This thesis focuses on clustering and classification algorithms for gene sequence data.

K-means is a common clustering algorithm that reassigns class members so as to minimize the dispersion within each class and thereby obtain the best clustering result. This thesis discusses a double K-means, model-based algorithm for modeling and clustering gene sequence data using hidden Markov models (HMMs). First, the biological property that the ratios of the four nucleotides tend to be consistent across homologous gene sequences is used to perform an initial K-means clustering of the gene sequence data. Second, the results of this first clustering are used as input to train a set of HMMs that characterize the sequences well. Finally, a model-based K-means approach is applied to cluster the data again, which gives the new algorithm better clustering quality.

Based on a study of the distribution rules of microbial nucleotides, this thesis also discusses a method for clustering the gene sequence data of microorganisms using genetic characteristics. First, each gene sequence is divided into a number of arithmetic sample segments. Second, clustering is performed according to the genetic characteristic values of the sample segments. This is an ingenious and objective clustering method with high reliability. Experimental results show that the method is feasible and achieves comparatively good clustering quality.

When gene sequences are classified and the categories in the training data are incomplete, general classification methods will miss classes. To address this problem, this thesis proposes several new measures of distance between models that combine the particular arrangement and structural features of gene sequences. These measures yield a threshold, derived from the distance matrix among the models, that is used to adjust the number of categories dynamically. The new methods overcome the limitation of artificially setting the number of labeled classes as the actual number of classes, and they reduce the negative influence of incomplete training categories on the iterative training of the models. This resolves the problem of missing classes caused by incomplete categories in the training data.
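
The first stage of the double K-means scheme described above (an initial K-means clustering on the ratios of the four nucleotides) can be illustrated with a minimal Python sketch. The abstract does not give implementation details, so everything here is an assumption made for illustration: the function names (`composition`, `kmeans`, `cluster_by_composition`), the use of squared Euclidean distance, and the random choice of initial centroids are all hypothetical. The later HMM training and model-based re-clustering stages are not shown.

```python
from collections import Counter
import random

NUCLEOTIDES = "ACGT"


def composition(seq):
    """Return the fractions of A, C, G and T in a sequence (other symbols are ignored)."""
    counts = Counter(ch for ch in seq.upper() if ch in NUCLEOTIDES)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(NUCLEOTIDES)
    return [counts[n] / total for n in NUCLEOTIDES]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct input points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_by_composition(sequences, k, seed=0):
    """Cluster sequences by their four-nucleotide composition vectors."""
    return kmeans([composition(s) for s in sequences], k, seed=seed)
```

For example, `cluster_by_composition(["AAAATTTTAA", "ATATATATTA", "GGGCCCGCGC", "GCGCGGCCGG"], 2)` places the two AT-rich sequences in one cluster and the two GC-rich sequences in the other. In the scheme the abstract describes, such labels would only serve as the starting partition for training the HMMs.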
Keywords/Search Tags: clustering, classification, gene sequences, hidden Markov models, K-means