The Research On Gene Sequences Clustering And Classification

Posted on:2007-05-15

Degree:Master

Type:Thesis

Country:China

Candidate:J H Wu

Full Text:PDF

GTID:2178360185965735

Subject:Computer software and theory

Abstract/Summary:

With the continuous development of modern biotechnology, especially the implementation of the Human Genome Project, large quantities of gene sequence data have been accumulated. It has become necessary to analyze these data accurately and efficiently and to mine potentially useful information from them. Clustering and classification are two main methods for analyzing large volumes of gene data. This thesis focuses on clustering and classification algorithms for gene sequence data.

K-means is a common clustering algorithm that repeatedly reassigns class members so that the members of each class have minimum dispersion, in order to obtain the best clustering result. We discuss a double K-means, model-based algorithm for modeling and clustering gene sequence data using hidden Markov models (HMMs). First, an initial K-means clustering is performed based on a biological property: the ratios of the four nucleotides tend to be consistent across homologous gene sequences. Second, the results of this first clustering are used as input to train HMMs that represent the characteristics of the sequences well. Finally, a model-based K-means approach is applied to cluster the sequences again, which gives the new algorithm better clustering quality.

Based on a study of the distribution rules of microbial nucleotides, this thesis also discusses a method for clustering microbial gene sequence data using genetic characteristics. First, each gene sequence is divided into a number of arithmetic sample segments. Second, clustering is performed according to the genetic characteristic values of the sample segments. This is an ingenious and objective clustering method with high reliability. Experimental results show that the method is feasible and achieves comparatively good clustering quality.

When classifying gene sequences, if the categories in the training data are incomplete, general classification methods will miss classes. To address this problem, this thesis proposes several new model measurement methods that combine the particular arrangement and structural features of gene sequences, in order to obtain a threshold for dynamically adjusting the number of categories based on the distance matrix among models. These methods overcome the limitation of artificially setting the number of labeled classes as the actual number of classes, and reduce the negative influence that incomplete training categories have on the iterative training of the models. They thereby solve the problem of missing classes caused by incomplete categories in the training data.
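The first stage of the double K-means procedure described in the abstract (an initial K-means clustering on the ratios of the four nucleotides) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `nucleotide_ratios` and `kmeans`, the farthest-first choice of starting centroids, and the toy sequences are all assumptions made for illustration, and the later HMM training and model-based re-clustering stages are not shown.

```python
import math
from collections import Counter


def nucleotide_ratios(seq):
    """Return the fractions of A, C, G and T in a DNA sequence (hypothetical helper)."""
    counts = Counter(ch for ch in seq.upper() if ch in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T characters")
    return [counts[base] / total for base in "ACGT"]


def kmeans(points, k, iters=100):
    """Plain Lloyd-style K-means on lists of floats.

    Assumption: starting centroids are chosen farthest-first so the
    sketch is deterministic; the thesis does not specify this.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(far)

    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy data (invented for illustration): two AT-rich and two GC-rich sequences.
sequences = ["ATATATTAAT", "AATTATATTA", "GCGCGGCCGC", "GGCCGCGCCG"]
features = [nucleotide_ratios(s) for s in sequences]
labels, centroids = kmeans(features, k=2)
```

In a pipeline of the kind the abstract outlines, the resulting `labels` would serve only as the starting partition from which one HMM per cluster is trained.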

Keywords/Search Tags:

clustering, classification, gene sequences, hidden Markov models, k-means

Related items

1	Computational Methods of Hidden Markov Models With Respect To CpG Island Prediction in DNA Sequences
2	A machine learning approach to query time-series microarray data sets for functionally related genes using hidden Markov models
3	Research On The Clustering Analysis Algorithms In Bioinformatics
4	Modeling And Control Of Networked Control Systems Based On Hidden Markov Models
5	The Research Of Texture Image Segmentation Algorithm Based On Fuzzy C-means And The Coupled Hidden Markov Random Field Models
6	Moving Target Trajectory Classification And Identification Of Research
7	Classification of textures using noncausal hidden Markov models
8	The Research For Clustering Methods Based On Hidden Markov Models For Trajectories
9	Segmented chirp features and hidden Gauss-Markov models for transient signal classification
10	Pulse Classification Based On Hidden Markov Model