Data mining of biological sequences

Posted on:2006-01-23

Degree:Ph.D

Type:Dissertation

University:University of Illinois at Chicago

Candidate:Liu, Libin


GTID:1458390005492499

Subject:Mathematics

Abstract/Summary:

Data mining concepts and techniques hold considerable promise for biological sequence analysis. This study applies exploratory data analysis, descriptive modeling, and predictive modeling to biological sequences.

DNA sequences are displayed visually in a polar coordinate system. This representation eliminates the degeneracy that caused problems in previous graphical representations, and it also conveys information about nucleotides and amino acids.

To cluster biological sequences, feature vectors are developed that map each sequence to a point in twelve-dimensional space. Several protein families have been tested in this space: members of the same family cluster together, and members of different families stay apart.

To find genes in genomic sequences, the boundaries between exons and introns must be located. Inhomogeneous Markov chain models are used to discriminate acceptor sites (intron-to-exon boundaries) from donor sites (exon-to-intron boundaries). Each DNA sequence is divided into m sections, and a transition matrix is estimated for each section from the training data. A discriminator formula is then set up to predict the boundary type of test DNA data. The accuracy of this two-state system is up to 96%.

The three-state system can discriminate among donor, acceptor, and neither. The DNA sequences are again divided into m sections and preprocessed through inhomogeneous Markov chain models before being fed into a neural network. The network is a three-layer feed-forward network with 3m neurons in the input layer and 3 neurons in the output layer; indicator vectors represent the three states. This three-state predictive model has been tested on a primate gene dataset, where its accuracy is as high as 98%, much higher than that of previous work.
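
The two-state splice-site step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the discriminator formula, so this example assumes a plain log-likelihood comparison between a donor model and an acceptor model. It also assumes first-order transitions, sections of nearly equal width, and add-one smoothing. All function names and the toy sequences are hypothetical and are not taken from the dissertation.

```python
import math
from typing import List, Sequence

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}

Matrix = List[List[float]]


def section_bounds(length: int, m: int) -> List[range]:
    """Split the length-1 transition positions into m nearly equal sections (assumed layout)."""
    n_trans = length - 1
    edges = [round(k * n_trans / m) for k in range(m + 1)]
    return [range(edges[k], edges[k + 1]) for k in range(m)]


def fit_transitions(seqs: Sequence[str], m: int, pseudo: float = 1.0) -> List[Matrix]:
    """Estimate one 4x4 transition matrix per section from aligned, equal-length sequences."""
    models = []
    for sec in section_bounds(len(seqs[0]), m):
        # Start every count at the pseudocount so no probability is zero.
        counts = [[pseudo] * 4 for _ in range(4)]
        for s in seqs:
            for i in sec:
                counts[IDX[s[i]]][IDX[s[i + 1]]] += 1
        models.append([[c / sum(row) for c in row] for row in counts])
    return models


def section_loglik(seq: str, models: List[Matrix]) -> List[float]:
    """Log-likelihood of seq within each section, using that section's matrix."""
    bounds = section_bounds(len(seq), len(models))
    return [
        sum(math.log(P[IDX[seq[i]]][IDX[seq[i + 1]]]) for i in sec)
        for sec, P in zip(bounds, models)
    ]


def classify(seq: str, donor: List[Matrix], acceptor: List[Matrix]) -> str:
    """Assumed discriminator: pick the class whose model gives the higher total log-likelihood."""
    d = sum(section_loglik(seq, donor))
    a = sum(section_loglik(seq, acceptor))
    return "donor" if d > a else "acceptor"


# Toy training windows, invented purely for illustration.
donors = ["AAGGTAAG", "CAGGTGAG", "AAGGTAAG"]
acceptors = ["TTTCAGGT", "CTTTAGGC", "TTCCAGGT"]

donor_models = fit_transitions(donors, m=2)
acceptor_models = fit_transitions(acceptors, m=2)
label = classify("AAGGTAAG", donor_models, acceptor_models)
```

Here `section_loglik` returns m numbers per class. Concatenating these scores for three classes (donor, acceptor, neither) would give 3m values, which matches the input layer size stated in the abstract, but the abstract does not confirm that the network inputs were built this way.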

Keywords/Search Tags:

Data, Biological, Sequences

Related items

1	Data mining of biological sequences
2	Research On Algorithm Of Comparing Bio-sequences Similarity
3	Research And Implementation For Similarity Search Algorithm Of Biological Sequences
4	Research And Application Of Index On Biological Sequences
5	Comparative analysis of biological sequences through information visualization
6	Research And Implementation Of Compress Storage Of Biological Sequences And Indices
7	Management of biological sequences using suffix trees
8	The Study Of Sequences With Low(ODD) Even Correlation
9	Design And Application Of Kernels For Biological Sequences
10	Bio-data Organizing And Processing Based On XML