Data mining of biological sequences

Posted on: 2006-01-23
Degree: Ph.D
Type: Dissertation
University: University of Illinois at Chicago
Candidate: Liu, Libin
Full Text: PDF
GTID: 1458390005492499
Subject: Mathematics
Abstract/Summary:
Data mining concepts and techniques hold great promise for biological sequence analysis. This study applies exploratory data analysis, descriptive modeling, and predictive modeling to biological sequences.

DNA sequences are displayed visually in a polar coordinate system. This representation completely eliminates the degeneracy that caused problems in previous graphical representations, and it also conveys information about nucleotides and amino acids.

To cluster biological sequences, feature vectors were developed that map each sequence to a point in twelve-dimensional space. Several protein families were tested in this space: members of the same family cluster together, while members of different families stay apart.

Finding genes in genomic sequences requires locating the boundaries between exons and introns. To discriminate acceptor sites (boundaries from intron to exon) from donor sites (boundaries from exon to intron), inhomogeneous Markov chain models are used. Each DNA sequence is divided into m sections, and a transfer (transition) matrix is estimated for each section from the training data. A discriminator formula is then set up to predict the boundary type of the test DNA data. The two-state system achieves an accuracy of up to 96%.

The three-state system can discriminate among donor sites, acceptor sites, and neither. The DNA sequences are divided into m sections and preprocessed with inhomogeneous Markov chain models before being fed into a neural network. The network is a three-layer feed-forward network with 3m neurons in the input layer and 3 neurons in the output layer. Indicator vectors represent the three states. This three-state predictive model was tested on a primate gene dataset. Its accuracy is as high as 98%, much higher than that of previous work.
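
The two-state idea described above (one transition matrix per section, then a discriminator) can be illustrated with a minimal Python sketch. The abstract does not give the dissertation's actual discriminator formula, so this sketch assumes a plain log-likelihood ratio between a donor model and an acceptor model. The equal-width sectioning, the add-one pseudocount, all function names, and the toy sequences are likewise assumptions made only for illustration.

```python
import math

ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def section_of(t, n_transitions, m):
    """Map transition index t (0-based) to one of m contiguous sections.

    Equal-width sectioning is an assumption; the source only says
    the sequence is divided into m sections.
    """
    return min(m - 1, t * m // n_transitions)


def train(sequences, m, pseudocount=1.0):
    """Estimate one 4x4 transition matrix per section.

    All sequences must have the same length. The pseudocount (an
    assumption) keeps unseen transitions from getting probability 0.
    """
    n_transitions = len(sequences[0]) - 1
    counts = [[[pseudocount] * 4 for _ in range(4)] for _ in range(m)]
    for seq in sequences:
        for t in range(n_transitions):
            s = section_of(t, n_transitions, m)
            counts[s][INDEX[seq[t]]][INDEX[seq[t + 1]]] += 1
    # Normalise each row so that it sums to 1.
    return [[[c / sum(row) for c in row] for row in matrix]
            for matrix in counts]


def log_likelihood(seq, model):
    """Sum of log transition probabilities, using each section's matrix."""
    m = len(model)
    n_transitions = len(seq) - 1
    total = 0.0
    for t in range(n_transitions):
        s = section_of(t, n_transitions, m)
        total += math.log(model[s][INDEX[seq[t]]][INDEX[seq[t + 1]]])
    return total


def classify(seq, donor_model, acceptor_model):
    """Assumed discriminator: sign of the log-likelihood ratio."""
    score = log_likelihood(seq, donor_model) - log_likelihood(seq, acceptor_model)
    return "donor" if score > 0 else "acceptor"


# Toy, made-up training windows (not data from the dissertation).
donor_train = ["AAGGTAAG", "CAGGTGAG", "AAGGTAAG"]
acceptor_train = ["TTTCAGGC", "CTTTAGGT", "TTCCAGAT"]

donor_model = train(donor_train, m=2)
acceptor_model = train(acceptor_train, m=2)

print(classify("AAGGTAAG", donor_model, acceptor_model))
print(classify("TTTCAGGC", donor_model, acceptor_model))
```

For the three-state system, one plausible reading of the 3m input neurons is m per-section scores from each of three such models (donor, acceptor, neither); the abstract does not spell this out, so that reading is also an assumption.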
Keywords/Search Tags: Data, Biological, Sequences