Computational analysis of thermal denaturation differences and prediction of coding and non-coding eukaryotic DNA sequences

Posted on: 2004-07-15
Degree: Ph.D.
Type: Dissertation
University: University of Massachusetts Lowell
Candidate: Long, Dang Duc
Full Text: PDF
GTID: 1458390011453431
Subject: Biology
Abstract/Summary:
In recent years, the exponential growth in genetic sequence data has offered an unprecedented opportunity for a new understanding of biology, as well as many great challenges in using that data. Computational analysis has become essential in the investigation of genetic DNA sequences, from their biophysical properties to the information encoded in them.

In Chapter Two of this dissertation, I used computational modeling and statistical analysis to investigate the thermal denaturation (melting) of eukaryotic DNA sequences, specifically the relationship between the melting temperature (Tm) and the base and sequence content in different regions of the sequences. Using the program MELTSIM, which simulates DNA melting based on a nearest neighbor thermodynamic model, I demonstrated that the Tm vs. FGC (mole fraction of the bases G and C) relationships in coding and non-coding DNAs are both linear but differ by a statistically significant amount (6.6%) in their slopes. By comparing these results to the simulation results from various base shufflings of the original DNAs, and to the average nearest neighbor frequencies of the natural sequences across the FGC range, I showed that the differences in the Tm vs. FGC relationships are a direct result of systematic FGC-dependent biases in nearest neighbor frequencies for the coding and non-coding DNA classes. The same differences in the Tm vs. FGC relationships and biases in nearest neighbor frequencies also appear, at smaller magnitudes, between DNA sequences from multicellular and unicellular organisms within the same coding or non-coding class.

Chapter Three of this dissertation explores the use of biases in the oligonucleotide frequencies of DNA regions, measured in a biologically relevant 3-base repeating frame along the DNA, as inputs to neural networks (NNs) and support vector machines (SVMs) that predict the coding or non-coding class of any DNA sequence. Using three public standard sequence datasets composed of coding and non-coding DNA sequences, I tested coding versus non-coding classification with mono-, di-, and tri-nucleotide frequencies calculated in the 3-base repeating frame and represented as matrix elements (3 x 4, 3 x 16, and 3 x 64 matrices). These frequencies were calculated by three different functions for three different sequence lengths (54, 108, and 162 base pairs). Overall, the prediction accuracy increases as the sequence length and the size of the 3-base frame frequency matrix increase. The highest total correct prediction rates for both methods across the different sequence lengths are relatively high, from about 77% to 98%. The NN method gave relatively high sensitivity, from 66% to 95%, but lower specificity, from 35% to 66%, on the three sequence datasets. Based on results from the one dataset tested, the SVM method showed a significant improvement in prediction accuracy over the NN method (a 10% to 25% improvement in the correlation coefficient value). The implications of the 3-base frame dependent oligonucleotide frequencies as coding measures, and the application of NNs and SVMs to the coding versus non-coding prediction problem, are discussed.
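
As an illustration of the 3-base frame frequency matrices described above (3 x 4, 3 x 16, and 3 x 64), the following is a minimal Python sketch. It is not the dissertation's code: the function name `frame_kmer_frequencies` is hypothetical, and the row-wise relative frequency used here is only an assumed stand-in, since the abstract does not specify the three frequency functions that were actually used.

```python
from itertools import product

BASES = "ACGT"

def frame_kmer_frequencies(seq, k):
    """Return a 3 x 4**k matrix of k-mer frequencies.

    Row r holds the relative frequencies of the k-mers that start at
    positions p with p % 3 == r. Normalizing each row to sum to 1 is an
    assumption made for this sketch.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [[0] * len(kmers) for _ in range(3)]
    for start in range(len(seq) - k + 1):
        col = index.get(seq[start:start + k])
        if col is not None:  # skip k-mers with ambiguous bases such as N
            counts[start % 3][col] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Example: a made-up 54-base sequence, the shortest length in the abstract.
example = "ATGGCC" * 9
mono = frame_kmer_frequencies(example, 1)  # 3 x 4
di = frame_kmer_frequencies(example, 2)    # 3 x 16
tri = frame_kmer_frequencies(example, 3)   # 3 x 64
```

Flattening one of these matrices gives a fixed-length feature vector (12, 48, or 192 values) that could serve as input to a classifier such as an NN or SVM.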
Keywords/Search Tags: DNA, Coding, Prediction, Frequencies, Computational