Computational analysis of thermal denaturation differences and prediction of coding and non-coding eukaryotic DNA sequences

Posted on:2004-07-15

Degree:Ph.D

Type:Dissertation

University:University of Massachusetts Lowell

Candidate:Long, Dang Duc

Full Text:PDF

GTID:1458390011453431

Subject:Biology

Abstract/Summary:

In recent years, the exponential growth of genetic sequence data has offered an unprecedented opportunity for a new understanding of biology, as well as many great challenges in using that data. Computational analysis has become essential to the investigation of DNA sequences, from their biophysical properties to the information they encode.

In Chapter Two of this dissertation, I used computational modeling and statistical analysis to investigate the thermal denaturation (melting) of eukaryotic DNA sequences, specifically the relationship between the melting temperature (T_m) and the base and sequence content in different regions of sequences. Using the program MELTSIM, which simulates DNA melting based on a nearest-neighbor thermodynamic model, I demonstrated that the T_m vs. F_GC (mole fraction of the bases G and C) relationships in coding and non-coding DNAs are both linear but differ by a statistically significant amount (6.6%) in their slopes. By comparing these results with simulation results from various base shufflings of the original DNAs, and with the average nearest-neighbor frequencies of the natural sequences across the F_GC range, I showed that these differences in the T_m vs. F_GC relationships are a direct result of systematic F_GC-dependent biases in nearest-neighbor frequencies for the coding and non-coding DNA classes. Similar differences in the T_m vs. F_GC relationships and biases in nearest-neighbor frequencies also appear, with smaller magnitudes, between DNA sequences from multicellular and unicellular organisms within the same coding or non-coding class.

Chapter Three of this dissertation explores the use of biases in oligonucleotide frequencies of DNA regions, measured in a biologically relevant 3-base repeating frame along the DNA, as inputs to neural networks (NNs) and support vector machines (SVMs) to predict the coding or non-coding class of any DNA sequence.
Using three standard public datasets of coding and non-coding DNA sequences, I tested coding versus non-coding classification based on mono-, di-, and tri-nucleotide frequencies calculated in the 3-base repeating frame and represented as matrices (3 x 4, 3 x 16, and 3 x 64). These frequencies were calculated by three different functions for three different sequence lengths (54, 108, and 162 base pairs). Overall, prediction accuracy increases as the sequence length and the size of the 3-base frame frequency matrix increase. The highest total correct prediction rates for both methods across the different sequence lengths are relatively high, from about 77% to 98%. The NN method gave relatively high sensitivity, from 66% to 95%, but lower specificity, from 35% to 66%, on the three sequence datasets. Based on results from the one dataset on which it was tested, the SVM method showed a significant improvement in prediction accuracy over the NN method (a 10% to 25% improvement in the correlation coefficient). The implications of the 3-base frame-dependent oligonucleotide frequencies as coding measures, and the application of NNs and SVMs to the coding versus non-coding prediction problem, are discussed.
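
As an illustration of the 3 x 4, 3 x 16, and 3 x 64 matrices described above, the following is a minimal sketch of one way to compute frame-specific oligonucleotide frequencies. The abstract does not specify the three functions used in the dissertation, so the normalization here (each frame position's counts divided by that frame's total) is an assumption, and the name frame_frequencies is hypothetical.

```python
from itertools import product

BASES = "ACGT"

def frame_frequencies(seq, k):
    """Return a 3 x 4**k matrix of k-mer frequencies, one row per frame position.

    Row p holds the frequencies of k-mers that start at positions i with
    i % 3 == p. Normalizing each row by its own total is an assumption,
    not necessarily the dissertation's definition.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    index = {kmer: j for j, kmer in enumerate(kmers)}
    counts = [[0] * len(kmers) for _ in range(3)]
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows with ambiguous bases such as N
            counts[i % 3][j] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Example: a 54-base sequence and trinucleotides give a 3 x 64 matrix.
m = frame_frequencies("ATG" * 18, 3)
print(len(m), len(m[0]))
```

Flattening such a matrix row by row would give a fixed-length feature vector (12, 48, or 192 values) that a classifier such as an NN or SVM could take as input.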

Keywords/Search Tags:

DNA, Coding, Prediction, Frequencies, Computational

Related items

1	A computational tool for the prediction of small non-coding RNA in genome sequences
2	Research On Key Techniques Of Video Coding
3	Research On Fast Inter-prediction Algorithm For HEVC Based On Visual Saliency
4	Research On Vector Quantization For Linear Predictive Coefficients In Embedded Variable Bit Rate Speech Coding
5	Research On Fast Intra Prediction Algorithm With High Efficiency Video Coding
6	Research On Video Coding Optimization For HEVC Based On Visual Saliency
7	Research On QP Adjustment Algorithm Based On Prediction Mode Of Coded Prediction Block in AVS2
8	Research On Key Techniques Of Mobile Audio Coding And Decoding
9	Content Based High Performance Intra Coding Study
10	Research On Inter Prediction And Transform Coding In Next Generation Video Coding Standard