
The Theoretical Prediction Research Of Replication Origins Of Genome Based On Physicochemical Properties Of The Sequence

Posted on: 2017-04-17
Degree: Master
Type: Thesis
Country: China
Candidate: W C Li
GTID: 2180330485988307
Subject: Biophysics
Abstract/Summary:
During the transfer of genetic information, the genome is copied and passed from parent to offspring, so that the offspring display genetic traits similar to those of the parent. Gene mutation and genetic recombination also readily occur during this process, which has therefore attracted the attention of many biologists. Accurately identifying DNA replication origins is the first step in the study of replication and is crucial for investigating the transfer of genetic information. Existing methods, both domestic and international, have several shortcomings: redundant datasets, neglect of the physical characteristics of DNA sequences and their long-range correlations, and a lack of online services.

In this thesis, the DNA replication origins (ORIs) of yeast, a unicellular eukaryotic model organism, were first studied using bioinformatics methods. Features were extracted from the nucleotide composition and the physicochemical properties of DNA sequences, and a prediction model was constructed using machine learning and statistical algorithms. Based on the model, a web server was established and used to scan the whole yeast genome for potential ORIs, and the distribution of ORIs in the yeast genome was analyzed. We then extended the approach to the prediction of ORIs in the human genome and achieved encouraging results.

Firstly, the experimentally validated ORIs of the yeast genome were selected as the positive dataset, while the sequences upstream of the ORIs were used as the negative dataset. The CD-HIT software was used to remove redundant sequences. Secondly, we used a feature extraction method called pseudo k-tuple nucleotide composition (PseKNC), which incorporates six structural parameters of dinucleotides and can reflect the intrinsic correlation between local/global features and the ORI sequences.
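The PseKNC feature extraction described above can be sketched as follows. This is a minimal illustration of the commonly published form of PseKNC (normalized k-tuple frequencies followed by weighted sequence-order correlation factors), not the thesis's actual implementation. The function name `pseknc`, the defaults for `k`, `lam` and `w`, and the one-property table `toy_props` (GC count per dinucleotide) are assumptions made for illustration; the abstract does not give the values of the six structural parameters, so they are not reproduced here.

```python
from itertools import product


def pseknc(seq, props, k=2, lam=2, w=0.5):
    """Return a PseKNC-style feature vector of length 4**k + lam.

    props maps every dinucleotide to a list of property values
    (assumed to be standardized beforehand). k, lam and w are
    illustrative defaults, not values taken from the thesis.
    """
    seq = seq.upper()
    n = len(seq)
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    if n < k or n - lam - 1 < 1:
        raise ValueError("sequence too short for the chosen k and lam")

    # Part 1: normalized occurrence frequency of every k-tuple.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(n - k + 1):
        counts[seq[i:i + k]] += 1
    total = n - k + 1
    freqs = [counts[m] / total for m in kmers]

    # Part 2: correlation factors between dinucleotides j steps apart.
    def coupling(d1, d2):
        p1, p2 = props[d1], props[d2]
        return sum((a - b) ** 2 for a, b in zip(p1, p2)) / len(p1)

    thetas = []
    for j in range(1, lam + 1):
        terms = [coupling(seq[i:i + 2], seq[i + j:i + j + 2])
                 for i in range(n - j - 1)]
        thetas.append(sum(terms) / len(terms))

    # Joint normalization so that the whole vector sums to 1.
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


# Hypothetical single-property table: number of G/C bases per dinucleotide.
toy_props = {a + b: [float((a in "GC") + (b in "GC"))]
             for a in "ACGT" for b in "ACGT"}

features = pseknc("ATGCGCATTAGC", toy_props)
print(len(features))
```

With `k=2` and `lam=2` the vector has 16 composition components and 2 correlation components. In a real model, `props` would hold the six standardized structural parameters for each dinucleotide.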
Finally, the support vector machine (SVM), a powerful algorithm, was used to perform the prediction. In the jackknife cross-validation test, the overall prediction accuracy was 83.72%. To further improve the accuracy, features derived from sequence cleavage and blend were added to the prediction model, which raised the overall accuracy to 84.09%. By investigating the distributions of the six parameters in the positive and negative samples, we found that the distributions of rise, slide and tilt differed between the two kinds of samples. The proposed model was then used to scan the whole yeast genome, and the results showed that 385 ORIs were correctly identified (an overall accuracy of 93.9%). Based on the model, an online tool named “iOri-PseKNC” was constructed; it is freely accessible at http://lin.uestc.edu.cn/server/iOri-PseKNC. We also analyzed the distribution of ORIs in the yeast genome and statistically analyzed the correlation between ORIs and nucleosomes, promoters and genes. The results showed that the ORIs always appear in nucleosome-free regions. For over 31.46% of the 5015 promoter sequences, the distance between the ORI and the transcription start site (TSS) is less than 500 bp.

To explore the ORIs of complex multicellular eukaryotes, and to further validate the versatility of our algorithm, the method was applied to the human genome. An accuracy of 63.01% was obtained with SVM; using the Weka tool, the accuracy was improved to 75.04%. Sequence cleavage and blend were also investigated in the positive and negative samples, and a difference between them was observed.
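The jackknife (leave-one-out) test mentioned above can be illustrated with a short sketch. Each sample is held out in turn, a classifier is trained on the remaining samples, and the held-out sample is predicted; the overall accuracy is the fraction predicted correctly. To keep the example dependency-free, a nearest-centroid classifier stands in for the SVM used in the thesis. The function names and the toy data are assumptions for illustration only.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict_nearest_centroid(train_x, train_y, query):
    """Stand-in classifier (NOT an SVM): pick the label whose class
    centroid is closest to the query in squared Euclidean distance."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        c = centroid(members)
        dist = sum((a - b) ** 2 for a, b in zip(c, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def jackknife_accuracy(xs, ys, predict=predict_nearest_centroid):
    """Leave each sample out once, train on the rest, and return the
    fraction of held-out samples that are predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if predict(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)


# Toy two-dimensional feature vectors; 1 = ORI, 0 = non-ORI (made up).
toy_x = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
         [1.0, 0.9], [0.9, 1.0], [0.8, 0.9]]
toy_y = [1, 1, 1, 0, 0, 0]

print(jackknife_accuracy(toy_x, toy_y))
```

Any function with the same `(train_x, train_y, query)` signature can be passed as `predict`, so an SVM from a machine learning library could be substituted for the stand-in without changing the validation loop.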
Keywords/Search Tags: yeast, origin of replication, support vector machines, online service