
Application Of Sequence Complexity Features In Prediction Of Regulatory Elements

Posted on: 2019-07-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C C Wu
GTID: 1360330548453420
Subject: Bioinformatics
Abstract/Summary:
The Human Genome Project concluded that more than 98 percent of the human genome consists of non-coding regions. Research from the ENCODE project and the Roadmap Epigenomics project demonstrated that these non-coding regions contain many DNA regulatory elements, including DNA methylation sites, promoters and enhancers. DNA regulatory elements regulate gene expression precisely by activating and repressing transcription. On the one hand, DNA regulatory elements participate in the regulation of transcription because of their sequence specificity; on the other hand, a DNA sequence can be regarded mathematically as a finite string over a finite alphabet, so studying the complexity of DNA sequences can uncover some of this sequence specificity. These facts motivated us to discriminate DNA sequences using mathematical tools applied to sequence complexity.

In the first part of this research, we described the mathematical definitions of two kinds of sequence complexity in detail, studied the corresponding algorithms, and selected the effective features. First, we computed the raw factor complexity features for different lengths. We then selected the topological entropy using the second-order difference. Finally, we determined the effective factor complexity features. In parallel, we studied the definition and algorithms of abelian complexity and screened the effective abelian complexity features.

In the second part, we focused on building a model that predicts the level of DNA methylation. We obtained experimental DNA methylation data from human embryonic cells as training data and extracted the CpG sites from the whole data set. After extending each CpG site to a sequence of appropriate length, we extracted factor complexity features and basic DNA composition features, and then built a support vector machine model that predicts the methylation level of CpG sites with 94.7 percent accuracy. The model is more accurate than analogous models. Finally, a statistical test comparing the experimental data and the predicted data on functional regions indicated that the predictions for CpG sites are reliable.

In the third part, another model was built with abelian complexity features to predict enhancer regions. We took enhancer regions from the FANTOM 5 project as training data, extracted abelian complexity features, and built the prediction model with random forest. The model trained with a 1:1 ratio of positive to negative samples reached an accuracy of 93.1%, and the model trained with a 1:10 ratio reached an accuracy of 96.0%. Compared with similar models, the model built with abelian complexity achieved better accuracy. Finally, we applied the model to scan human chromosome 22 with a step of 100 bp and obtained 5,123 potential enhancers. Validation against histone modification data from different cell lines and tissues showed that the best accuracy of the predicted enhancers was 42.8%.

In conclusion, we built accurate prediction models with sequence complexity features and basic sequence composition features to study DNA methylation and enhancer regions. Genome-wide scanning results from the prediction models could narrow the scope and reduce the difficulty of biological experiments, and they provide reference and guidance for related research. The prediction models could also help analyze the transcriptional regulatory mechanisms of complex diseases and complete the annotation of functional elements in the human genome.
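The two complexity measures named in the abstract can be illustrated with a minimal Python sketch. It uses the standard combinatorics-on-words definitions: factor complexity counts the distinct length-n substrings of a sequence, and abelian complexity counts the distinct letter-count vectors among those substrings. The function names, the `ACGT` alphabet ordering and the example sequence are assumptions made for illustration; the dissertation's own feature extraction, including its topological entropy selection step, is not specified in the abstract and is not reproduced here.

```python
from collections import Counter


def factor_complexity(seq, n):
    """Count the distinct length-n substrings (factors) of seq."""
    return len({seq[i:i + n] for i in range(len(seq) - n + 1)})


def abelian_complexity(seq, n, alphabet="ACGT"):
    """Count the distinct letter-count vectors over all length-n factors.

    Two factors are abelian-equivalent when they contain the same number
    of each letter, regardless of order (e.g. "AC" and "CA").
    """
    vectors = set()
    for i in range(len(seq) - n + 1):
        counts = Counter(seq[i:i + n])
        vectors.add(tuple(counts[c] for c in alphabet))
    return len(vectors)


def complexity_profile(seq, max_n):
    """Hypothetical helper: per-length (factor, abelian) complexity pairs."""
    return [
        (factor_complexity(seq, n), abelian_complexity(seq, n))
        for n in range(1, max_n + 1)
    ]


if __name__ == "__main__":
    # Illustrative sequence only, not data from the dissertation.
    print(complexity_profile("AACA", 3))
```

For the toy sequence `AACA` and n = 2, the factors are `AA`, `AC` and `CA`, so the factor complexity is 3, while `AC` and `CA` share one letter-count vector, so the abelian complexity is 2. A profile of such values over several lengths is one plausible way to turn a DNA window into a numeric feature vector for a classifier.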
Keywords/Search Tags: DNA methylation, Enhancer, Factor complexity, Abelian complexity, Prediction model