
Identifying E. coli And Human Promoters Based On Sequence Information And Structure Information

Posted on: 2021-05-04    Degree: Master    Type: Thesis
Country: China    Candidate: D H Guo    Full Text: PDF
GTID: 2370330620476581    Subject: Physics
Abstract/Summary:
A promoter is a DNA sequence that binds RNA polymerase and initiates gene transcription; it can also regulate gene expression by interacting with transcription factors (TFs). Identifying and analyzing promoters therefore supports epigenetic research, pathway analysis, and the functional annotation of genomes. E. coli is one of the model organisms of the Human Genome Project. Its genetic structure is relatively simple compared with that of the human genome, so studying it helps in understanding gene function in the human genome and advances the Human Genome Project. Researchers have accordingly used both experimental and bioinformatics methods to identify E. coli and human promoters. Because experimental methods cost considerable time and money, more studies now predict promoters with bioinformatics methods. Most of these methods target E. coli σ54 and σ70 promoters, while little work has been done on σ38 promoters. The σ38 factor is involved in the general stress response of bacteria and is extremely sensitive to changes in the cellular environment, so identifying E. coli σ38 promoters helps in understanding how the cell responds to environmental changes. In this work, we therefore propose a method for predicting σ38 promoters of the model organism E. coli and apply the same method to human gene promoters; better prediction results are obtained on both datasets.

A dataset of E. coli σ38 promoters was built and a human promoter dataset was downloaded. Because the structural properties of DNA play a key role in gene function, choosing appropriate structural features is also very helpful for promoter recognition. A position correlation scoring function (PCSF) is therefore constructed by combining the position correlation probability matrix with the six DNA structural parameters. Promoter sequences are identified by the scoring difference, and better prediction performance is achieved. Next, the algorithms are fused by feeding the scoring difference as a feature parameter into a support vector machine (SVM), which significantly improves prediction performance. Sequence features of the promoters were also extracted: k-mer composition, GC content, CG+GC dinucleotide content, and base distance information. The promoters were predicted using the SVM algorithm with 5-fold cross-validation.

Finally, we analyzed in detail the influence of the various feature parameters on prediction performance. For E. coli σ38 promoter sequences, fusing Rise structural information with GC content information gives a prediction accuracy of 96.07% in the jackknife test. For human promoter sequences, fusing Slide structural information with 4-mer composition information and CG+GC dinucleotide information gives a prediction accuracy of 94.56% in the jackknife test. The analysis demonstrates that the interaction energy between neighbouring dinucleotides, as represented by the six DNA structural parameters, can better recognize E. coli σ38 promoters and human promoters.
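
Three of the sequence features named in the abstract (k-mer composition, GC content, and CG+GC dinucleotide content) can be illustrated with a minimal Python sketch. The abstract does not give the thesis's exact feature definitions, so the normalizations below (frequencies over valid windows) and all function names are assumptions made for illustration, not the author's implementation.

```python
from itertools import product


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (assumed definition)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def kmer_composition(seq: str, k: int) -> list[float]:
    """Frequencies of all 4**k k-mers, in lexicographic ACGT order.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption; the source does not say how they are handled).
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def cg_gc_dinucleotide_frequency(seq: str) -> float:
    """Fraction of overlapping dinucleotides that are CG or GC (assumed)."""
    seq = seq.upper()
    n = len(seq) - 1
    if n <= 0:
        return 0.0
    hits = sum(seq[i:i + 2] in ("CG", "GC") for i in range(n))
    return hits / n


def sequence_features(seq: str, k: int = 4) -> list[float]:
    """Concatenate the three feature groups into one vector."""
    return (
        kmer_composition(seq, k)
        + [gc_content(seq), cg_gc_dinucleotide_frequency(seq)]
    )
```

With k = 4 this gives a vector of 256 k-mer frequencies plus two scalar features per sequence, which could then be passed to any SVM implementation as input.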
Keywords/Search Tags: promoters, position correlation probability matrix, position correlation scoring function (PCSF), DNA structural information, DNA sequence feature