
Recognizing Promoters Based on Sequence-Order and Position-Correlation Information

Posted on: 2019-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Lai
Full Text: PDF
GTID: 2310330563954136
Subject: Biophysics
Abstract/Summary:
A promoter is a DNA element located around the transcription start site that regulates gene transcription. It commonly consists of a core promoter region and regulatory regions. During RNA synthesis, the promoter interacts with the transcription factors responsible for gene transcription and thereby controls the timing of initiation and the expression level of the gene. Promoter recognition is therefore of great significance for determining transcription units, studying gene structure, analyzing gene regulation mechanisms, and annotating gene function. Traditional methods for accurately identifying gene promoters generally involve complex biological experiments, and recognizing promoters experimentally on a genomic scale is extremely time-consuming and costly. The large amount of promoter data accumulated in recent years, together with advances in sequencing technology, makes promoter recognition by computational approaches feasible. Many models have already been proposed to predict promoters based on sequence similarity, conservation, signal motifs, nucleotide composition, and so on. However, the predictive power of these methods is limited and their classification accuracy leaves room for improvement. Hence, this work puts forward a novel sample description method to improve predictive capability.

In this thesis, we constructed promoter benchmark datasets for five species: Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, Bacillus subtilis, and Escherichia coli. The short-range nucleotide composition information, the long-range physicochemical correlation information, and the position-correlation information of 3-mer oligonucleotides in promoter sequences were extracted to calculate the pseudo k-tuple nucleotide composition (PseKNC) and the position-correlation scoring function (PCSF), which were used to formulate promoter samples. To eliminate the noisy and redundant information generated by the different features, we adopted the Minimum Redundancy Maximum Relevance (mRMR) algorithm to rank all features and then used incremental feature selection to find the optimal feature subsets, that is, those producing the maximum accuracies. A support vector machine (SVM) was used for classification.

Ten-fold cross-validation showed that the accuracies for H. sapiens, D. melanogaster, C. elegans, B. subtilis, and E. coli were 93.3%, 93.9%, 95.7%, 95.2%, and 93.1%, respectively, and that the areas under the ROC curves (AUCs) for the five species were 0.974, 0.975, 0.981, 0.988, and 0.976, respectively. Comparison with published results showed that our promoter prediction models outperformed the other models. Finally, for the convenience of other researchers, an online web server implementing the novel promoter identification method (http://lin-group.cn/server/iPro-PseKNC) was established based on our models.
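As an illustration of two steps named in the abstract, the following is a minimal Python sketch, not the thesis implementation. It covers only the short-range 3-mer composition part of PseKNC (the physicochemical correlation terms and PCSF are omitted because the abstract does not define them) and a generic incremental feature selection loop. The function names, the handling of ambiguous bases, and the `evaluate` callback are assumptions made for the example; in the thesis setting, `evaluate` would presumably be a cross-validated SVM accuracy and `ranked` the mRMR ordering.

```python
from itertools import product


def kmer_composition(seq, k=3):
    """Return normalized k-mer frequencies, ordered lexicographically over ACGT.

    Assumption: windows containing characters other than A/C/G/T are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[w] / total for w in kmers]


def incremental_feature_selection(ranked, evaluate):
    """Add ranked features one at a time and keep the best-scoring prefix.

    `ranked` is a list of feature indices, best first (e.g. from mRMR).
    `evaluate` is a hypothetical callback mapping a feature subset to a score.
    Ties are resolved in favor of the smaller subset.
    """
    best_score, best_subset = float("-inf"), []
    for n in range(1, len(ranked) + 1):
        subset = ranked[:n]
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset, best_score
```

With k = 3 the composition vector has 4^3 = 64 entries per sequence; a feature matrix built this way could then be ranked and passed through the selection loop before training a classifier.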
Keywords/Search Tags: eukaryotic and prokaryotic promoters, pseudo k-tuple nucleotide composition (PseKNC), position-correlation scoring function (PCSF), Minimum Redundancy Maximum Relevance (mRMR), Support Vector Machine (SVM)