Research On Enhancer And Promoter Type Recognition Based On Sequence Information

Posted on: 2019-02-17
Degree: Master
Type: Thesis
Country: China
Candidate: K Li
GTID: 2370330590973920
Subject: Computer Science and Technology
Abstract/Summary:
More and more industries are entering the data-driven era, and biology is no exception. Over the past ten years, the volume of biological sequence data has grown tremendously, which has driven vigorous development in many fields of biology. Within these fields, many problems still need in-depth research, such as enhancer recognition, protein remote homology detection, and promoter recognition. Mining hidden features from massive collections of biological sequences is therefore an effective way to explore the structure and function of genes. Traditional methods for identifying enhancers and promoters rely on biological experiments, which are time-consuming and labor-intensive and cannot keep up with research needs. This thesis therefore extracts the sequence information of enhancers and promoters, applies different feature extraction methods to mine that information from different aspects, and combines them with machine learning algorithms to construct models for research and analysis. The main contents of this thesis are as follows.

First, we propose iEnhancer-EL, a method for identifying enhancers and their strength that is based on sequence information and an ensemble learning strategy. Feature vectors are extracted with different feature extraction methods, and a model is constructed from each with the support vector machine algorithm. The models are then clustered, and the key models are selected from the clusters. Finally, a linear weighted ensemble method is used to build the ensemble model. The final model is compared with other methods using multiple performance measures, and iEnhancer-EL outperforms the existing leading methods for identifying enhancers and their strength.

Second, we propose two methods based on a smoothing strategy, iPromoter-Kmer and iPromoter-PseKNC. Both methods exploit the differences in conservation values among the sequences in the benchmark dataset to divide each DNA sequence into several subsequences. A feature vector is extracted from each subsequence, and the resulting vectors are merged linearly. Two feature extraction methods are used: the Kmer method and the pseudo k-tuple nucleotide composition method. Adjusting the hyperparameters of each feature extraction method yields different feature vectors; a model is constructed for each with the support vector machine algorithm, and the model with the best prediction performance is selected. The identification performance of both methods is superior to that of the existing leading methods.

Third, for identifying promoters and their types, we propose iPromoter-2L2.0, which is based on ensemble learning and a sliding window strategy. Within every subsequence obtained by the smoothing strategy, a sliding window is used to further mine the local information of the sequence. The Kmer method and the pseudo k-tuple nucleotide composition method are then combined with the support vector machine algorithm to construct multiple models. The models are clustered using the improved metrics, and the key models are selected for ensemble learning. iPromoter-2L2.0 achieves better detection performance than iPromoter-Kmer and iPromoter-PseKNC.
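The idea of extracting Kmer features on each subsequence and merging them linearly can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `kmer_frequencies` and `segmented_kmer_features` are hypothetical, the cut positions are passed in by hand rather than derived from conservation values, "merging linearly" is assumed here to mean simple concatenation, and each Kmer vector is assumed to be a normalized frequency vector over the alphabet ACGT.

```python
from itertools import product


def kmer_frequencies(seq, k):
    """Return the normalized k-mer frequency vector of seq (4**k entries, ACGT order)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    total = 0
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # k-mers with other symbols (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]


def segmented_kmer_features(seq, boundaries, k):
    """Split seq at the given cut positions and concatenate per-segment k-mer vectors.

    `boundaries` is a hypothetical input; the thesis derives its split from
    conservation values, which this sketch does not attempt to reproduce.
    """
    cuts = [0] + list(boundaries) + [len(seq)]
    features = []
    for start, end in zip(cuts, cuts[1:]):
        features.extend(kmer_frequencies(seq[start:end], k))
    return features


if __name__ == "__main__":
    # Two segments ("AAAA" and "CCCC"), k = 1: one 4-entry vector per segment.
    print(segmented_kmer_features("AAAACCCC", [4], 1))
```

The concatenated vector has one block of 4**k entries per subsequence, so its length grows with both k and the number of segments. Such a vector could then be passed to a support vector machine, as the abstract describes.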
Keywords/Search Tags:Enhancer identification, Promoter identification, Distance between models, Window-split algorithm based on smoothing strategy, Ensemble learning