
Research On Gene Promoter Prediction Methods Based On Machine Learning

Posted on: 2023-01-02
Degree: Master
Type: Thesis
Country: China
Candidate: M Wang
Full Text: PDF
GTID: 2544306776478294
Subject: Computer Science and Technology
Abstract/Summary:
The gene promoter is a nucleotide sequence located near the transcription initiation site. It binds RNA polymerase to ensure accurate gene transcription. Promoters play a key role in research fields such as gene regulation and therapy, targeted drug research and development, and kinship identification. Developing efficient computational methods for recognizing promoters in genome sequences is therefore an important scientific problem. Traditional biological experimental methods are time-consuming and laborious. Many computational promoter identification methods proposed in recent years are efficient and convenient, but as genome research has deepened, the morphology, structure, and function of genes across species have proved increasingly complex, and existing methods run into computational bottlenecks. To address these issues, this thesis explores novel gene promoter prediction methods based on machine learning in order to improve prediction performance and provide a new approach for promoter-related research, with theoretical value and application prospects for gene regulation research and targeted drug development. The main contents and conclusions are as follows:

(1) Research on a multi-source feature fusion method for promoters

Existing promoter feature extraction methods each rely on a single kind of feature and therefore cannot fully characterize promoters. To solve this problem, a multi-source promoter feature fusion method was designed. Sequence features were extracted by statistical analysis, and deep features were extracted by a deep learning model. A feature fusion and feature selection method based on XGBoost was designed and implemented to determine the optimal feature subset. The experimental results showed that the multi-source feature fusion method, by combining the advantages of heterogeneous features, could fully represent promoter features, which provides a solid basis for constructing a promoter prediction model and for cross-species prediction analysis.

(2) Research on the construction method of the promoter prediction model

To address the limited performance and generalization ability of existing promoter prediction models, the promoter prediction model Predpromoter-MF was constructed based on machine learning. After preprocessing common prokaryotic and eukaryotic promoter datasets, a multi-layer binary classification scheme was formulated for the multi-class classification of E. coli promoters, whose classes have unbalanced sample sizes. Based on the deep forest model, a two-layer prediction framework was constructed to identify promoters and then classify their types or structures. The model was interpreted and visualized using the SHAP method. The experimental results showed that the accuracy of Predpromoter-MF was 1.04%, 0.05%, 10.96%, 1.66%, 6.83%, and 3.21% higher than that of the state-of-the-art methods on the training sets of B. subtilis, E. coli, human, mouse, Drosophila, and Arabidopsis, respectively.

(3) Research on a prediction and analysis method for cross-species promoters

With the development of model organisms and cross-species testing strategies in the era of genome research, a cross-species promoter prediction and analysis method was constructed based on machine learning. Data enhancement models were designed based on human and mouse promoter datasets to construct enhanced datasets. The independent-test performance of several machine learning models was evaluated, and the parameters of the best-performing algorithm, NGBoost, were optimized to train the CPPM (Cross-species Promoter Prediction Model). The enrichment distribution of human and mouse promoters was visualized by sample sequence analysis. The experimental results showed that, on the independent test sets of human TATA-box promoters and of all promoters, the accuracy of CPPM was 15.08% and 9.12% higher, respectively, than that of the mainstream cross-species prediction method. The sequence visualization results verified the feasibility of CPPM and provided a new perspective for analyzing promoters and other regulatory elements.
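
As an illustration of the multi-layer binary classification scheme mentioned in part (2), the following is a minimal sketch, not the thesis implementation. It assumes the scheme is a cascade: a first binary layer separates promoters from non-promoters, and each later layer separates one promoter type from all remaining types. The function name `cascade_predict`, the type labels, and the toy motif rules are hypothetical stand-ins for trained models such as the deep forest classifiers the abstract refers to.

```python
from typing import Callable, List, Tuple

# A binary classifier maps a DNA sequence to True/False.
BinaryClassifier = Callable[[str], bool]


def cascade_predict(
    seq: str,
    is_promoter: BinaryClassifier,
    type_layers: List[Tuple[str, BinaryClassifier]],
    last_type: str,
) -> str:
    """Classify a sequence with a cascade of binary decisions."""
    # Layer 1: promoter vs. non-promoter.
    if not is_promoter(seq):
        return "non-promoter"
    # Later layers: one type vs. all remaining types, in a fixed order.
    for label, is_type in type_layers:
        if is_type(seq):
            return label
    # A promoter rejected by every type layer gets the final remaining type.
    return last_type


# Toy stand-ins for trained models (hypothetical rules, not real predictors).
def toy_is_promoter(seq: str) -> bool:
    return "TATAAT" in seq


def toy_is_type_a(seq: str) -> bool:
    return "TTGACA" in seq


def toy_is_type_b(seq: str) -> bool:
    # True when more than half of the bases are G or C.
    return seq.count("G") + seq.count("C") > len(seq) / 2


TYPE_LAYERS = [("type_A", toy_is_type_a), ("type_B", toy_is_type_b)]

if __name__ == "__main__":
    examples = ["ACGTACGT", "TTGACAAATATAAT", "GCGCGCGCTATAAT", "ATATATAAT"]
    for s in examples:
        print(s, cascade_predict(s, toy_is_promoter, TYPE_LAYERS, "type_C"))
```

Each layer in such a cascade only has to solve a two-class problem. One plausible reason to use it with unbalanced sample sizes is that the layers can be ordered so that each binary split is less skewed than the full multi-class problem. The ordering used in the thesis is not stated in the abstract.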
Keywords/Search Tags: Promoter, Feature Fusion, Deep Forest, NGBoost, Cross-species Testing