Research On Essential Genes Recognition Based On Sequence Information

Posted on: 2021-05-10
Degree: Master
Type: Thesis
Country: China
Candidate: J H Chen
Full Text: PDF
GTID: 2370330611498853
Subject: Computer Science and Technology
Abstract/Summary:
With the maturation of high-throughput sequencing technology, biology has entered a data-driven era, and bioinformatics, as an emerging discipline, has developed rapidly. Essential genes play a key role in life activities, and identifying and analyzing them from massive biological sequence data is an important task in bioinformatics research. Traditional methods based on biological experiments are time-consuming and labor-intensive, so essential gene identification based on machine learning has become an active research direction in this field. This thesis focuses on the problem of essential gene identification and studies the essential genes of two kinds of species, Archaea and human. Feature vectors are extracted from the sequence composition and position information of essential genes in Archaea and human, and machine learning algorithms are then used to construct the prediction models. The main research contents of this thesis are as follows.

In the study of essential gene identification in Archaea, to address the shortcomings of existing feature extraction methods, a new feature extraction method called ZCPse KNC is proposed, which extracts base composition and position information from essential gene sequences. The XGBoost algorithm is then used to calculate feature importance and select discriminative features, and a support vector machine (SVM) is used to train the prediction model for essential genes in Archaea. Finally, the data imbalance problem is analyzed and discussed, and three oversampling methods are applied to balance the dataset. The experimental results show that the proposed method achieves satisfactory performance in identifying essential genes in Archaea.

In the study of essential gene identification in human, the dataset imbalance is more severe than in Archaea. To address this, a new oversampling method called CSMOTE, based on a clustering strategy, is proposed. The features of essential genes are extracted with the ZCPse KNC method, and the SVM-RFE+CBR algorithm is used to select features. Finally, an SVM is applied to construct the human essential gene prediction model. The experimental results show that the proposed CSMOTE method improves the prediction accuracy of the model for human essential genes, which gives the model greater practical value.

To handle base substitution in sequences in the human essential gene recognition task, a new feature extraction method called cps Mismatch is proposed to capture base substitution information, and the cps Mismatch features are combined with cps Kmer features. The features are then filtered, and the CSMOTE method is used for oversampling. A bootstrap aggregating (Bagging) strategy is applied to further improve performance: an ensemble learning model with support vector machines as the base classifier is constructed to predict essential genes in human. The experimental results show that the prediction model proposed in this thesis achieves better overall performance in identifying human essential genes.
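
The abstract does not define ZCPse KNC or CSMOTE, so the sketch below does not reproduce either method. It only illustrates, under stated assumptions, the two generic ideas they build on: turning a DNA sequence into a composition feature vector (plain normalized k-mer frequencies here) and balancing a minority class by interpolating between neighbouring minority samples (a simplified SMOTE-style step using a single nearest neighbour, with no clustering). The function names `kmer_frequencies` and `interpolate_oversample` are hypothetical and chosen for illustration.

```python
import itertools
import random


def kmer_frequencies(seq, k=2):
    """Return normalized k-mer frequencies over ACGT, in lexicographic order.

    Assumption: plain k-mer composition, not the thesis's ZCPse KNC features.
    """
    kmers = ["".join(p) for p in itertools.product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def interpolate_oversample(minority, n_new, seed=0):
    """Create n_new synthetic minority samples by linear interpolation.

    Assumption: a simplified SMOTE-style step (one nearest neighbour,
    no clustering), not the thesis's CSMOTE method.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Nearest other minority sample by squared Euclidean distance.
        neighbour = min(
            others,
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    features = kmer_frequencies("ACGTACGTNNACGT", k=2)
    print(len(features))
    extra = interpolate_oversample([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]], n_new=3)
    print(len(extra))
```

In a pipeline like the one the abstract describes, feature vectors of this kind would be computed per gene, the minority (essential) class would be oversampled on the training split only, and a classifier such as an SVM would then be trained on the balanced data.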
Keywords/Search Tags: essential genes identification, Archaea, human, oversampling, ensemble learning