
Research On RNA Sequence Function Prediction Based On Position Specific Mismatch

Posted on: 2021-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: J He
Full Text: PDF
GTID: 2370330611499439
Subject: Computer Science and Technology
Abstract/Summary:
The rapid development of information technology and sequencing technology, together with the advancement of the Human Genome Project, has provided massive amounts of biological data for exploring the mysteries of life. Finding efficient, low-cost methods to analyze these data is one of the difficulties of genomics. RNA plays a very important role in various biological functions, and predicting sgRNA activity and N6-methyladenosine sites are important problems in RNA sequence function prediction. Research on RNA sequence function prediction will advance the fields of genome editing and epigenetics. sgRNA and N6-methyladenosine sequences contain base position information, evolutionary information, and motif information. This paper therefore proposes feature extraction methods tailored to these sequence characteristics and combines them with machine learning algorithms to build models for research and analysis. The contents of this paper are as follows.

In the recognition of sgRNA on-target activity, existing predictors consider only the short-range sequence information of the sequences. This paper proposes a Position Specific Mismatch (PSM) method that combines the Position Specific (PS) and Mismatch methods to capture the long-range sequence information and evolutionary information of the sgRNA sequence. The PSM feature vector is combined with the XGBoost algorithm to construct prediction models on two datasets. The results show that the model has good prediction performance and generalizes across genes and cells. Feature analysis shows that the important features mostly cover the Protospacer Adjacent Motif (PAM) sequence pattern, indicating that the PSM method captures the base position information and motif information in the sgRNA sequence well.

Because the information carried by different regions of the sgRNA sequence differs in importance and the dataset is imbalanced, this paper proposes the two-window-based PSM (2wPSM) and SCORE-SVM-SMOTE methods. The 2wPSM feature vector and the support vector machine algorithm are combined to construct the sgRNA-2wPSM model, and the SCORE-SVM-SMOTE method is used to balance the dataset and further improve the performance of the model. The results show that the prediction performance of the sgRNA-2wPSM method is better than that of sgRNA-PSM and sgRNA-ExPSM. Heat maps are used to analyze the importance of features in the front and rear windows and the nucleotide preference at each position in the sequence, which supports the window-partition strategy and the effectiveness of the 2wPSM feature extraction method. The nucleotide preferences at each position in the sgRNA sequences are discussed and verified.

In the recognition of N6-methyladenosine sites, to address the lack of word segmentation methods and of prediction methods based on word embedding, this paper proposes the Mismatch, Loop Variable Kmer, and Motif word segmentation methods, based on PSM, Kmer, and motifs respectively. Four models are then constructed using word embedding and convolutional neural networks. Principal component analysis is used to analyze the RNA word relationships of the single classification models. The analysis results show that an integrated strategy can further improve the prediction performance of the model. An integrated model named Ensemble2Vec is therefore proposed, which performs better on both test sets.
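
The abstract does not give the exact definition of the PSM feature vector, so the following is only a minimal sketch of one plausible reading: for every position in the sequence and every possible k-mer, the feature is 1 when the k-mer starting at that position matches within at most m mismatches. The function name `psm_features`, the parameters `k` and `m`, and the `ACGU` alphabet are assumptions made for illustration and are not taken from the thesis.

```python
from itertools import product

# Assumed RNA alphabet; sgRNA data is often written with T instead of U.
ALPHABET = "ACGU"


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def psm_features(seq, k=2, m=1, alphabet=ALPHABET):
    """Hypothetical position-specific mismatch encoding (illustrative only).

    For each start position i and each k-mer w over the alphabet, emit 1 if
    seq[i:i+k] is within Hamming distance m of w, else 0.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    features = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        features.extend(1 if hamming(window, w) <= m else 0 for w in kmers)
    return features


vec = psm_features("GACU", k=2, m=1)
print(len(vec), sum(vec))
```

Under these assumptions the vector has one block of 4^k entries per position, so a model such as XGBoost or an SVM can attribute importance to a specific k-mer at a specific position. Setting `m=0` reduces the encoding to a plain position-specific one-hot k-mer representation.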
Keywords/Search Tags: RNA sequence function prediction, sgRNA on-target activity, N6-methyladenosine site, Position Specific Mismatch, word embedding