
Prediction And Implementation Of Subcellular Localization Of LncRNA Based On Multi-source Heterogeneous Feature Fusion

Posted on: 2022-09-15
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Feng
GTID: 2480306329498914
Subject: Computer technology
Abstract/Summary:
Long non-coding RNA (lncRNA) is a class of non-coding RNA. Many studies have shown that lncRNA plays an important and direct role in human gene transcription regulation, cell growth, differentiation, reproduction, and other life activities. At the same time, the relatively low sequence conservation of lncRNA makes its function harder to study. At present, differential expression analysis of lncRNA and co-expression analysis between lncRNA and proteins can effectively reveal lncRNA function, but traditional wet-lab methods are expensive and time-consuming and often require stringent experimental conditions. Some statistical methods predict efficiently but suffer from a high false positive rate.

The cell is the basic unit of life, and each organelle has its own responsibilities in life activities, so determining which organelle an lncRNA localizes to is a good way to predict its function. Computational methods can predict the subcellular localization of lncRNA more efficiently and thereby support functional analysis. Because there are many kinds of organelles, subcellular localization of lncRNA is a multi-class classification problem. Owing to the lack of experimental data, the number of lncRNA sequences for some subcellular locations is extremely small, so the number of sequences varies greatly between locations. This data imbalance leads to poor recognition of the small classes by machine learning models, which remains a difficult and challenging problem in existing research.

To describe lncRNA sequences from a global and multi-level perspective, a sequence-based computational tool was constructed to predict the subcellular localization of lncRNA. The multi-source heterogeneous features of lncRNA sequences include k-tuple composition, basic lncRNA features, physicochemical properties, and multi-scale secondary structure. To explore how effective these features are for subcellular localization and how well they represent the data under different machine learning models, we tested them with a variety of models: three traditional machine learning models (support vector machine, random forest, and logistic regression), two boosting-based ensemble frameworks (XGBoost and LightGBM), and two deep learning frameworks (deep neural network and convolutional neural network). The experimental results show that different features capture different information about lncRNA and describe lncRNA sequences at different levels and from different perspectives.

To eliminate the prediction bias caused by data imbalance and to improve the representation of small-sample classes, a variety of feature selection methods are used to process the features further. Because the features differ in dimension and redundancy, they are divided into two categories. For the original 8-mer features, a filtering method based on the binomial distribution is used; for the remaining features, which are combined with high-level features extracted by autoencoders, the recursive feature elimination algorithm is applied for further feature filtering. The effectiveness of the feature selection method is assessed by testing the two feature groups and their combination with different machine learning models. The results show that the method improves data representation and mitigates the class imbalance and poor small-sample prediction performance of the multi-class problem.

Finally, this thesis proposes a prediction method for lncRNA subcellular localization based on multi-source heterogeneous feature fusion. It uses a multi-layer, grouped feature enhancement and screening scheme and builds a prediction model with a support vector machine (SVM) as the predictor. The model comprises a nucleic acid sequence scanning input, four feature extraction algorithm modules, two autoencoders based on fully connected neural networks, two attribute lookup tables, and a support vector machine classifier with carefully tuned parameters. In 5-fold cross-validation on the benchmark data set, the method reaches a final accuracy of 87.78%, which is higher than that of existing tools. On the independent validation set (20% of the data), it achieves a prediction accuracy of 89.69%, a relative improvement of 3% over existing tools. Classification performance improves most for the smaller classes: precision for cytoplasm improves by 25.59%; for ribosome, precision increases by 0.17% and recall by 19.45%; and for exosome, recall increases by 48.98%.

The prediction results of different features under different models are also discussed, which reveals how effective the different features and models are for subcellular localization of lncRNA. Because labeled lncRNA subcellular localization data are scarce, deep learning models perform relatively poorly as predictors, but feature extraction based on deep learning models can enhance the representation of lncRNA data to a certain extent and thus improve prediction accuracy. This thesis also applies the model in practice, including prediction of lncRNA subcellular localization across the human transcriptome and the development of a related web site and open-source tools.
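The abstract names k-tuple (k-mer) features and a binomial-distribution filter but does not give their formulas. The following is a minimal sketch of one common formulation, not the thesis's implementation. It assumes overlapping k-mer counting over the A/C/G/T alphabet and a confidence score of 1 minus the binomial upper-tail probability, where `q_j` is the share of all k-mer occurrences that fall in class `j`. The function names (`kmer_counts`, `kmer_frequencies`, `binomial_confidence`, `rank_kmers`) and the class labels in the usage note are hypothetical.

```python
from collections import Counter
from itertools import product
from math import comb

ALPHABET = "ACGT"  # assumption: RNA "U" is mapped to "T" before counting


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper().replace("U", "T")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(c in ALPHABET for c in kmer):
            counts[kmer] += 1
    return counts


def kmer_frequencies(seq, k):
    """Return a fixed-order vector of 4**k normalized k-mer frequencies."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    keys = ("".join(p) for p in product(ALPHABET, repeat=k))
    return [counts[key] / total if total else 0.0 for key in keys]


def binomial_confidence(n_ij, n_i, q_j):
    """Return 1 - P(X >= n_ij) for X ~ Binomial(n_i, q_j).

    A high value means the k-mer occurs in class j more often than
    chance alone would explain (assumed scoring rule).
    """
    tail = sum(
        comb(n_i, m) * q_j ** m * (1 - q_j) ** (n_i - m)
        for m in range(n_ij, n_i + 1)
    )
    return 1.0 - tail


def rank_kmers(class_to_seqs, k):
    """Rank k-mers by their highest per-class confidence, best first."""
    per_class = {}
    for label, seqs in class_to_seqs.items():
        pooled = Counter()
        for s in seqs:
            pooled.update(kmer_counts(s, k))
        per_class[label] = pooled
    grand_total = sum(sum(c.values()) for c in per_class.values())
    if grand_total == 0:
        return []
    all_kmers = set().union(*per_class.values())
    scores = {}
    for kmer in all_kmers:
        n_i = sum(c[kmer] for c in per_class.values())
        scores[kmer] = max(
            binomial_confidence(c[kmer], n_i, sum(c.values()) / grand_total)
            for c in per_class.values()
        )
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, `rank_kmers({"nucleus": ["AAAA"], "cytoplasm": ["CCCC"]}, 1)` scores both `A` and `C` at 0.9375, and a filter would keep the top-ranked k-mers. The direct summation is only suitable for small counts; with realistic 8-mer totals, a log-space or library implementation of the binomial tail would be needed to avoid overflow.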
Keywords/Search Tags: lncRNAs, subcellular location, multi-source heterogeneous characteristics, multiple classification problem, autoencoder, recursive feature elimination, SVM