Research On Prediction Algorithms Of Animal And Plant LncRNA

Posted on: 2022-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: L Cao
Full Text: PDF
GTID: 2480306560974819
Subject: Software engineering
Abstract/Summary:
With the development of science and technology, next-generation sequencing is increasingly applied to the identification and annotation of lncRNA transcripts. Large volumes of biological data are being generated, and more and more unknown areas are being revealed. However, lncRNA and mRNA are structurally very similar and both have biological functions, which makes it difficult to pick out lncRNAs among many transcripts. Experimental methods for verifying transcripts, including most traditional identification experiments, are time-consuming and expensive. Many studies have shown that computational methods can identify these transcripts. Based on RNA transcripts, this thesis studies prediction algorithms for long non-coding RNAs in animals and plants. To address the limited scalability, generality, fault tolerance, and computational efficiency of existing identification tools and algorithms, we formulate efficient strategies for lncRNA identification and design and implement lncRNA prediction algorithms based on sequence alignment, multi-feature classification, and deep learning. The specific work is summarized as follows:

(1) A sequence alignment-based lncRNA prediction algorithm (PreLnc-Seq) was designed and implemented. Taking into account the strengths and weaknesses of traditional sequence alignment, the algorithm combines clustering of the reference data to remove redundancy with approximate alignment, reducing the time and space cost of alignment while preserving a certain level of accuracy. First, high-confidence transcript sequences from the same species were clustered with CD-HIT to form the reference data set, which reduces the time and space complexity of the algorithm, and the sequences to be predicted were then compared against the sequences in this reference set. Second, BLAST, an approximate sequence alignment algorithm, was run with different E-value settings for comparison. Finally, the best CD-HIT parameters and E-value were selected to determine the final lncRNA prediction algorithm. To a certain extent, this algorithm improves the efficiency of traditional alignment-based prediction methods.

(2) A high-precision multi-feature classification lncRNA prediction algorithm (PreLnc) was designed and implemented. After analyzing the limitations of sequence alignment algorithms in prediction quality and computation time, PreLnc was designed from the perspective of machine learning and feature engineering. First, P and Z values adjusted by the false discovery rate (FDR) were used to screen nucleotide feature subsets for animals and plants separately, and candidate feature sets were then formed with 11 important features. Second, the Pearson correlation coefficient was used to remove linearly correlated, redundant features, yielding a ranked feature list. Incremental feature selection was then applied, using the F-measure as the selection criterion, and several methods, including logistic regression, support vector machines, and random forests, were compared. Finally, a balanced random forest prediction model suited to each species was established, and biologically relevant conclusions were summarized and analyzed. Compared with other tools, PreLnc computes features directly from transcripts and is scalable, versatile, and fault-tolerant. PreLnc has good predictive performance and supports lncRNA prediction in a variety of species.

(3) A deep learning lncRNA prediction algorithm (PreLnc-LSTM) was designed and implemented by applying the widely used long short-term memory (LSTM) network to the prediction of lncRNA transcripts. First, the sequences were preprocessed by batch padding and one-hot encoding. Second, the features with significant classification ability in the multi-feature classification algorithm were summarized and analyzed. A chi-square test was used to assess the significance of CDS (coding sequence) percentage and sequence length, and these features were fused with the one-hot sequence encoding. Finally, the model was built with Keras. The prediction results show that the LSTM network predicts less accurately than the multi-feature classification model and depends strongly on the training data. However, the PreLnc-LSTM algorithm shows some advantage in applicability when predicting on other species.
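The batch padding and one-hot encoding step mentioned in (3) can be illustrated with a minimal sketch. The abstract does not give the alphabet order, the padding side, or the handling of ambiguous bases, so the choices below (the order `ACGT`, right padding with all-zero rows, all-zero rows for unknown bases, and the helper name `one_hot_batch`) are assumptions made for illustration, not the thesis's implementation.

```python
from typing import List

ALPHABET = "ACGT"  # assumed nucleotide order; the abstract does not specify one
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def one_hot_batch(sequences: List[str]) -> List[List[List[int]]]:
    """Pad every sequence to the batch maximum and one-hot encode it.

    Padding positions and unknown bases (for example N) become all-zero rows.
    RNA input is accepted by treating U as T.
    """
    max_len = max((len(s) for s in sequences), default=0)
    batch = []
    for seq in sequences:
        rows = []
        for base in seq.upper().replace("U", "T"):
            row = [0] * len(ALPHABET)
            if base in INDEX:
                row[INDEX[base]] = 1
            rows.append(row)
        # Pad on the right with all-zero rows up to the batch maximum.
        rows.extend([[0] * len(ALPHABET) for _ in range(max_len - len(rows))])
        batch.append(rows)
    return batch


# Two toy transcripts of different lengths, one containing an ambiguous base.
encoded = one_hot_batch(["ACGU", "GN"])
```

The result is a batch of shape (sequences, longest length, 4), which is the kind of three-dimensional input a recurrent layer such as an LSTM typically expects. Scalar features such as sequence length would have to be joined to this representation separately, and the abstract does not say how that fusion was done.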
Keywords/Search Tags: LncRNA prediction, Sequence alignment, Multi-feature classification, Deep learning, LSTM