| With the rapid development of next-generation sequencing technology, large quantities of raw biological sequences, including non-coding RNAs, can be obtained in a short time. Non-coding RNAs (ncRNAs) are RNAs that are not translated into proteins, and long non-coding RNAs (lncRNAs) are ncRNAs longer than 200 nucleotides. LncRNAs play a crucial role in cell growth, apoptosis, disease regulation, and gene expression. Accurately predicting lncRNA sequences from a large number of biological sequences lays a foundation for future studies of lncRNA structure and function. In this thesis, balanced samples of human and mouse lncRNAs and protein-coding transcripts (PCTs) were selected as the main datasets, and those of fruit fly and zebrafish were selected as the cross-species datasets. Single-modal models and a multi-modal model based on the Convolutional Neural Network (CNN) and the Graph Convolutional Neural Network (GCN) were constructed to identify lncRNAs. The single-modal models comprise CNN and GCN models: the CNN model is applied to the primary sequence and the secondary sequence of lncRNA (hereinafter CNN-First and CNN-Second), and the GCN model is applied to the planar graph of the lncRNA secondary structure. The multi-modal integrated model combines the CNN-First, CNN-Second, and GCN models through voting, so it can exploit the sequence information and the planar structure information of lncRNA at the same time. Both the single-modal and multi-modal models were trained and tested on the main datasets using five-fold cross-validation (5-CV), and the cross-species datasets were used as independent test sets. To verify the robustness of the models, an unbalanced sample set derived from the main dataset was also constructed and evaluated with 5-CV. The experimental results show that the integrated model is significantly better than the single-modal models on both the main datasets and the cross-species datasets. The accuracy (ACC) and AUC of the integrated model were 93.51% and 96.07% on the human main dataset, and 94.64% and 95.55% on the mouse main dataset. Compared with other classical lncRNA identification methods, the integrated model showed certain advantages in recognition results. Based on the experimental and comparative results, we believe that the integrated model identifies lncRNAs with high accuracy and reliability. |
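
The voting step that combines the three single-modal models can be illustrated with a minimal sketch. The abstract does not state which voting rule was used, so this example assumes a hard majority vote over per-sample class labels. The function names `majority_vote` and `accuracy`, the label encoding (1 for lncRNA, 0 for PCT), and the toy predictions are all hypothetical and are not taken from the thesis.

```python
from collections import Counter
from typing import List, Sequence

# Assumed label encoding for this sketch only.
LNCRNA = 1  # long non-coding RNA
PCT = 0     # protein-coding transcript


def majority_vote(*model_predictions: Sequence[int]) -> List[int]:
    """Combine per-sample labels from several models by hard majority vote."""
    if not model_predictions:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(model_predictions[0])
    if any(len(p) != n_samples for p in model_predictions):
        raise ValueError("all models must predict the same number of samples")

    ensemble = []
    for votes in zip(*model_predictions):
        # With three voters and two classes, a tie cannot occur.
        label, _count = Counter(votes).most_common(1)[0]
        ensemble.append(label)
    return ensemble


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy predictions standing in for the CNN-First, CNN-Second, and GCN outputs.
cnn_first = [1, 0, 1, 1, 0]
cnn_second = [1, 1, 0, 1, 0]
gcn = [0, 0, 1, 1, 1]
y_true = [1, 0, 1, 1, 1]

ensemble = majority_vote(cnn_first, cnn_second, gcn)
print(ensemble)
print(accuracy(y_true, ensemble))
```

Using an odd number of binary voters means each sample always has a clear majority. A soft vote that averages predicted probabilities would be an alternative, and would also yield the continuous scores needed to compute AUC.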