| In the human genome, non-coding sequences account for more than 90% of the genomic sequence, and microRNA (miRNA), a type of non-coding RNA, plays an important regulatory role in cell differentiation and tissue development. Dysregulation of miRNA affects the growth and differentiation of cells, and dysregulation or overexpression of miRNA can inhibit the proliferation or metastasis of various cancers, so miRNA identification has important theoretical value and practical significance for the diagnosis and treatment of diseases. miRNA identification methods fall into two main types: experimental cloning and computational prediction. Experimental cloning is limited because it can only identify miRNAs that are expressed at specific developmental stages or in specific tissues. Among computational prediction methods, existing deep learning approaches do not attend to the temporal and spatial information of miRNA at the same time, even though the spatial information of miRNA carries its functional information and the base sequence of miRNA (its temporal information) affects the normal regulation of miRNA molecules. It is therefore necessary to capture both the temporal and the spatial information of miRNA. To address these problems in existing miRNA identification research, this paper constructs a cascade CNN-BLSTM framework from the perspective of sequence and secondary structure, and captures the complex and abstract temporal and spatial information of miRNA in order to identify miRNA. The main research contents of this paper are as follows: (1) In the preprocessing stage, this paper uses the CD-HIT and RNAfold tools to remove redundant sequences and to obtain the secondary structure corresponding to each sequence, and uses one-hot encoding to vectorize the pre-miRNA sequence and secondary structure. (2) Because existing deep learning methods do not attend to the temporal and spatial information of pre-miRNA at the same time, and therefore lose information, this paper proposes a cascade CNN-BLSTM model. The model first uses a CNN to extract local spatial information, and then uses a BLSTM neural network to extract contextual features of the sequence and to model the time dependence of pre-miRNA sequences. The model thus captures comprehensive spatial and temporal information from pre-miRNA sequences and secondary structures to classify miRNAs. (3) Because the experimental data are generally imbalanced, this paper compares three imbalance-handling schemes and, based on the experiments, selects the best-performing one, the focal loss function, to reduce the impact of data imbalance on model performance, so that the classifier can focus on positive samples and improve its prediction ability. In summary, the cascade CNN-BLSTM model considers sequence and secondary structure information together, captures the complex and abstract spatial and temporal information of miRNA, and incorporates a solution to data imbalance. The experimental results show that, compared with existing research, our model has advantages in sensitivity (SE), positive predictive value (PPV), and F-score, indicating that the model can identify pre-miRNA more effectively, which provides a new theoretical basis for research in cell physiology and pathology. |
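
The preprocessing and imbalance-handling steps described above can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes a four-letter RNA alphabet, a three-symbol dot-bracket secondary structure string such as the one RNAfold prints, and the standard binary focal loss; the function names `one_hot`, `encode_pre_mirna`, and `focal_loss`, and the default values `gamma=2.0` and `alpha=0.25`, are hypothetical choices made for illustration and are not taken from the paper.

```python
import math

SEQ_ALPHABET = "ACGU"    # assumed RNA base alphabet
STRUCT_ALPHABET = "(.)"  # assumed dot-bracket symbols for the secondary structure


def one_hot(symbols, alphabet):
    """Return one row per symbol; symbols outside the alphabet become all-zero rows."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in symbols:
        row = [0] * len(alphabet)
        if ch in index:
            row[index[ch]] = 1
        rows.append(row)
    return rows


def encode_pre_mirna(sequence, structure):
    """Concatenate per-position one-hot vectors of the sequence and its structure."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    seq_rows = one_hot(sequence.upper().replace("T", "U"), SEQ_ALPHABET)
    struct_rows = one_hot(structure, STRUCT_ALPHABET)
    return [s + t for s, t in zip(seq_rows, struct_rows)]


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class, y is the label (0 or 1).
    gamma and alpha are illustrative defaults, not values reported in the paper.
    """
    p = min(max(p, eps), 1.0 - eps)          # clip to avoid log(0)
    p_t = p if y == 1 else 1.0 - p           # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Toy hairpin fragment: 4 bases with a matching dot-bracket string.
    for row in encode_pre_mirna("GCAU", "(..)"):
        print(row)
    # A confidently correct positive contributes far less loss than a missed one.
    print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

Each position becomes a 7-dimensional vector (4 for the base, 3 for the structure symbol), which is the kind of input a convolutional layer followed by a bidirectional LSTM can consume. With `gamma=0` and `alpha=0.5` the loss reduces to half the ordinary cross-entropy, which shows that the focusing term is what down-weights easy samples.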