In plants, small RNAs (sRNAs) are a class of short non-coding RNAs 20-24 nt in length. They bind to different types of Argonaute (AGO) proteins to form the RNA-induced silencing complex (RISC), which silences genes by cleaving target mRNAs or mediating epigenetic changes on chromatin, and they thereby play a crucial regulatory role at the transcriptional or post-transcriptional level. Studies have shown that sRNAs play important roles in the growth and development of vegetative and reproductive organs, signal transduction, hormone synthesis, morphogenesis, and stress responses. AGO proteins are the core proteins of the RNAi mechanism in eukaryotes, and the function of a small RNA depends on the type of AGO protein it binds. RNA immunoprecipitation sequencing (RIP-seq) can identify AGO-binding small RNAs (asRNAs). However, this experimental method is time-consuming and laborious, which makes it difficult to profile samples from different tissues and treatments on a large scale. Based on RIP-seq data for different AGO proteins in Oryza sativa and Arabidopsis, we used machine learning and deep learning algorithms to build classification models, laying the foundation for the identification and functional analysis of asRNAs. GO enrichment and pathway analyses were also performed on the target genes regulated by asRNAs, and the sources of these asRNAs were identified. In summary, using machine learning and deep learning methods, this study can identify biologically significant asRNAs from the large volume of small RNA sequencing data with unknown functions, providing a new perspective for genome-wide small RNA research. The conclusions of this work are summarized as follows:
1. Because traditional RNA feature extraction methods do not fully consider the global and statistical characteristics of RNA sequences, a feature extraction method based on global features, named the matrix of statistical features, is proposed in this
work. This method considers not only the first base at the 5' end of an RNA sequence but also two important statistical features of the sequence: length and GC content. The results showed that the proposed feature extraction method improves the F1-score of all classification models by approximately 2.6%-15.5%. Combining different feature extraction methods further improves the F1-score of all classification models by approximately 1%-3%. In addition, combining RNA sequence composition, RNA secondary structure status, and the matrix of statistical features maintains the high accuracy of the classification model while keeping feature extraction computationally fast.
2. Based on the RIP-seq data of AGO proteins in Oryza sativa and Arabidopsis, 8 machine learning-based classification algorithms were comprehensively evaluated using the improved feature extraction methods. The performance of 2 deep learning-based classification algorithms was also evaluated using one-hot encoding. The results showed that the ensemble learning models outperform the traditional and deep learning-based models, particularly the LightGBM model. The performance metrics of this classification model are approximately 88%-99% under 10-fold cross-validation. The learning curve also indicates that the LightGBM classification model not only has high classification accuracy but also generalizes well.
3. To determine the source of AGO1-related asRNAs, mature microRNA (miRNA) sequences from miRBase were used to determine whether asRNAs are reported miRNAs. In addition, we developed a phased siRNA (phasiRNA) prediction algorithm and built an online platform, named PhasiRNAnalyzer (https://cbi.njau.edu.cn/PPSA/), to determine whether asRNAs are predicted phasiRNAs. The results showed that about 50% of tissue-specific miRNAs could be found among the identified
AGO1-related asRNAs in rice, and that 20% of AGO1-related asRNAs were identified as 21-nt phasiRNAs in rice. Therefore, we speculate that, besides miRNAs and phasiRNAs, other unknown small RNAs interact with AGO1 and play regulatory roles in rice. We also found that phasiRNAs are more enriched in reproductive organs than in vegetative organs in rice, and that most of the identified asRNAs originate from intergenic regions.
4. To study the function of AGO1-related asRNAs, we developed an algorithm that verifies the relationships between small RNAs and their target genes using degradome sequencing data, and built an online server, named WPMIAS (https://cbi.njau.edu.cn/WPMIAS/). Using this server, we verified the target genes of AGO1-related asRNAs identified from rice small RNA-seq data across different tissues and found that the verification rate of asRNAs with biological function was higher than 76.3% and reached 100% in panicle tissue. GO analysis showed that AGO1-related asRNAs are involved in growth and development processes, especially in reproductive organs, including flower development, morphological formation, and multiple metabolic pathways. The results also demonstrated that about 90% of AGO1-related asRNAs play an extensive regulatory role in rice, participating in various biological processes including cell differentiation, DNA synthesis and repair, growth and development, and transcription regulation. In addition, 30% of the target genes of AGO1-related asRNAs are transcription factors, suggesting that these asRNAs play a vital role in development and stress responses in rice.
5. After analyzing the asRNA classification models trained on Oryza sativa and Arabidopsis datasets, we found that the classification models were conserved to some extent. This result indicates that the classification models are valuable for identifying asRNAs in other monocotyledons and dicotyledons. Based on
the AGO1-related asRNA classification models of Oryza sativa and Arabidopsis, we also performed a comprehensive analysis of the asRNAs identified in 4 monocotyledons and 4 dicotyledons. The results demonstrated that most of the identified asRNAs were 21 nt in length and showed a preference for uracil at the 5' end. Moreover, more than 60% of the tissue-specific miRNAs in these 8 plants could be identified among the AGO1-related asRNAs, suggesting that AGO1-related asRNAs are highly conserved in plants.
6. We also developed an online platform, named ASRNAIS (https://cbi.njau.edu.cn/ASRNAIS/), for identifying plant asRNAs based on the highly reliable classification models, which provides a basis for further understanding the biological functions of these small RNAs.
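
As an illustration of the statistical features described in conclusion 1 (the 5' first base, sequence length, and GC content), the following minimal Python sketch computes them for a single small RNA. The abstract does not specify the exact layout of the matrix of statistical features, so the function name `statistical_features`, the one-hot encoding of the first base, and the order of the returned values are assumptions made for illustration only, not the authors' implementation.

```python
from typing import List

# Assumed base order for the one-hot encoding of the 5' first base.
BASES = "ACGU"


def statistical_features(seq: str) -> List[float]:
    """Return [one-hot 5' base (A, C, G, U), length, GC content] for one sequence.

    Hypothetical layout: the source names the three features but not their encoding.
    """
    # Normalize case and treat DNA-style reads (T) as RNA (U).
    seq = seq.strip().upper().replace("T", "U")
    if not seq:
        raise ValueError("empty sequence")

    # One-hot vector for the first base at the 5' end.
    first_base = [1.0 if seq[0] == base else 0.0 for base in BASES]

    # Sequence length in nucleotides.
    length = float(len(seq))

    # GC content as the fraction of G and C bases.
    gc_content = (seq.count("G") + seq.count("C")) / len(seq)

    return first_base + [length, gc_content]


if __name__ == "__main__":
    # A made-up 21-nt sequence with a 5' uracil, used only as example input.
    print(statistical_features("UGACAGAAGAGAGUGAGCACA"))
```

Such a vector could be concatenated with other features (for example, sequence composition or secondary structure status) before being passed to a classifier; how the original work combines them is not stated in the abstract.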