Research on Plant Non-coding RNA Recognition Methods Based on Machine Learning

Posted on: 2024-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: X H Zhou
GTID: 2543307106465664
Subject: Agriculture
Abstract/Summary:
Non-coding RNAs, which are RNAs that are not normally translated into proteins, are a major research topic in the life sciences. A growing number of studies show that non-coding RNAs are closely linked to the development of many diseases and play an important role in regulating plant nutritional homeostasis, development and stress responses. With the continuous development of sequencing technology, the sequences it produces provide a valuable opportunity for the study of non-coding RNAs. The precise identification of non-coding RNAs therefore lays an important foundation for future studies of their structure and function. Traditional biological experimental methods are time-consuming and labour-intensive, and in recent years many machine learning algorithms have been developed to identify non-coding RNAs. Compared with human non-coding RNAs, plant non-coding RNAs are more difficult to identify because of their structure and quantity. It is therefore important to use machine learning methods to identify plant non-coding RNAs efficiently and accurately. The work in this thesis addresses the identification of plant non-coding RNAs in three main areas.

First, the effects of different features, feature selection methods and modelling approaches on plant non-coding RNA recognition models were studied. Initially, 91 biologically significant features were collected from the literature. Four feature selection methods (F-test, variance threshold filtering, random forest, and F-test + variance threshold filtering) were then compared, and the combination of F-test and variance threshold filtering gave the best results. Finally, the filtered features were used as input to three traditional machine learning models (Random Forest, Naive Bayes, Support Vector Machine) and four automatic machine learning (AutoML) models (AutoGluon, AutoKeras, TPOT, H2O). The results showed that the AutoGluon model performed best, with an accuracy of 95.25% on the test set, 1.5% higher than the next best result, AutoKeras. In addition, the method compares favourably with the widely used tools CPC2, CNCI, CPAT and CPPred on independent test sets constructed for nine species (chickpea, Darwin cotton, lettuce, cassava, wild plantain, water lily, potato, sorghum and maize).

Second, a deep learning-based model for plant non-coding RNA recognition was designed and implemented. Recurrent neural networks (RNN, GRU, LSTM), their bidirectional variants (BiRNN, BiLSTM, BiGRU) and the Transformer were combined with two encoding approaches, One-Hot and Word Embedding, and applied to the recognition of plant non-coding RNAs. The original sequences are first pre-processed by truncating or padding them; the processed sequences are then encoded with each of the two methods, One-Hot and Word Embedding; finally, the encoded data are given as input to the RNN, GRU, LSTM, BiRNN, BiLSTM, BiGRU and Transformer networks to construct 14 recognition models. The results show that the models built on Word Embedding input outperform those built on One-Hot input, with Word Embedding + Transformer achieving the best accuracy of 92.35% on the validation set. Model performance was also found to decrease as sequence length increased. The method performed well on the nine independent test sets, with an accuracy of at least 89.5% on each.

Third, a machine learning-based plant non-coding RNA identification system was designed and developed. The system was written in Python and has five pages: Home, Batch Upload, Download, Help, and About Us. Its core is the built-in machine learning algorithm, which takes submitted plant RNA sequences as input and returns identification results, reducing the user’s labour and material costs. The system provides users with a convenient and efficient tool for the subsequent study of plant non-coding RNAs.
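
The sequence pre-processing and encoding steps mentioned in the abstract (truncate or pad to a fixed length, then One-Hot or integer tokens for an embedding layer) might look roughly like the sketch below. It is only an illustration: the fixed length `MAX_LEN`, the padding symbol `PAD`, the single-nucleotide tokenisation and all function names are assumptions, since the abstract does not state the thesis's actual settings (a k-mer vocabulary, for instance, would be equally plausible for the Word Embedding input).

```python
# Illustrative sketch only. MAX_LEN, PAD and the function names are
# hypothetical; they are not taken from the thesis.
ALPHABET = "ACGU"
PAD = "N"      # assumed padding symbol
MAX_LEN = 8    # assumed fixed length, kept tiny for readability


def fix_length(seq, max_len=MAX_LEN, pad=PAD):
    """Upper-case, map T to U, then truncate or right-pad to max_len."""
    seq = seq.upper().replace("T", "U")
    return seq[:max_len].ljust(max_len, pad)


def one_hot(seq):
    """Encode each base as a 4-element 0/1 vector.

    Padding or unknown symbols become an all-zero vector.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in seq]


def token_ids(seq):
    """Map each base to an integer index, with 0 for padding/unknown.

    Integer ids like these are the usual input to a trainable embedding layer.
    """
    return [ALPHABET.find(base) + 1 for base in seq]


if __name__ == "__main__":
    fixed = fix_length("acgtac")
    print(fixed)
    print(one_hot(fixed))
    print(token_ids(fixed))
```

Reserving index 0 for padding keeps padded positions distinguishable from real bases, which is a common convention when sequences of different lengths are batched together.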
Keywords/Search Tags: Plant, Non-coding RNA, Automatic machine learning, Deep learning