Research On Function Annotation Of Biological Macromolecules Based On Machine Learning

Posted on: 2022-12-21
Degree: Master
Type: Thesis
Country: China
Candidate: W Q Xia
GTID: 2480306752476374
Subject: Medicinal chemistry
Abstract/Summary:
With the rapid development of high-throughput sequencing technology, massive numbers of unannotated non-coding RNAs and proteins are being generated and deposited in public databases. Predicting and annotating these data remains one of the important unsolved problems in bioinformatics. On the one hand, non-coding RNAs play an important role in gene expression and are closely related to the occurrence and development of diseases such as cancer and nervous system disorders. Identifying non-coding RNAs is therefore of great significance for understanding disease development and for discovering disease markers or drug targets. On the other hand, proteins participate in most activities of living organisms and perform a wide range of functions in the body. Exploring protein function can deepen the understanding of life processes and provide new targets for the development of innovative drugs. At present, the TrEMBL database contains more than 200 million protein sequences, and fewer than 1% of them have functions determined by experiment, so computational prediction of protein function has become an essential tool. The rapid development of machine learning algorithms in artificial intelligence has made it possible to predict the function of large numbers of gene products, which in turn requires researchers to develop efficient, fast and reliable models to assist annotation. In this research, machine learning and deep learning methods were adopted for the annotation of non-coding RNAs and proteins. The work focuses on the following two aspects:

1. The online website CORAIN was constructed to provide methods for the annotation of non-coding RNAs and the prediction of their interactions. First, comprehensive feature extraction methods for non-coding RNA were provided from three perspectives (sequence, physicochemical properties, and structure), and popular natural language models were used to establish several new feature encoding methods suitable for both machine learning and deep learning models. Second, CORAIN supports multiple prediction tasks for non-coding RNAs, including the identification and classification of non-coding RNAs and the prediction of interactions between non-coding RNAs and other non-coding RNAs, proteins and small molecules. In addition, evaluation metrics are reported for each prediction and the prediction results are visualized. Users can select encoding methods according to their needs. CORAIN covers the integrated process of feature extraction, feature integration, classifier construction and performance evaluation, and offers machine learning algorithms and a variety of deep learning algorithms to researchers, which could promote the study of non-coding RNA.

2. Based on the Gene Ontology (GO), a new protein function annotation tool, PFmulDL, was proposed. PFmulDL combined a multi-layer CNN with a multi-layer GRU, an architecture applied to protein function annotation for the first time. First, one-hot encoding of the protein sequence was used as the model input, and during training the neural network learned the relationship between the input and the label through successive mappings. Second, transfer learning was introduced to improve the final performance of the model through pre-training and fine-tuning. On an independent test dataset, PFmulDL was compared with existing protein annotation methods and generally achieved a better Fmax. In addition, PFmulDL could predict more than 5800 GO families, the largest number of GO families predicted at one time by any tool at the time of the study. To further explore the performance of the model on families with different numbers of samples, the protein families were divided into 10 levels according to the structure of GO, and a method for evaluating and analyzing performance by GO level was proposed. Compared with other methods, PFmulDL significantly improved prediction performance for GO families with few samples. PFmulDL is therefore expected to be an essential complement to existing methods for protein function prediction.
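
The abstract reports model comparisons in terms of Fmax without defining it. As a rough illustration, the sketch below computes the protein-centric Fmax commonly used in protein function prediction benchmarks: at each score threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all proteins, and Fmax is the largest resulting F-measure. The function name `fmax`, the argument names, and the toy GO identifiers are hypothetical and chosen only for this example; the thesis may compute the metric differently (for instance, after propagating terms through the GO hierarchy).

```python
def fmax(true_terms, pred_scores, thresholds=None):
    """Protein-centric Fmax (illustrative sketch, not the thesis code).

    true_terms:  list of sets of GO term IDs, one set per protein.
                 Assumes each protein has at least one true term.
    pred_scores: list of dicts {GO term ID: score in [0, 1]}, one per protein.
    """
    if thresholds is None:
        # Assumed threshold grid: 0.01, 0.02, ..., 1.00
        thresholds = [i / 100 for i in range(1, 101)]

    best = 0.0
    for t in thresholds:
        precision_sum = 0.0
        recall_sum = 0.0
        n_with_prediction = 0
        for truth, scores in zip(true_terms, pred_scores):
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                # Precision only counts proteins with >= 1 predicted term
                precision_sum += hits / len(predicted)
                n_with_prediction += 1
            if truth:
                recall_sum += hits / len(truth)
        if n_with_prediction == 0:
            continue  # no predictions at this threshold
        precision = precision_sum / n_with_prediction
        recall = recall_sum / len(true_terms)  # averaged over all proteins
        if precision + recall > 0:
            f_score = 2 * precision * recall / (precision + recall)
            best = max(best, f_score)
    return best


# Toy data with made-up GO identifiers (not from the thesis)
example_truth = [{"GO:1", "GO:2"}, {"GO:3"}]
example_preds = [
    {"GO:1": 0.9, "GO:2": 0.4, "GO:4": 0.6},
    {"GO:3": 0.8, "GO:1": 0.3},
]
example_fmax = fmax(example_truth, example_preds)
```

Because recall is averaged over all proteins while precision skips proteins with no prediction, a model that predicts nothing for many proteins is penalized through recall rather than precision.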
Keywords/Search Tags: Protein function annotation, The identification of ncRNA, The prediction of ncRNA-related reactions, Feature encoding methods, Machine learning