
Research On The Algorithm Of RBP Binding Sites And Motif Identification Based On CLIP Data

Posted on: 2020-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: Y G Xiao
GTID: 2370330575463647
Subject: Electronics and Communications Engineering
Abstract/Summary:
RNA-binding proteins (RBPs) play an important role in gene expression. By binding to RNA sites, RBPs affect the formation of mature mRNA and thereby the synthesis of proteins. As molecular biology has developed, it has become clear that the binding of RBPs to RNA is specific: a given RBP has higher affinity for certain RNA sites. Techniques for identifying RBP target sites include in vitro selection and RNA co-immunoprecipitation methods such as cross-linking immunoprecipitation (CLIP), which are unfortunately lengthy and difficult, requiring significant time and labor. In addition, traditional statistical methods applied in this field show only moderate predictive performance and weak model interpretability, so they offer little practical guidance. In recent years, the rapid development of computing techniques represented by deep learning, together with the maturing of CLIP technology, has opened new opportunities for research on RBP target-site identification. Based on CLIP data for 17 human RBPs, this study uses deep learning to build models that predict RBP target sites and RBP binding. The specific work of this thesis is as follows:

(1) Data acquisition and preprocessing. The data used in this study come from the iCount and DoRiNA databases. During acquisition, peak sampling was used to retain significant CLIP data and to try to exclude false positives. Then, starting from the sequence data, techniques such as RNA folding were used to derive further dimensions of data for each sequence, such as secondary structure, CrossBinding, and RegionType data.

(2) The SOCN model, based on sequence data. The SOCN model takes the one-hot encoding of a sequence as input and uses a convolutional neural network to abstract the sequence information automatically, avoiding manual intervention and feature selection. A fully connected layer and a Softmax layer then classify the input. The average AUC of the SOCN model on the benchmark data set reaches 0.823, which is superior to the other models compared.

(3) The MSM model, based on multiple data sources. Analysis of the SOCN results shows that classification is poor for certain RBPs, and further analysis finds that these RBPs tend to bind structured sequences. A hybrid model with multiple data sources (MSM) is therefore proposed on the basis of SOCN. In addition to sequence information, the model takes CrossBinding, secondary structure, and RegionType data as input. On the same data set, the MSM model overcomes the shortcomings of the SOCN model and, with an average AUC of 0.90, becomes the best model, improving performance by 10.9%, 12%, and 13.9% compared to the SOCN, iONMF, and Oli models, respectively.

(4) Motif identification. Given the strong performance of the MSM model, its parameters were examined in detail. A convolution kernel in MSM behaves like a motif scanner that identifies significant sequence features. The set of significant sequence features recognized by a kernel is converted into a motif and visualized with the WebLogo tool. Finally, the Tomtom algorithm is used to compare the predicted motifs with a motif database. The results show that 78% of the predicted motifs match the database with high confidence.
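The one-hot input described in (2) and the kernel-as-motif-scanner idea in (4) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the hand-written kernel (standing in for a trained one), the toy sequences, and the activation threshold are all assumptions made for illustration. The sketch encodes RNA sequences as one-hot rows, slides a kernel along them, and counts the bases in every window whose score exceeds the threshold, giving a position frequency matrix of the kind a logo tool can draw.

```python
# Illustrative sketch only: names, kernel values, and threshold are assumptions,
# not taken from the thesis.
ALPHABET = "ACGU"


def one_hot(seq):
    """Encode an RNA sequence as rows of four values ordered A, C, G, U.

    Bases outside the alphabet (for example N) become all-zero rows.
    """
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def scan(kernel, encoded):
    """Slide a (width x 4) kernel along an encoded sequence.

    Returns one score per window, like a 1-D convolution without padding.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(kernel[i][j] * encoded[start + i][j]
                    for i in range(width)
                    for j in range(4))
        scores.append(score)
    return scores


def kernel_to_pfm(kernel, sequences, threshold):
    """Count bases per position over all windows scoring above threshold."""
    width = len(kernel)
    counts = [[0] * 4 for _ in range(width)]
    for seq in sequences:
        seq = seq.upper()
        for start, score in enumerate(scan(kernel, one_hot(seq))):
            if score > threshold:
                for i, base in enumerate(seq[start:start + width]):
                    if base in ALPHABET:
                        counts[i][ALPHABET.index(base)] += 1
    return counts


# Hypothetical kernel that responds most strongly to the 3-mer "UGU".
kernel = [
    [0, 0, 0, 1],  # position 1 favors U
    [0, 0, 1, 0],  # position 2 favors G
    [0, 0, 0, 1],  # position 3 favors U
]
sequences = ["AAUGUCC", "UGUGUA"]
pfm = kernel_to_pfm(kernel, sequences, threshold=2.5)
for row in pfm:
    print(dict(zip(ALPHABET, row)))
```

In a trained network the kernel weights are learned rather than written by hand, and the rule for selecting activating windows is a design choice; the fixed threshold used here is only one simple option.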
Keywords/Search Tags:RNA-binding proteins, Deep Learning, CLIP