Research On Protein Complex Mining And Its Application

Posted on: 2020-03-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Zhao
GTID: 1360330602958562
Subject: Computer software and theory
Abstract/Summary:
As direct participants in life activities, protein complexes are the basis for understanding protein function, studying disease pathogenesis, and exploring complex life processes, and they are a focus of proteomics research. With the arrival of the post-genomic era, the rapid development of high-throughput biotechnology has provided a huge source of data for mining protein complexes, bringing both opportunities and challenges. How to mine protein complexes from multi-modal, multi-dimensional, and multi-type biological networks has become a hot research topic. Most existing protein complex mining algorithms combine protein biological feature data to mine subgraphs that satisfy certain structural properties of the protein interaction network, but they do not reflect the dynamic and overlapping nature of protein complexes, and the biological information contained in these data is not fully exploited. This dissertation focuses on protein complexes: it studies the mining of dynamic and overlapping protein complexes and the application of protein complexes to identifying key proteins and predicting circRNA-disease associations. Four main pieces of work were completed and are described in this dissertation.

In the first part, described in Chapter Three, the cuckoo search algorithm was applied to protein complex mining for the first time. The optimization theory and bionic mechanism of the cuckoo search algorithm were analyzed in depth, and a new clustering model for dynamic PPI networks based on cuckoo search was constructed to mine dynamic protein complexes. Cuckoo search, which simulates the cuckoo's brood-parasitic breeding behavior and Lévy-flight foraging behavior, has few parameters and strong search ability, so it is widely used in fields such as clustering. When a cuckoo selects a host nest, it considers the color, size, and markings of the host's eggs together with the breeding period. This resembles the core-attachment structure of a protein complex, in which the membership of an attachment node was estimated from its similarity to the core nodes. By combining cuckoo brood parasitism, Lévy-flight foraging, and the core-attachment framework of protein complexes, an improved cuckoo search clustering algorithm (ICSC) was designed and applied to mine dynamic protein complexes in dynamic PPI networks. This study not only expands the application field of the cuckoo search algorithm but also offers a useful research idea for mining dynamic protein complexes.

The second part addresses overlapping protein complexes. The ICSC algorithm proposed in the first part effectively identifies dynamic protein complexes and small protein complexes, but it cannot identify overlapping protein complexes. To identify them, quotient space theory was combined with the protein interaction network, and an overlapping protein complex mining algorithm based on a quotient space dynamic network chain (ONCQS) was proposed. Proteins form different protein complexes because of functional differences. Gene Ontology (GO) functional annotation data were used to analyze the functional similarity of proteins and to weight the protein network. On the weighted network, a quotient space dynamic network chain was constructed at each level of the network, and proteins were partitioned according to their functions. Finally, the multi-level dynamic networks were integrated to obtain the final overlapping protein complexes. Simulation results showed that the proposed algorithm not only identifies overlapping protein complexes effectively but also identifies perfectly matched protein complexes more accurately.

The third part addresses key protein identification. Current methods that identify key proteins by fusing multiple kinds of biological feature data do not fully use the information in those data and cannot reflect the differences between individual proteins, so prediction accuracy needs further improvement. Information entropy theory was introduced to measure the amount of protein biological feature data, and an algorithm (NIE) for identifying key proteins based on second-order neighborhood information and information entropy was proposed. First, a comparison verified that RNA-seq data better reflect gene co-expression characteristics. RNA-seq data were therefore used to measure gene co-expression, and GO functional annotation data were used to measure protein functional similarity; these measures were used to weight the network and so improve the reliability of the protein network. The subcellular localization information entropy and the protein complex information entropy of each protein were then calculated to measure the amount of biological information carried by the protein node itself. Finally, the second-order neighborhood of interactions among protein nodes was considered in order to rank protein nodes and predict key proteins. Simulation results showed that information entropy accurately reflects the amount of biological information carried by a protein and the differences between individual proteins, and that the accuracy and precision of the NIE algorithm were significantly improved. This part of the work offers an idea for using multiple types of biological information effectively.

The fourth part addresses circRNA-disease association prediction. Existing prediction algorithms rely too heavily on prior knowledge and suffer from severely sparse data, both of which reduce prediction accuracy. To solve this problem, a method based on two-way collaborative filtering and protein complex network information (BiCFC) was proposed to predict circRNA-disease associations. The BiCFC algorithm uses Gaussian kernel similarity, circRNA sequence similarity, and disease semantic similarity to construct a new heterogeneous network. To compensate for the lack of prior knowledge, a two-way collaborative filtering model was designed to make full use of known associations, and a protein complex network was constructed to calculate the potential relationships between circRNAs and diseases and so overcome data sparsity. The results of leave-one-out, 5-fold, and 10-fold cross-validation showed that the accuracy of the BiCFC algorithm was significantly improved. Case studies of individual diseases found that BiCFC could accurately identify circRNAs associated with multiple diseases and could also discover new circRNA-disease associations, addressing a common problem in the prediction of circRNA-disease associations.
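
As an illustration of the Gaussian kernel similarity mentioned for BiCFC, the sketch below computes a Gaussian interaction profile kernel from a binary circRNA-disease association matrix. The abstract does not state which Gaussian kernel the dissertation uses, so the interaction-profile form, the bandwidth normalization, the function name `gip_kernel`, the parameter `gamma_prime`, and the toy matrix are all assumptions made for this example, not the author's implementation.

```python
import math


def gip_kernel(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel (assumed form, for illustration).

    profiles: list of equal-length 0/1 lists; row i is the association
    profile of entity i (e.g. one circRNA against every disease).
    Returns a symmetric n x n list of similarities in (0, 1].
    """
    n = len(profiles)
    if n == 0:
        raise ValueError("profiles must not be empty")

    # Bandwidth: gamma_prime divided by the mean squared norm of the profiles.
    sq_norms = [sum(x * x for x in row) for row in profiles]
    mean_sq = sum(sq_norms) / n
    if mean_sq == 0:
        raise ValueError("all profiles are zero; bandwidth is undefined")
    gamma = gamma_prime / mean_sq

    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Squared Euclidean distance between the two profiles.
            dist_sq = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * dist_sq)
    return kernel


# Hypothetical toy data: 3 circRNAs (rows) x 3 diseases (columns).
toy_associations = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
toy_similarity = gip_kernel(toy_associations)
```

In this toy matrix the first two circRNAs have identical profiles, so their similarity is exactly 1.0, while the third shares no disease with them and scores lower. Transposing the matrix would give the corresponding disease-disease similarity under the same assumptions.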
Keywords/Search Tags: Protein complex, Essential protein, Quotient space, Information entropy, circRNA