Protein Interaction Network-based Clustering Algorithm Research And Its Application

Posted on: 2020-07-28
Degree: Doctor
Type: Dissertation
Country: China
Candidate: B Li
Full Text: PDF
GTID: 1360330623951641
Subject: Computer Science and Technology
Abstract/Summary:
Drawing on the multi-class, multi-level systems biology data that have been collected and analyzed, including genomics, proteomics, metabolomics, and regulatory networks, the future trend in bioinformatics is to develop models that systematically analyze the dynamics of all components in living organisms, such as proteins, genes, and mRNA, and to design personalized precision medicine for each patient. Within this effort, the research and application of clustering algorithms based on protein interaction networks is a fundamental problem in computational biology. The main line of this research is to develop clustering algorithms on protein interaction networks for applications such as general protein complex identification and disease-associated module mining. In addition, because protein complexes and disease functional modules are closely related to protein subcellular location information, and the currently available data are incomplete and contain a high proportion of false positives, we also study protein subcellular localization prediction. In summary, the main research contents are as follows:

(1) Protein subcellular localization based on fusion of multi-window features. Current protein sequence representations, such as amino acid composition and pseudo amino acid composition, have difficulty fully exploiting the interaction information between residues and the position distribution information of each residue. This dissertation first proposes two methods for extracting sequence features. One is a 2-dimensional feature based on an improved chaos game model, which focuses on mining the frequency and global position distribution information of the primary sequence. The other is a new 3-dimensional feature based on statistical information theory, which mainly reflects the local position information of residues. A classification model based on the resulting 5-dimensional features and unitary distance is then designed. Its advantage is that it can quickly predict subcellular location without a computationally expensive classifier such as SVM, and its accuracy exceeds that of some SVM-based classification models. To further improve prediction accuracy and usability, we combine the new 5-dimensional features with pseudo amino acid composition and dipeptide composition, and use SVM as the classifier. The experimental results show that this multi-window model predicts significantly more accurately than almost all classical algorithms, which indirectly indicates that the two new features are an effective complement to the current classical features. Finally, the subcellular location predictions for some proteins were judged to be false positives, but text mining of authoritative journals verified that these locations are reported in the literature and simply not yet included in the public database.

(2) Research on a protein complex mining algorithm based on the core-attachment structure. Protein complexes are the main carriers of cellular function in organisms, and most of them have been found to have a core-attachment structure. Protein interaction network clustering algorithms have difficulty identifying overlapping modules, and the complexes they predict are often hard to interpret biologically. To address these problems, this dissertation proposes CFOCM, a model for mining core-attachment protein complexes. CFOCM first defines a new affinity function that fuses gene ontology annotation terms with the relevance between their functions, so that the core of a complex may be relatively sparsely connected internally yet tends to share at least one biological function. It then selectively attaches peripheral proteins to the complex core, according to a preset closeness strategy, to form the final protein complex. CFOCM performs better than existing algorithms (ClusterONE, MCL, CORE, COACH, etc.) on different types of network datasets, both relatively sparse and relatively dense, which demonstrates the effectiveness and high adaptability of the algorithm. In addition, a comparative experiment shows that the assumption that the proteins of a core share at least one function, defined by means of gene ontology annotations, effectively improves the performance of the algorithm.

(3) Research on disease-associated module identification based on a multi-objective evolutionary computation framework. Mining the functional modules associated with disease can help to screen new drug targets and uncover the mechanisms by which complex diseases develop. At present, there are relatively few research results in this area. This dissertation proposes MPSOPC, a disease-associated module prediction model based on a multi-objective evolutionary computation framework. The advantage of a multi-objective optimization framework is that it can simultaneously balance multiple objectives, such as the internal density of clusters, the connectivity between clusters, and the closeness of each protein in a module to a given disease, and return a set of optimal solutions. In addition, the model can fully exploit the global topological properties of the network. The experimental results confirm that MPSOPC can identify clusters that are densely interconnected internally and relatively sparsely connected to one another, and that the identified complexes are highly correlated with certain types of diseases. MPSOPC is also efficient and robust, which makes it an effective tool for identifying potential disease-causing gene sets and new drug targets.
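
The core-attachment idea in item (2) can be illustrated with a minimal sketch. It is not CFOCM: the abstract does not specify the affinity function or the closeness strategy, so the cores are supplied by hand here and the closeness rule is assumed to be "a peripheral protein is attached if it interacts with at least a given fraction of the core". The names `build_graph`, `attach_periphery`, and `min_fraction`, and the toy network, are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, Set[str]]  # undirected interaction network as adjacency sets


def build_graph(edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build an undirected adjacency map from a list of protein pairs."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def attach_periphery(graph: Graph, core: Set[str], min_fraction: float = 0.5) -> Set[str]:
    """Return the core plus every neighbouring protein that interacts with
    at least `min_fraction` of the core members (an assumed closeness rule)."""
    # Candidates are proteins adjacent to the core but not part of it.
    candidates = set().union(*(graph[p] for p in core)) - core
    complex_members = set(core)
    for protein in candidates:
        shared = len(graph[protein] & core)
        if shared / len(core) >= min_fraction:
            complex_members.add(protein)
    return complex_members


# Toy network: two triangle-shaped cores, one protein (X) close to both,
# and one protein (Y) only weakly linked to the first core.
edges = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("D", "E"), ("D", "F"), ("E", "F"),
    ("X", "A"), ("X", "B"), ("X", "D"), ("X", "E"),
    ("Y", "C"),
]
graph = build_graph(edges)
cores: List[Set[str]] = [{"A", "B", "C"}, {"D", "E", "F"}]
complexes = [attach_periphery(graph, core) for core in cores]
```

Because attachment is decided per core, protein X ends up in both complexes, which is how a core-attachment scheme can yield overlapping modules, while Y is left out because it touches only one of the three core members.
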
Keywords/Search Tags: Feature extraction, Chaos game model, Core-attachment, Protein complex, Disease-related module, Multi-objective optimization