
Research On Protein Complexes Identification In Protein Interaction Networks

Posted on: 2015-09-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Yu
GTID: 1220330422492443
Subject: Computer application technology
Abstract/Summary:
A protein complex is a stable group of two or more proteins that associate through interaction at the same time and place. At present, the number of protein complexes obtained by experiments is limited and the experiments are costly. Computational methods for identifying protein complexes therefore have practical significance and application value. This paper investigates the identification of protein complexes from the following four aspects.

First, many methods for identifying protein complexes cluster the protein-protein interaction graph using topology information alone and make little use of the information in the protein sequence. To address this, a local-search graph clustering algorithm based on fused features is proposed. During feature extraction, the amino acid background frequency of the protein sequence is introduced and combined with the topology information. Cosine similarity is then applied as the similarity measure to locate the complexes. Finally, in the judging step, both topological and biological information are used. The algorithm allows different protein complexes to overlap. Experiments indicate that the algorithm can effectively match more real protein complexes.

Second, real protein complexes have diverse topologies, such as line, star, clique and hybrid, and cannot be described uniformly. To describe these topologies more fully, a supervised clustering method that combines topological and biological characteristics is proposed, based on the support vector machine (SVM). The algorithm first builds the negative set according to the distribution of the real complexes; second, a practical algorithm is designed. Finally, in the recognition step, topological constraints and the SVM are used jointly to determine the identified clusters.
The experimental results show that the supervised graph clustering algorithm based on the support vector machine outperforms several classical algorithms in F-measure. A case analysis shows that the algorithm can identify hybrid topological structures.

Third, protein-protein interaction data contain noise, which corrupts the prediction results. Although clustering algorithms have been proposed to improve the detection of protein complexes, these methods consider only the topology information of the unweighted graph. To solve these problems, a weighted-graph method for complex identification that combines information from multiple data sources is proposed. The main task of this part is to build three weighted graphs: one based on gene ontology, one based on the von Mering confidence, and one based on both von Mering and gene ontology information. The algorithm based on the fusion of von Mering and gene ontology information performs better than the other two methods in recall, precision and F-measure.

Lastly, complex formation is affected not only by topology information but also by spatial constraints. Currently, most identification methods extract only locally dense regions of the network and rely mainly on the topology of the protein-protein interaction network without considering spatial restrictions, which inevitably introduces false positives. Based on this observation, subcellular spatial information is introduced to identify protein complexes. The main research covers complex identification based on a sorting strategy and complex identification based on a fusion strategy. In the sorting-based strategy, a collection of candidate clusters is obtained from the interaction network and then selected by the sorting strategy, which effectively filters out false positive complexes.
The fusion method identifies protein complexes separately in the spatial network and the topological network and obtains the final clusters through a merging strategy. This method can effectively improve the F-measure of L, CFinder, MCODE and MCL. The four clustering algorithms proposed in this paper are described, and the relationships among them and their application scenarios are analyzed.
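
The precision, recall and F-measure comparisons mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not state how a predicted cluster is matched to a real complex, so the matching rule here is an assumption: an overlap score |A ∩ B|^2 / (|A| * |B|) with a threshold of 0.2, a criterion often used in this field. The names `overlap_score` and `evaluate` are hypothetical and are not taken from the dissertation.

```python
from typing import Iterable, List, Set, Tuple


def overlap_score(a: Set[str], b: Set[str]) -> float:
    """Overlap between two protein sets: |A & B|^2 / (|A| * |B|).

    Assumed matching criterion; the abstract does not specify one.
    """
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def evaluate(
    predicted: Iterable[Iterable[str]],
    reference: Iterable[Iterable[str]],
    threshold: float = 0.2,  # assumed cutoff, not from the source
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for predicted clusters."""
    pred: List[Set[str]] = [set(c) for c in predicted]
    ref: List[Set[str]] = [set(c) for c in reference]

    # Predicted clusters that match at least one reference complex.
    matched_pred = sum(
        1 for p in pred if any(overlap_score(p, r) >= threshold for r in ref)
    )
    # Reference complexes that are matched by at least one prediction.
    matched_ref = sum(
        1 for r in ref if any(overlap_score(p, r) >= threshold for p in pred)
    )

    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_ref / len(ref) if ref else 0.0
    total = precision + recall
    # F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy data for illustration only; not results from the dissertation.
    predicted = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y"}]
    reference = [{"A", "B", "C", "F"}, {"D", "E", "G"}]
    print(evaluate(predicted, reference))
```

Because clusters are plain sets and each one is scored independently, overlapping predictions (which the first algorithm allows) need no special handling in this sketch.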
Keywords/Search Tags: Protein-protein interaction network, Protein complex, Amino acid background frequency, Gene ontology, Semantic similarity