Font Size: a A A

Clustering Analysis Of Complex Biological Networks

Posted on:2011-04-06Degree:DoctorType:Dissertation
Country:ChinaCandidate:J MeiFull Text:PDF
GTID:1100360302487739Subject:Fermentation engineering
Abstract/Summary:PDF Full Text Request
Genomics and high-throughput technologies have yielded large amount of relational data about the components (like proteins) of living systems; such data form various complex biological networks that carry rich knowledge about the systems'mechanisms. In the post-genome era, a current challenge is to mine the hidden knowledge stored in the networks. As an important means for knowledge discovery, graph clustering attracts much attention in the analysis of complex biological networks.We develop a high-performance graph clustering algorithm CD (contraction-dilation) in this paper, investigate the detection of protein remote homology based on protein similarity network and analyze functional modules in protein-protein interaction network. To evaluate graph clustering algorithms, a near-realistic network model with key characteristic of the research subject is proposed. The main contents include:A novel graph clustering algorithm CD based on modularity optimization is proposed. The modularity is a global index describes the strength of community structures of complex networks. The proposed algorithm is extensively tested on computer-generated benchmark networks, biological networks and social networks. Compared with existing algorithms, CD algorithm is efficient to discover more stable community structures with higher modularity scores and accuracies at lower expenses of both CPU time and memory. Thus, it can be applied to analyze large networks. CD algorithm is the basis for this paper.Remote homologues are hidden in the twilight zone, where their sequence similarity is too low to distinguish them from pairs of sequences with equal or even higher sequence similarity due to chance. To extract weak signals about homology from sequence similarity with high noises, we apply CD algorithm to a protein similarity network to detect remote homology. The similarity matrix is formed by the sigmoidal transformation of the Smith-Waterman alignment scores, and then fed into the CD algorithm. The method is tested on several datasets from superfamily level of SCOP database. Numerical experiments show that CD algorithm manages to detect protein remote homology, for the communities it outputs are largely corresponding to protein superfamilies and the number of communities is close to the number of superfamilies in the given dataset. Also, the performance of CD algorithm is superior to spectral clustering algorithm and MCL algorithm. The results show that sequence similarity carry significant information on remote homology, which can be mined by using CD algorithm.Protein-protein interaction network is a typical complex biological network with many nodes and unevenly distributed links, thus is not easily divided into communities with statistically significant. We use CD algorithm to partition a yeast protein-protein interaction network, which is constructed by selecting yeast protein interactions with medium and high confidence from the literature. For the communities outputs by CD algorithm, we first assess the validity from the topology perspective and find that they are densely connected local subgraphs; then we find known protein complexes annotated by MIPS database are largely contained in them in their entirety. This indicates that these communities may have biological significance. Moreover, we annotate these communities based on P-values. And we investigate the functions of individual proteins in communities and find that most of them usually share common functions, which are in accordance with the main functions of communities.In most cases, we know little background knowledge of real networks. To evaluate the performance of graph clustering algorithms on real networks, a near-realistic network model (NR model) is constructed. Inferring the ability of a graph clustering algorithm to analyze real networks through its performance on model networks. To ensure rationality of this inference, the model network constructed has the same first-order statistical properties as the network under investigated, that is, degree of every node in the model network is in accordance with the corresponding node in the network under investigated. Meanwhile, the model network constructed can have any given community structure. It is equivalent to give the background knowledge, that is, the real classification information is known. NR model constructed here provides a way to evaluate graph clustering algorithms objectively.
Keywords/Search Tags:complex biological networks, graph clustering, sequence similarity, protein remote homology, protein-protein interaction network, protein complex, functional module, network model
PDF Full Text Request
Related items