Font Size: a A A

On Graph Perturbation Theory and Algorithms for Scalable Mining of Noisy and Uncertain Graph Data with Knowledge Priors

Posted on:2011-02-28Degree:Ph.DType:Thesis
University:North Carolina State UniversityCandidate:Hendrix, William ThomasFull Text:PDF
GTID:2448390002455232Subject:Computer Science
Abstract/Summary:
The solutions to many interesting questions lie in the ability to extract sets of tightly connected or correlated elements from a mass of relational-type data. For example, a researcher may formulate various questions about social communities within a social network, correlated stocks in market data, or proteins in a protein interaction network that form functional modules. However, such questions are often difficult to answer, as the data may exhibit noise or uncertainty, objects may belong to multiple groups, the data to be analyzed may be very large in scale, and the results of a naive algorithm may not be relevant to the user's interests. In this thesis, we make several novel contributions to solving this problem, including applying the knowledge priors of a domain expert to the algorithmic search process, as well as introducing the concept of Graph Perturbation Theory in order to cope with uncertainty.;In our first component, we tackle the problems of noisy and uncertain data by introducing the concept of Graph Perturbation Theory, which consists of calculating graph properties as the graph undergoes some perturbation, or change. This perturbation may represent some new knowledge, reinterpretation of old data, or exploratory "what if"-type scenario. We develop and apply Graph Perturbation Theory to the problem of enumerating the fully connected subgraphs of a graph. The advantage in modeling our correlated groups as maximal cliques is that cliques can produce overlapping groups, and their fully connected nature ensures a strong positive relationship despite uncertainty in the individual edges present. By enumerating the set of all such maximal cliques of the graph, we can exhaustively describe the correlations present in the graph and avoid relying on heuristic methods. Our application of Graph Perturbation Theory to this problem resulted in speedups of up to 120 times over enumerating the maximal cliques of the perturbed algorithm using a state-of-the-art clique enumeration algorithm.;In our second component, we harness the knowledge priors of domain experts by allowing a user to specify the "items of interest" that the correlated items should be enriched with respect to, allowing us to constrain the search space of the algorithm as well as improve the chance of discovering sets of identifying correlated items relevant to the user's interests. We tackle the problem of incorporating experts' knowledge priors into the search for overlapping groups in noisy and uncertain data by proposing a novel subgraph structure, the mu, gamma-quasi-clique. Our definition, which incorporates the information contained the user's query, is a relaxation of the fully connected clique property, and enables us to better compensate for missing, or false negative, relationships in our network, which would fracture a single large group into multiple, highly overlapping cliques. We evaluate the effectiveness of our approach by demonstrating its scalability to larger graphs and reproduce biological results that establish the practical value of our work.;In our third component, we address the large scale nature of modern datasets by developing parallel strategies for Graph Perturbation Theory algorithms. As Graph Perturbation Theory algorithms utilize various indices to facilitate calculating graph properties of the perturbed graph, we discuss techniques for dealing with the highly data-intensive nature of such algorithms. To establish the effectiveness of our proposed techniques, we implement a parallel version of the perturbed clique enumeration algorithm presented in the first component, developing additional theory that allows us to decouple the action of this algorithm from its previous computation. We present detailed empirical results demonstrating the scalability of our algorithm as well as the effectiveness of the various techniques we propose.
Keywords/Search Tags:Graph perturbation theory, Algorithm, Data, Knowledge priors, Noisy and uncertain, Correlated, Connected
Related items