
Exploiting multiple data sources for network mining

Posted on: 2014-07-06
Degree: Ph.D
Type: Thesis
University: Michigan State University
Candidate: Mandayam Comar, Prakash
Full Text: PDF
GTID: 2458390005483357
Subject: Computer Science
Abstract/Summary:
Network mining is an active research area with applications in diverse fields, including computer science, social science, and the biological sciences. However, previous studies have focused mostly on developing algorithms for mining data from a single network. Such algorithms are susceptible to imperfections in the network data, such as noisy links and node attribute values. The focus of this thesis is on exploiting multiple data sources to enhance the performance of network mining algorithms for community detection, node classification, and link prediction tasks.

The first contribution of this thesis is the development of a joint matrix factorization framework for mining multiple networks. The framework offers a principled way to perform community detection simultaneously across multiple related networks. It is also highly flexible, allowing the link structure, node attributes, and any prior knowledge about the relationship between communities in different networks to be seamlessly integrated under a unified formulation. The framework is then extended to a multi-task learning setting in which one can perform community detection on one network and node classification on another. Multi-task learning is natural for networks, given the close relationship between their link structure and node attributes. However, designing a framework for multi-task network learning requires a joint objective function that can be used for various network mining tasks while accommodating some of the existing objective functions (such as the well-known modularity measure for community detection). As a second contribution, this thesis presents a novel cost-sensitive loss function that enables joint learning for link prediction and community detection on one or more networks. The loss function addresses the class skewness and degree skewness problems inherent in most link prediction tasks. A formal proof is provided to show the equivalence between the proposed loss function and the modularity measure used in community detection. To enhance the scalability of the approach, a divide-and-conquer scheme was developed in which the learning algorithm is applied to smaller partitions of a network and the results are systematically combined using the boosting framework.

Acquiring reliable labels is crucial for network learning tasks such as link prediction and node classification. While for the most part the labels can be gleaned from the network itself, they are often incomplete and noisy, thus requiring an alternative mechanism to solicit more label information. This thesis explores the viability of using crowdsourcing technology as an external source for obtaining additional labeled data for network mining tasks. Adopting crowdsourcing for network data is non-trivial because of the difficulty of designing a human intelligence task (HIT) that can be easily handled by non-experts (i.e., the crowd). To overcome this problem, this thesis proposes an approach for transforming network data into a set of images that can be easily labeled by non-experts. The conditions under which the transformation preserves the original network data were also examined. To the best of our knowledge, this is the first study to examine the use of crowdsourcing for acquiring labels in network learning tasks.

This thesis is a step forward towards resolving some of the fundamental challenges in performing multi-source network mining. Though the methods described in this thesis were designed for network mining, some of them (e.g., the methodology to transform network data into image data) are applicable to non-network learning problems.
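For readers unfamiliar with the modularity measure mentioned in the abstract, the following is a minimal sketch of the standard (Newman-Girvan) modularity of a partition of an undirected network, Q = (1/2m) * sum_ij (A_ij - k_i k_j / 2m) * [c_i = c_j]. It is not taken from the thesis and does not reproduce the proposed cost-sensitive loss function; the function name `modularity` and the toy graph are illustrative assumptions.

```python
import numpy as np


def modularity(adjacency, labels):
    """Standard modularity of a node partition (illustrative helper, not the thesis code).

    adjacency: symmetric (n, n) array of edge weights for an undirected graph.
    labels: length-n sequence giving the community assigned to each node.
    """
    A = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = A.sum(axis=1)      # k_i
    two_m = degrees.sum()        # 2m, twice the total edge weight
    if two_m == 0:
        return 0.0               # no edges: modularity is taken as 0 here
    # Boolean mask that is True where nodes i and j share a community.
    same_community = labels[:, None] == labels[None, :]
    # Modularity matrix B_ij = A_ij - k_i * k_j / (2m).
    B = A - np.outer(degrees, degrees) / two_m
    return float((B * same_community).sum() / two_m)


# Toy network (assumed for illustration): two triangles joined by one edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_split = modularity(A, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_single = modularity(A, [0, 0, 0, 0, 0, 0])  # everything in one community
```

On this toy graph, placing each triangle in its own community gives a higher modularity than placing all nodes in a single community, which is the behavior a community detection objective built on modularity rewards.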
Keywords/Search Tags:Network, Data, Community detection, Multiple, Link prediction