Exploring the power of heterogeneous information sources

Posted on: 2012-06-06
Degree: Ph.D
Type: Thesis
University: University of Illinois at Urbana-Champaign
Candidate: Gao, Jing
Full Text: PDF
GTID: 2458390008495589
Subject: Computer Science
Abstract/Summary:
The big data challenge presents a unique opportunity for both data mining and database research and engineering. A vast ocean of data is collected from trillions of connected devices in real time on a daily basis, and useful knowledge is usually buried in data of multiple genres, from different sources, in different formats, and with different types of representation. Many interesting patterns cannot be extracted from a single data collection, but have to be discovered through the integrative analysis of all available heterogeneous data sources. Although many algorithms have been developed to analyze multiple information sources, real applications continuously pose new challenges: data can be gigantic, noisy, unreliable, dynamically evolving, highly imbalanced, and heterogeneous. Meanwhile, users provide limited feedback, have growing privacy concerns, and ask for actionable knowledge. In this thesis, we propose to explore the power of multiple heterogeneous information sources in such challenging learning scenarios. There are two interesting perspectives in learning from the correlations among multiple information sources: exploring their similarities (consensus combination) or their differences (inconsistency detection).

In consensus combination, we focus on the task of classification with multiple information sources. Multiple information sources for the same set of objects can provide complementary predictive power, and by combining their expertise, the prediction accuracy is significantly improved. However, the major challenge is that it is hard to obtain sufficient and reliable labeled data for effective training, because labeling requires the effort of experienced human annotators. In some data sources, we may only have a large amount of unlabeled data. Although such unlabeled information does not directly generate label predictions, it provides useful constraints on the classification task.
Therefore, we first propose a graph-based consensus maximization framework to combine multiple supervised and unsupervised models obtained from all the available information sources. We further demonstrate the benefits of combining multiple models in two specific learning scenarios. In transfer learning, we propose an effective model combination framework to transfer knowledge from multiple sources to a target domain with no labeled data. We also demonstrate the robustness of model combination on dynamically evolving data.

On the other hand, when unexpected disagreement is encountered across diverse information sources, this may raise a red flag and require in-depth investigation. Another line of research in this thesis explores differences among multiple information sources to find anomalies. We first propose a spectral method to detect objects that behave inconsistently across multiple heterogeneous information sources as a new type of anomaly. Traditional anomaly detection methods discover anomalies based on the degree of deviation from normal objects in one data source, whereas the proposed approach detects anomalies according to the degree of inconsistency across multiple sources. The principle of inconsistency detection can benefit many applications, and in particular, we show how this principle can help identify anomalies in information networks and distributed systems. We propose probabilistic models to detect anomalies in a social community by comparing link and node information, and to detect system problems from connected machines in a distributed system by modeling correlations among multiple machines.

In this thesis, we go beyond the scope of traditional ensemble learning to address challenges faced by many applications with multiple data sources.
With the proposed consensus combination framework, labeled data are no longer a requirement for successful multi-source classification; instead, the use of existing labeling experts is maximized by integrating knowledge from relevant domains and unlabeled information sources. The proposed concept of inconsistency detection across multiple data sources opens up a new direction in anomaly detection. The detected anomalies, which cannot be found by traditional anomaly detection techniques, provide new insights into the application area. The algorithms we developed have proven useful in many areas, including social network analysis, cyber-security, and business intelligence, and have the potential to be applied to many other areas, such as healthcare, bioinformatics, and energy efficiency. As both the amount of data and the number of sources in our world continue to explode, there remain great opportunities as well as numerous research challenges for the inference of actionable knowledge from multiple heterogeneous sources of massive data collections.
Keywords/Search Tags:Sources, Data, Heterogeneous, Multiple