Font Size: a A A

Research On Query And Analysis In Heterogeneous Information Networks

Posted on:2017-02-16Degree:DoctorType:Dissertation
Country:ChinaCandidate:D YinFull Text:PDF
GTID:1108330503469765Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Information networks represent entities and the relations between entities in real world. With the technological advancements and internet penetration, the applications of information networks are various, such as social networks, biological networks and traffic networks. Information networks can be modeled by graph data model, which has two elements: Nodes and edges. Nodes and edges in information networks correspond to entities and relations in real world, respectively. According to the numbers of types of nodes and edges, information networks can be classified into two categories: Homogeneous information networks and heterogeneous information networks. In homogeneous information networks, the types of nodes and edges are both identical, such as friendship networks and coorperation networks. While, in heterogeneous information networks, there exist multiple types of nodes and relations. Most real world information networks are heterogeneous. Heterogeneous information networks are powerful and expressive representations of general real-world interactions between different kinds of network entities in diverse domains, such as knowledge graphs, social networks and cyber-physical system. Rich information is implied so that it is especially important to do research work on query and analysis in heterogeneous information networks. This thesis focuses on query and analysis problems in heterogeneous information networks by combining the complicated structures and rich information, based on algorithm theory, data analysis and computational complexity. The main contributions of this thesis are summarized as follows.1. We investigate the reachability problem in heterogeneous information networks.Rearchability query is to find if there exist paths linking two nodes, which is a type of basic query in information networks. When we investigate the relevance of nodes, the first we consider is the reachability of two nodes. However, reachability query in homogeneous information networks does not involve the types of nodes and edges, which is also based on directed acyclic graphs. While, in heterogeneous information networks,cycles often exist. It will lose the path information between different types of nodes if we compress a cycle into a node in heterogeneous information networks. The existing work can not solve the problem of reachability query in heterogeneous information networks. We formally define the problem and prove the problem is PTIME. However, with the increasing sizes of information networks, the cost of scanning the networks once can not be ignored. Therefore, in order to respond queries efficiently, a novel index structure called MP index is proposed, which decomposes meta paths and constructs the partial graph of meta paths, to precompute the reachability of nodes in meta paths selected from the partial graph. Multiple queries can share the same reachability information of index.Experiments on real world and synthetic data demonstrate the efficiencies of algorithms.2. We investigate the aggregate problem in heterogeneous information networks. Aggregation allows users observe the data summary from a specific perspective and granularity, which is the foundation of multiple dimension data analysis. Aggregation in homogeneous information networks is based on the dimensions of attributes of nodes,without considering the types of nodes, types of edges and the structures of networks.Heterogeneous information networks include multiple types of relations, which should be taken into consideration for aggregation. The existing work of aggregation on information networks can not be used on heterogeneous information networks. We investigate the aggregate problem on large-scale heterogeneous information networks, whose dimensions are types of nodes, attributes of nodes and types of edges. Moreover, by considering both attributes and structures, we propose a novel function based on graph entropy to measure the similarities of nodes on different types of edges. Further, we prove the aggregation problem is NP-hard. Therefore, we develop an algorithm for aggregation in two phases: Informational dimension aggregation and structural dimension aggregation. The algorithm has linear time and space complexity. We prove the approximate ratio of the algorithm. Extensive experiments on real world data sets demonstrate the algorithm can analyse the networks from a specific perspective effectively. Meanwhile, the algorithm has good scalability.3. We investigate the cube computation problem in heterogeneous information networks. Cube provides users with observing the networks from different aspects, which is the core technique of multiple dimension data analysis. Due to the limitations of aggregation on homogeneous information networks, the computation of partial cube materialization only relies on the attributes. Select partial cube to materialize according to the inclusion relations between the subsets of attributes. The complexity of dimensions in heterogeneous information networks makes the existing work not applicable to heterogeneous information networks. We construct the cube model for heterogeneous information networks including multiple dimensions: Types of nodes, attributes of nodes and meta paths. The problem of partial cube materialization on heterogeneous information networks is studied. This optimization problem is proved to be NP hard. A greedy algorithm is proposed for partial cube materialization based on two interesting dependencies between aggregate graphs: Attribute dependency and path dependency, which are used for defining cost model and constructing lattice. We prove the approximate ratio of the greedy algorithm. Experiments on real world data sets show the algorithms can analyze the networks effectively and the partial cube materialization algorithm is efficient.4. We investigate iceberg cube problem in heterogeneous information networks. Iceberg cube is to compute the cube whose aggregate results which are larger than a specific threshold, which is an important operation of multiple dimension data analysis. The existing work of iceberg cube on information networks is based on the attributes of nodes in homogeneous information networks. Obviously, this is not suitable for the iceberg cube of heterogeneous information networks with complicated semantics and structures. For heterogeneous information networks, iceberg cube involves attribute dimension, type dimension and structure dimension. Moreover, the aggregate functions are more complex.We need a new iceberg cube definition for describing the complex semantics and structures of heterogeneous information networks. The iceberg cube problem in heterogeneous information networks is formally defined. We prove the problem is NP-hard. A general approximate algorithm is proposed for fast aggregation, which devises random walk on meta-paths to estimate the similarities. We prove the upper bound of the similarity relative error. Two effective pruning strategies are designed for pruning the cuboids or terminating the aggregation, when the aggregate function is monotonic. Experiments over real world and synthetic networks demonstrate the algorithms can discover the interesting aggregate results for users, and the approximate algorithm and pruning strategies are efficient.
Keywords/Search Tags:Heterogeneous information network, meta path, reachability, aggregate graph, cube, iceberg cube
PDF Full Text Request
Related items