Font Size: a A A

Study On Key Techniques Of Entity-Relation Correlation Analysis Based On Graph

Posted on:2015-07-02Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y WuFull Text:PDF
GTID:1108330509460993Subject:Information and Communication Engineering
Abstract/Summary:PDF Full Text Request
With the development of technologies of information access, storage and dissemination, it has been the hotspot of many research fields to retrieve useful knowledge from large-scale datasets quickly. Nowadays, lots of researchers devote themselves to explore highly-efficient querying and analyzing methods. However, the pervasive correlations among real entities, are overlooked. It is all these correlations that make these entities form a huge correlation network explicitly or implicitly. These correlations cover different types of data from various sources to express knowledge vividly. While the research on data content itself is much thorough, there is a long way to study correlations among entities, which is much more valuable in real applications. In order to better capture these correlations, graphs are used to model entities and their complex relationships. They have the advantage of being able to keep all the information about an entity in a single vertex and show correlations by edges connected to it. Studying the analytical technologies of entities and their correlations based on graphs can further deepen the understanding of the real world.This thesis focuses on key technologies of entity-relation analysis, by introducing the graph theory and technologies into the research of entity-relation, for purpose of efficient retrieval and analysis. Concretely speaking, the main contributions in this thesis can be summarized as follows.(1) A multi-view correlation model is proposed based on the entity-relation;besides, an efficient algorithm for constructing the correlation model is presented.People perceive the real world from different viewpoints, generating different descriptions of entities. How to express correlations among entities comprehensively and completely? Traditional models, such as relation model, graph model and so on,just ignore the expressions and operations of entity-relation from different sources.Therefore, we need to study efficient portrayal mechanisms and effective operations to express these entities and their relations. Given the pervasiveness of entityrelation, the thesis first proposes a multi-view correlation mode. By analyzing the characteristics of multi-source datasets, the correlation pattern is proposed to describe the mode of actions between entities. And the feature correlation graph is proposed to model correlations between features. Then, by combining the concept of view and feature correlation graph, we build the multi-view correlation model from multiple sources. In order to perform queries and analysis with the model, a series of operators are studied. Moreover, for efficient construction of the model, we present an algorithm based on correlation paths, followed by an improved blocked-based algorithm taking feature indices into account. To generate the final view for comprehensive analysis on condition that there is no prior knowledge of entity-relation analysis, we study an unsupervised mechanism to fuse different views. Taking the information retrieval on multiple types of datasets, an example is given to illustrate the construction and analysis process. Both theoretical analysis and experimental results on real-world datasets verified the feasibility of the proposed model and algorithms.(2) A genetic-based entity clustering analysis algorithm considering attribute description and relationships is studied, which improves the quality and effectiveness of clustering.Entities are associated with rich descriptions. Efficient entity clustering analysis, which combines the attributes and relations, is of great value for analyzing the topological structure of related entities, executing function analysis and predicting behaviors. Most existing techniques just take entity content or correlation structures into account, making it unsuitable to process multiple types of entities and their interrelations. What’s more, the fact that entities can be interrelate with each other is neglected. By analyzing the information transmission mechanism in communication system, the thesis models the attributed entity clustering problem into a minimum description length problem, and calculates a cost function for optimization. After proving that the attributed entity clustering problem is a NPC problem, the paper proposes a genetic-based clustering method as a solution. An extended label propagation strategy is proposed for chromosome initialization, and a local mutation operation with decreasing description length is studied. Both of them are used to improve the genetic-based algorithm for attributed graph clustering. Specially, theoretic analyses indicate that the time complexity is only relevant to the size of graph and average degree of vertices. Extensive experiments on real-life datasets prove the effectiveness and efficiency of our proposed methods.(3) A relation reachability analysis method is given for large correlation graphs and multi-type entities, which improves the efficiency of relation analysis on large label-constraint correlation graphs.Relationships among entities are constrained by attributes or labels. It is of great significance to decide if there is any relationship conforming to the given label constraints between any two entities. Efficient relation reachability analysis provides a foundation for relation inference and predication. Unfortunately, little existing work has studied reachability with label constraints imposed on edges/vertices of graphs. Furthermore, existing label-constraint methods are not enough efficient for large datasets. In this paper, we study the problem of answering label-constraint reachability(LCR) in a labeled directed graph, i.e., whether there exists a directed path confined to a given label set from a source vertex to a target vertex in the input graph. We propose a method to speed up the LCR computation, based on the recursive partition and independent property. The algorithm keeps the label-reach maintenance while partitioning the graph, and produces a compressed 2-hop index based on greedy expansion of labeling. Then, the label-constraint reachability can be answered with the 2-hop index. Extensive theoretic analysis and experimental investigations on both real and synthetic datasets demonstrate the efficiency of both index construction and query processing.(4) Given the fact that there are weights on edges to describe the degree of inter-relation, a weight constraint pattern matching method is proposed to find meaningful combinations satisfying users’ interests.Traditional analytical and retrieval methods ignore the interrelations between entities, and the query input cannot express users’ intention completely. For correlation model constructed from real-world applications, the strengths of correlation are described by weights on edges. According to users’ query, only the weight on edges are considered can we get a meaningful results combing entities and relations.For weighted directed graphs, the paper propose a weight constraint graph pattern matching method. Given a user-defined query pattern, it aims to find all the meaningful substructures in the data graph. The basic idea behind our algorithm is to get all-pair weight constraint reachability based on 2-Hop labeling, with a Dijkstralike strategies. Furthermore, the termination and filtration strategies are adopted to refine the initial matching set. Compared with typical existing pattern matching methods, experiments on both synthetic and real-world datasets show that our algorithm can achieve better effect.(5) A prototype supporting query and analysis of multi-type entities is designed.We designed a prototype system supporting retrievals and analysis of entityrelation from multiple sources. On one hand, it verifies the feasibility and practicability of the proposed methods. On the other hand, it exhibits the effect of entity-relation analysis, and demonstrates a potential application prospect. Upon the requirement analysis of several real examples, we design the logical and physical structure of the prototype system, and give a detailed description of some key function modules, especially the query and analysis engine, and the integrated visualization modules. Finally, by extracting entity-relations from real-world datasets,the proposed algorithms are verified and some related application examples are demonstrated.
Keywords/Search Tags:Entity-Relation, Correlation Analysis, Multi-View Correlation Model, Entity Clustering Analysis, Relation Reachability Analysis, Pattern Matching
PDF Full Text Request
Related items