
Entity Resolution Based On Crowdsourcing And Two-tiered Correlation Clustering

Posted on: 2015-03-16
Degree: Master
Type: Thesis
Country: China
Candidate: J Li
Full Text: PDF
GTID: 2268330425488905
Subject: Computer Science and Technology
Abstract/Summary:
In the real world, the same entity can be described by different records from multiple sources. The goal of Entity Resolution is to identify which records refer to the same real-world entity. Entity Resolution is a crucial step in cleaning data and integrating data from multiple sources, and it is now commonly used to improve data quality and to enrich data for more detailed analysis. However, with the advent of the big data era, a variety of data quality issues pose unprecedented challenges to Entity Resolution. As a consequence, traditional Entity Resolution methods perform poorly in both efficiency and effectiveness, and especially in their immunity to noisy data. In terms of noise immunity, traditional methods often produce inconsistent pairwise judgments. To address this problem, traditional approaches typically perform a transitive closure analysis on the matching pairs but neglect the non-matching pairs. As a result, these approaches not only favor the matching pairs one-sidedly but also propagate erroneous information.

Correlation Clustering is a standard method for Entity Resolution: it takes all pairwise results as evidence and produces a clustering that maximizes agreement with them. Correlation Clustering is NP-hard, and many heuristic algorithms have been proposed. However, these algorithms tend to be coarse, since they only pursue mathematical guarantees. Therefore, in this paper, we propose an efficient and effective Entity Resolution algorithm based on Correlation Clustering, with relatively strong noise immunity and scalability. The main work is as follows:

(1) A novel two-tiered correlation clustering framework is proposed. In this framework, the top tier employs a pre-partition algorithm that allows overlapping clusters; the bottom tier uses an adjustment algorithm to remove the overlaps introduced by the top tier.

(2) The concept of common neighborhood is introduced into the Correlation Clustering problem, and a neighborhood-based method for calculating neighborhood similarity is proposed. This paper first analyzes how to represent a cluster using neighborhoods and then introduces a heuristic pre-partition algorithm.

(3) The concept of the kernel of a cluster is proposed and used to define the degree of association between nodes and clusters. The association degree defined through the kernel can be more accurate, thereby improving the accuracy of Entity Resolution. Based on this, we propose a heuristic adjustment algorithm.

(4) Crowdsourcing is introduced into the top-tier pre-partition algorithm, allowing people to verify the pairs selected to produce clusters. Because the top-tier pre-partition algorithm produces clusters sequentially, we propose a parallel verification algorithm and an optimal variant of it in order to shorten the duration of crowdsourcing verification.

Experimental results demonstrate that the Entity Resolution approach based on crowdsourcing and two-tiered Correlation Clustering proposed in this paper performs better in effectiveness, noise immunity, and scalability than traditional algorithms.
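
The contrast drawn in the abstract between transitive closure and Correlation Clustering can be illustrated with a minimal Python sketch. It is not the two-tiered algorithm of the thesis: the record names, the toy judgments, and the helper functions `transitive_closure`, `pivot_clustering`, and `disagreements` are all hypothetical, and the pivot heuristic is a simple stand-in for a generic Correlation Clustering heuristic. The sketch only shows how a single noisy "match" judgment is propagated by transitive closure, while a clustering that also weighs non-matching pairs disagrees with far fewer judgments.

```python
from itertools import combinations


def transitive_closure(nodes, match_pairs):
    """Cluster with union-find over the 'match' pairs only (non-matches ignored)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in match_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


def pivot_clustering(nodes, match_pairs):
    """Stand-in heuristic: take the first remaining node as pivot and group it
    with the remaining nodes judged to match it. Assumption: this is NOT the
    thesis's pre-partition or adjustment algorithm."""
    matches = {frozenset(p) for p in match_pairs}
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = remaining[0]
        cluster = [pivot] + [n for n in remaining[1:]
                             if frozenset((pivot, n)) in matches]
        clusters.append(cluster)
        remaining = [n for n in remaining if n not in cluster]
    return clusters


def disagreements(clusters, nodes, match_pairs):
    """Count record pairs whose cluster placement contradicts the pairwise judgment."""
    label = {n: i for i, c in enumerate(clusters) for n in c}
    matches = {frozenset(p) for p in match_pairs}
    cost = 0
    for a, b in combinations(nodes, 2):
        together = label[a] == label[b]
        judged_match = frozenset((a, b)) in matches
        if together != judged_match:
            cost += 1
    return cost


# Toy data (hypothetical): two true entities, plus one noisy match (r3, r4).
records = ["r1", "r2", "r3", "r4", "r5", "r6"]
judged = [("r1", "r2"), ("r1", "r3"), ("r2", "r3"),
          ("r4", "r5"), ("r4", "r6"), ("r5", "r6"),
          ("r3", "r4")]  # the noisy judgment

tc = transitive_closure(records, judged)
pv = pivot_clustering(records, judged)
print(tc, disagreements(tc, records, judged))  # one merged cluster, 8 disagreements
print(pv, disagreements(pv, records, judged))  # two clusters, 1 disagreement
```

In this toy case the noisy pair causes transitive closure to merge all six records, contradicting the eight unjudged (non-matching) pairs, whereas the clustering that respects non-matches contradicts only the noisy judgment itself.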
Keywords/Search Tags: entity resolution, correlation clustering, neighborhood, crowdsourcing