Font Size: a A A

Mapreduce-Based Entity Matching Technology

Posted on:2015-02-08Degree:MasterType:Thesis
Country:ChinaCandidate:J ChenFull Text:PDF
GTID:2268330431962958Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Aiming at finding matching records belonging to the same real-world objects, entity matching has been studied for decades. It is often applied in the scenario of data cleaning. Once data is consistent and accurate after matching and cleaning, it can better support specific business applications. The problems of entity matching can be concluded into two categories:how to effectively find the matching records and how to efficiently perform the matching task. However, with data volume increasing and data quality lowering recently, entity matching is encountered with big challenges in above two aspects.On one hand, how to find more matching records and avoid less non-matching records is quite important. Given a labeled data, classification models can be trained to improve the matching quality. Among these models, rule-based methods are quite fast, easy and plain, making it widely acceptable. Finding the proper rules, distance functions and thresholds is the core topic for rule-based methods. Our work can automatically find the proper rules, distance functions and thresholds without any human experience.On the other hand, how to efficiently execute the task of entity matching is also needed to consider. Pair-wise comparisons among record pairs are not necessary and blocking functions can take similar record pairs as candidates for further comparisons. Although using blocking functions can prune quite a lot of irrelevant record pairs, the number of candidate pairs is still huge enough. The MapReduce (MR) framework enables multiple map (/reduce) subtasks running in parallel to boost query processing. Thus it is essential to devise novel MR-based solutions to process entity matching. However, entity matching upon the MapReduce framework is non-trivial due to two inevitable challenges: load balancing and pair deduplication. We propose a novel general model, called MrEm, to support the MR-based entity matching with multiple blocking functions. Then we device specific techniques to handle above two challenges and integrate them into our model.In conclusion, the contributions of this paper are as follows:●Automatically learning for the rule-based methods. At first, we define a new concept based on F-score to measure the appropriateness of how to set the proper thresholds, how to choose the proper distance functions and rules. Meantime, we define how to measure the distance between two records under the specific rules. Following, We propose a heuristic algorithm to find proper thresholds, distance functions and rules automatically, which overcomes a major challenge of the rule-based entity matching. Finally, the experimental results on real data sets illustrate the high effectiveness of our method.· MR-based entity matching with multiple blocking functions. First of all, we present two baseline solutions to highlight the challenges of handling multiple blocking functions. The first solution (Naive) considers neither pair deduplication nor load balancing, just dispatching all records to reducers according to the block-ing keys generated by blocking functions. In contrary, the second solution (Sim-ple) overemphasizes these factors:mappers generate and dispatch every record pair within the same block to reducers. However, the significant additional overhead such as I/O operators in the shuffle phase even decreases the overall performance. Following, we discuss the shortcomings of Dedoop since it will bring into the new problem of load balancing. Then, we present a novel generalized model (MrEm) to address both challenges. Our model contains four stages, including block building, interface implementation, pair verification, and pair cleaning(optional). Stages1,3, and4construct the mainstream workflow to process data, and Stage2is responsible for dealing with two challenges:load balancing and pair deduplication. Meantime, we propose two concrete implementations to support the framework:MrEm-SI and MrEm-NSI. Finally, we conduct a series of experiments on real and synthetic data sets to illustrate high efficiency and scalability of the proposed solutions. We also discuss several critical parameters.
Keywords/Search Tags:Entity matching, Rule-based methods, Load balancing, Pair dedupli-cation, Distributed system
PDF Full Text Request
Related items