Research On Key Techniques Of Entity Resolution For Big Data Integration

Posted on: 2015-02-22
Degree: Master
Type: Thesis
Country: China
Candidate: W J Li
GTID: 2308330482960235
Subject: Computer software and theory
Abstract/Summary:
The Internet generates large amounts of data every day, all of which must be processed. Many new distributed storage frameworks have been proposed, and the parallel computing models built on them are an active research topic. MapReduce and the BSP-based Pregel, both proposed by Google, are the most typical of these models. With them, many big data problems can now be solved efficiently on large server clusters.

Entity resolution, also called record linkage, is the task of finding records that represent the same entity across different data sources. In data integration, it is used for de-duplication during data cleaning and for similarity joins between datasets. It is widely applied in areas such as population censuses, reference recognition, web search, data cleaning, and plagiarism checking. As data volumes grow, however, a single machine becomes a bottleneck when resolving entities over hundreds of gigabytes of data, and it cannot handle data at the terabyte or petabyte scale at all. Because entity resolution can be processed in parallel, MapReduce and BSP can be adopted to handle it on large-scale data and to improve its efficiency.

Building on a study of the key techniques of entity identification, this thesis proposes an entity matching strategy based on MapReduce and a similar subgraph building strategy based on the BSP model. Entity resolution can be divided into two phases: entity matching and entity merging. Entity matching finds all pairs of records in the data source whose similarity meets a threshold. Entity merging groups the similar record pairs produced by entity matching into similar subgraphs and merges all records that belong to the same similar subgraph.

For entity matching, this thesis proposes new methods that extend the PPJoin algorithm with a mapping table and binary search. By using a mapping table and binary search instead of an inverted list, the new methods speed up the verification of similarity between records and improve the efficiency of record matching while keeping the original filtering effect. For similar subgraph building, this thesis proposes new methods based on BSP. They replace job iterations with superstep iterations, reduce the number of iterations through asynchronous communication, and control the iterations by controlling the number of nodes, which improves the efficiency of building similar subgraphs.

The experiments were run on Hadoop and Hama using real datasets from ACM and DBLP. For entity matching, the algorithms based on the mapping table and binary search were compared with PPJoin on Hadoop. The results show that the new algorithms perform considerably better and that their performance remains stable across different thresholds. For similar subgraph building, the BSP-based algorithms on Hama were compared with the MapReduce-based algorithm on Hadoop. The results show that the BSP-based algorithms outperform the MapReduce-based one.
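
The two phases described in the abstract can be illustrated with a small single-machine sketch. It is not the thesis's implementation: it assumes Jaccard similarity over word tokens, compares all pairs by brute force (no PPJoin filtering, mapping table, or binary search), and imitates BSP supersteps with a plain loop in which each record adopts the smallest label among itself and its neighbours. The names `jaccard`, `match_entities`, `build_similar_subgraphs`, and the sample records are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def match_entities(records, threshold):
    """Entity matching: return all id pairs whose similarity meets the threshold.

    Brute-force comparison for illustration only; no prefix filtering.
    """
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    return [
        (r1, r2)
        for r1, r2 in combinations(sorted(tokens), 2)
        if jaccard(tokens[r1], tokens[r2]) >= threshold
    ]


def build_similar_subgraphs(record_ids, pairs):
    """Entity merging: group records into connected components.

    Each pass of the loop plays the role of one superstep: every record
    adopts the smallest label among itself and its neighbours, and the
    loop stops when no label changes.
    """
    neighbours = {rid: set() for rid in record_ids}
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    label = {rid: rid for rid in record_ids}
    changed = True
    while changed:
        changed = False
        new_label = dict(label)
        for v in record_ids:
            best = min([label[v]] + [label[n] for n in neighbours[v]])
            if best < label[v]:
                new_label[v] = best
                changed = True
        label = new_label

    groups = {}
    for v, component in label.items():
        groups.setdefault(component, []).append(v)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sample records, not taken from the ACM or DBLP datasets.
records = {
    "r1": "entity resolution for big data",
    "r2": "entity resolution for big data integration",
    "r3": "big data entity resolution",
    "r4": "graph processing with pregel",
}
pairs = match_entities(records, threshold=0.7)
groups = build_similar_subgraphs(sorted(records), pairs)
print(pairs)   # [('r1', 'r2'), ('r1', 'r3')]
print(groups)  # [['r1', 'r2', 'r3'], ['r4']]
```

In this example r2 and r3 fall below the threshold when compared directly, yet they end up in the same similar subgraph because both match r1, which is why merging is treated as a separate graph phase after matching.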
Keywords/Search Tags: entity resolution, MapReduce, BSP, entity matching, entity merging, similar subgraph