Research On Entity Resolution Framework And Key Techniques For Big Data Integration

Posted on: 2014-05-13 | Degree: Master | Type: Thesis
Country: China | Candidate: M Wang
GTID: 2348330473453885 | Subject: Computer technology
Abstract/Summary:
The Internet generates large volumes of data every day, all of which must be processed. Many distributed storage frameworks have been proposed in response, and the parallel programming models built on them are an active research topic. One of the most representative parallel programming frameworks is MapReduce, proposed by Google, which allows many big data problems to be solved efficiently on large server clusters. Entity resolution is the task of finding records that refer to the same entity across different data sources; it is also known as record linkage. In data integration, entity resolution is used for data cleaning, de-duplication and similarity joins. It is widely applied in areas such as population censuses, reference recognition, web search, plagiarism checking and others. As data volumes grow daily, however, a single machine becomes a bottleneck when resolving entities over hundreds of gigabytes of data, and it cannot handle terabyte- or petabyte-scale data at all. Because entity resolution can be processed in parallel, we adopt MapReduce to handle entity resolution on large-scale data and to improve its efficiency.
This thesis proposes VEER, a MapReduce-based framework that performs entity resolution in three steps. In the first step, given a group of datasets, VEER discovers all entity pairs whose similarity exceeds a given threshold. This step builds on algorithms and techniques for similarity joins. In the second step, VEER uses MapReduce algorithms to compute all similarity sub-graphs from the pairs of similar records. Finally, for the third step, we propose multiple algorithms for merging the records that belong to the same similarity graph.
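The three steps described above can be illustrated with a small single-machine sketch. It is not the VEER implementation and does not use MapReduce or Hadoop; it only assumes, for illustration, Jaccard similarity over whitespace tokens, a naive all-pairs comparison, connected components as the similarity sub-graphs, and a hypothetical merge rule that keeps the longest record. All function names and the sample records are invented for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed metric, not from the thesis)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similar_pairs(records, threshold):
    """Step 1: record-id pairs whose similarity exceeds the threshold (naive all-pairs)."""
    tokens = {rid: set(text.lower().split()) for rid, text in records.items()}
    return [(a, b) for a, b in combinations(sorted(tokens), 2)
            if jaccard(tokens[a], tokens[b]) > threshold]


def similarity_subgraphs(ids, pairs):
    """Step 2: group ids into connected components using union-find."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


def merge(records, group):
    """Step 3: hypothetical merge rule that keeps the longest record text."""
    return max((records[i] for i in group), key=len)


# Invented sample data for illustration only.
records = {
    1: "entity resolution with mapreduce",
    2: "entity resolution with mapreduce framework",
    3: "prefix filtering for similarity joins",
}
pairs = similar_pairs(records, 0.75)
groups = similarity_subgraphs(records, pairs)
merged = [merge(records, g) for g in groups]
print(pairs, groups, merged)
```

Each function takes the output of the previous step as plain data, which mirrors the idea that a task could begin from raw records, from record pairs, or from already computed sub-graphs.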
With this design, VEER lets users start an entity resolution task from different states of the data: the original datasets, record-pair data, or even similarity graphs. We implement VEER on Hadoop. To improve efficiency on big data, we focus on distributed similarity joins and on the task scheduling strategies in our framework. For distributed similarity joins, we propose filtering algorithms based on prefix and position information that improve efficiency by reducing the number of record pairs that must be compared. For the different stages of the framework, we propose scheduling algorithms for load balancing and for sub-graph construction, all of which are effective. We implement a prototype system based on Hadoop and the VEER framework; it provides a user-friendly interface that helps users run their similarity tasks efficiently and improves the efficiency of the server cluster.
The similarity join algorithms and scheduling strategies proposed in this thesis are used in the VEER system. We use real datasets from DBLP and CiteSeerX in our experiments. Through extensive experiments, we compare the different similarity join algorithms on time cost and cluster utilization. The experimental results show that these algorithms run stably on a cluster, that the individual nodes finish their tasks in almost the same time, and that as the data scale grows our algorithms show an increasingly significant advantage over existing approaches. The interface also lets users execute their tasks and view the entity resolution results.
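To show how a prefix-based filter can reduce the number of compared pairs, here is a minimal single-machine sketch of the standard prefix filtering principle for a Jaccard self-join. It is an assumption-laden illustration rather than the thesis's algorithm: the thesis's own filters, their use of position information, and their MapReduce formulation are not reproduced here, and the function names and sample records are invented.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed metric)."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return (verified_pairs, candidate_count) for a Jaccard self-join.

    Tokens are sorted by a global order (rare tokens first). Two records
    with Jaccard >= threshold must share a token in their prefixes of
    length len(x) - ceil(threshold * len(x)) + 1, so only records that
    collide on a prefix token are verified.
    """
    token_sets = {rid: set(text.lower().split()) for rid, text in records.items()}
    freq = Counter(tok for toks in token_sets.values() for tok in toks)
    ordered = {rid: sorted(toks, key=lambda t: (freq[t], t))
               for rid, toks in token_sets.items()}

    index = defaultdict(list)  # prefix token -> ids of records indexed so far
    candidates = set()
    for rid in sorted(ordered):
        toks = ordered[rid]
        if not toks:
            continue
        prefix_len = len(toks) - math.ceil(threshold * len(toks)) + 1
        for tok in toks[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Verification: compute the exact similarity only for candidate pairs.
    verified = sorted(p for p in candidates
                      if jaccard(token_sets[p[0]], token_sets[p[1]]) > threshold)
    return verified, len(candidates)


# Invented sample data for illustration only.
records = {
    1: "entity resolution with mapreduce",
    2: "entity resolution with mapreduce framework",
    3: "prefix filtering for similarity joins",
}
result_pairs, candidate_count = prefix_filter_join(records, 0.75)
print(result_pairs, candidate_count)
```

On this sample, only one candidate pair reaches verification, whereas a naive join would compare all three pairs. A positional filter would tighten the candidate set further by bounding the achievable overlap from the positions of the shared prefix tokens; it is omitted here to keep the sketch short.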
Keywords/Search Tags: entity resolution, similarity metric, MapReduce, load balance, similarity joins