
Design and Implementation of a Data Integration System Based on Similarity Join

Posted on: 2015-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: Z Q Pang
Full Text: PDF
GTID: 2308330482457265
Subject: Computer technology
Abstract/Summary:
With the rapid development of IT, the Internet generates large amounts of data every day, all of which must be processed. Many new frameworks based on distributed file systems have been proposed to store massive data, and many programming models are being applied to it. One of the most popular parallel programming models is MapReduce, proposed by Google. Similarity join is one of the important operations in data integration: it discovers all entity pairs whose similarity is higher than a given threshold in a group of datasets. In data integration, similarity join is used for data cleaning, data deduplication, and entity resolution. However, as data scale grows daily, detecting such similar pairs is a great challenge, because applications are increasingly expected to handle volumes of data that do not fit in the main memory of one machine. Similarity join can be processed in parallel, so we adopt MapReduce to handle similarity join over large data and improve computational efficiency. This thesis focuses on massive data sources and implements a big data integration system based on similarity join to integrate massive data effectively. The system is built on the MapReduce framework and handles entity recognition in three core phases. In the first phase, the system discovers all similar entity pairs in a group of datasets using similarity join techniques. Then, the system partitions the similar entity pairs to obtain all similarity sub-graphs. Finally, the entity recognition process is completed by sampling on the sub-graphs. The thesis mainly studies similarity joins in a distributed setting and task optimization in the MapReduce framework.
For similarity joins on the MapReduce framework, we propose an all-prefix filter algorithm, extend the suffix filter algorithm based on prefix and position information, and design a pipelined hybrid filter framework that improves efficiency by reducing the number of candidate pairs. For MapReduce tasks, the thesis optimizes in two respects: reducing network cost between cluster nodes through data compression, and improving the efficiency of parallel tasks through task load balancing. Finally, we implement an intellectual property search prototype system based on big data integration. To verify the efficiency of the similarity join algorithms and scheduling strategies proposed in this thesis, we conduct extensive experiments on real datasets from DBLP and CiteSeerX, comparing the time cost of different similarity join algorithms. The experimental results indicate that, as data scale increases, our similarity filter framework and load balancing algorithm show increasingly significant advantages. The big data integration system proposed in this thesis provides a user-friendly interface that helps users perform data integration based on similarity join on massive datasets.
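As a rough illustration of how prefix filtering can reduce the number of candidate pairs, the sketch below shows the standard single-machine prefix filter for a Jaccard self-join. It is not the thesis's all-prefix filter algorithm or its MapReduce pipeline, whose details the abstract does not give. The Jaccard measure, the function names, and the token-set record format are all assumptions made for the example.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Self-join: return sorted id pairs whose Jaccard similarity >= threshold.

    records   -- dict mapping a record id to a set of tokens (assumed format)
    threshold -- Jaccard threshold in (0, 1]
    """
    # Global token order: rare tokens first, so prefixes are selective.
    freq = defaultdict(int)
    for tokens in records.values():
        for tok in tokens:
            freq[tok] += 1

    index = defaultdict(list)  # token -> ids of records with it in their prefix
    candidates = set()
    for rid, tokens in records.items():
        ordered = sorted(tokens, key=lambda tok: (freq[tok], tok))
        n = len(ordered)
        # Two records with Jaccard >= threshold must share a token within
        # their first n - ceil(threshold * n) + 1 tokens. The small epsilon
        # guards against float round-up; a longer prefix is always safe.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        for tok in ordered[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Verification step: compute the exact similarity only for candidates.
    return sorted(
        (a, b) for a, b in candidates
        if jaccard(records[a], records[b]) >= threshold
    )
```

Only pairs that share a prefix token reach the verification step, which is where the saving over an all-pairs comparison comes from.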
Keywords/Search Tags:similarity join, entity resolution, MapReduce, data integration, load balance