Random-Based Distributed Entity Matching Technology

Posted on: 2016-08-07
Degree: Master
Type: Thesis
Country: China
Candidate: P F Chao
GTID: 2308330461475933
Subject: Software engineering
Abstract/Summary:
Entity matching aims to identify records that represent the same real-world entity or object. It improves data quality by filtering and merging multiple records that refer to the same entity, and thereby helps improve the performance of data analysis and management. As an important technique in data integration, entity matching has been studied for decades. Traditional entity matching research focuses on improving matching accuracy and efficiency for structured data. Most existing work improves matching accuracy by optimizing data tagging, feature extraction, and similarity metrics, and improves matching performance by adding prefix and suffix filters and blocking strategies. Most research in recent years has likewise focused on matching structured data. As data becomes sparser and noisier and the proportion of unstructured data increases, the performance of traditional techniques is constrained. At the same time, with the explosive growth of data on the Internet in recent years, traditional centralized entity matching approaches are limited by the performance bottleneck of a single computer and cannot meet current data processing requirements. Distributed entity matching on unstructured data has therefore become a research hotspot.

Distributed systems excel at parallel computation and scalability, and can cope with rapidly growing data processing needs by adding processing nodes. By exploiting these advantages, entity matching can reach a higher level of matching performance. However, two factors can affect matching performance on a distributed system: load balancing and network transmission overhead. Load imbalance is a widespread problem in distributed systems. It is caused by the uneven distribution of data across nodes, which reduces task parallelism. Network transmission overhead arises during the data transmission step.
If a huge amount of data passes through the network, it causes severe transmission delay, which extends the task execution time.

No existing distributed entity matching method solves both of the issues mentioned above. Moreover, most blocking-based matching methods may produce duplicate records, which causes redundant computation. In addition, most existing research focuses on matching entities in structured data and cannot handle unstructured data efficiently. To solve these problems, we propose a high-speed entity matching method based on randomized algorithms that matches unstructured data efficiently. We also propose a distributed framework for random-based distributed entity matching that addresses the load balancing and network transmission overhead problems. Subsequently, we propose two redundancy elimination methods to solve the redundant computation problem, and we integrate them into our matching framework.

The main contributions of our work are as follows:

· High-speed random-based entity matching for unstructured data. We propose a high-speed entity matching method for unstructured data. First, dimensionality reduction with Locality Sensitive Hashing compresses the original entity data into low-dimensional feature vectors. To improve matching precision, we randomly permute the low-dimensional feature vectors, re-sort the results, and pick entity pairs with a sliding window. Through the Locality Sensitive Hashing dimensionality reduction and binary serialization, our method effectively reduces computational complexity and network traffic overhead.
Experiments show that our algorithm matches entities efficiently while still maintaining high accuracy.

· Distributed random-based entity matching framework. Based on the foregoing randomized algorithm, we propose a high-speed distributed entity matching framework for unstructured data built on the MapReduce distributed computing model. The map phase uses the randomized algorithm for dimensionality reduction and feature transformation, and the reduce phase performs record extraction and comparison. The framework makes full use of the advantages of randomized algorithms while addressing the load balancing and network transmission overhead issues, so as to achieve high performance in a distributed environment. We present a detailed experimental analysis of how different parameters affect accuracy and efficiency, and compare our method with other state-of-the-art matching algorithms; the results show that our algorithm has a significant performance advantage.

· Two options for redundancy reduction. To solve the redundancy problem that arises in our randomized algorithm, we propose two solutions based on the principles of accuracy priority and efficiency priority, respectively. The first adds an extra MapReduce task that groups the existing result set to eliminate duplicate records, and so removes redundancy precisely. The second adds similarity check bits and compares them during the pair generation step to determine whether a record has already been generated, which reduces redundancy without introducing extra MapReduce tasks. We compare the advantages and disadvantages of the two methods, and compare our method with existing blocking-based methods. The randomness of the algorithm makes the redundancy elimination uncertain. We analyze and study this uncertainty and propose a universal solution model.
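
The matching pipeline described in the contributions (hash-based dimensionality reduction, random permutation, re-sorting, and a sliding window) can be illustrated with a small single-process sketch. This is not the thesis implementation. The abstract does not name the LSH family, so the sketch assumes SimHash-style bit signatures over whitespace tokens. The function names (`signature`, `candidate_pairs`, `match`) and the parameters (`num_bits`, `num_permutations`, `window`, `max_distance`) are hypothetical. Collecting pairs in a set stands in for the accuracy-first duplicate removal. In the MapReduce framework, the signature step would correspond to the map phase and the windowed comparison to the reduce phase.

```python
import hashlib
import random


def token_hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(text, num_bits=32):
    """SimHash-style signature (an assumed LSH family, not from the thesis)."""
    tokens = text.lower().split()
    bits = []
    for i in range(num_bits):
        # Every token votes +1 or -1 on bit i; the sign of the sum sets the bit.
        vote = sum(1 if token_hash(t, i) & 1 else -1 for t in tokens)
        bits.append(1 if vote >= 0 else 0)
    return tuple(bits)


def hamming(a, b):
    """Number of positions in which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def candidate_pairs(signatures, num_permutations=4, window=3, seed=0):
    """Permute bits, sort, and pair records that share a sliding window."""
    if not signatures:
        return set()
    rng = random.Random(seed)
    num_bits = len(next(iter(signatures.values())))
    pairs = set()  # a set drops pairs that several permutations produce
    for _ in range(num_permutations):
        order = list(range(num_bits))
        rng.shuffle(order)  # one random permutation of the bit positions
        permuted = sorted(
            (tuple(sig[j] for j in order), rid)
            for rid, sig in signatures.items()
        )
        ids = [rid for _, rid in permuted]
        for start, rid in enumerate(ids):
            # Compare each record with the next (window - 1) records.
            for other in ids[start + 1 : start + window]:
                pairs.add(tuple(sorted((rid, other))))
    return pairs


def match(records, max_distance=6, num_permutations=4, window=3, seed=0):
    """Return candidate pairs whose signatures are within max_distance bits."""
    sigs = {rid: signature(text) for rid, text in records.items()}
    candidates = candidate_pairs(sigs, num_permutations, window, seed)
    return {
        (a, b) for a, b in candidates if hamming(sigs[a], sigs[b]) <= max_distance
    }


if __name__ == "__main__":
    demo = {
        "r1": "acme corp new york office",
        "r2": "acme corp new york office",
        "r3": "notes on growing tomatoes in a cold climate",
    }
    print(sorted(match(demo)))
```

Sorting after each permutation places records with similar signatures near each other under a different bit ordering each time, so a small window catches pairs that a single ordering would miss. The cost is that the same pair can surface under several permutations, which is the redundancy the two elimination options address.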
Keywords/Search Tags: Distributed System, Entity Matching, Locality Sensitive Hash, Random Algorithm, Redundancy Elimination