Research On String Similarity Join Method Based On Hadoop Platform

Posted on:2018-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:L L Xia

Full Text:PDF

GTID:2348330536452510

Subject:Software engineering

Abstract/Summary:

With the widespread use and rapid development of internet technologies such as e-commerce, social networks and cloud computing, the amount of data has increased dramatically, and large-scale data processing has become a hot research issue. String similarity join is a basic data-processing operation with a wide range of applications in text retrieval, biological monitoring, information processing, pattern recognition, data integration, data cleaning and other fields. There are many character-based similarity measures, including edit distance, Jaccard similarity and cosine similarity. This thesis mainly studies Jaccard similarity.

There are two types of string similarity join methods: traditional methods and methods based on distributed frameworks. Traditional methods include ALL-pairs, Ed-join and Trie-tree. Methods based on distributed frameworks include MRSimJoin, MR_DSJ and Fuzzy-Join. The traditional methods are studied and analyzed: because they are limited by machine memory, external storage, CPU and other resources, they are not suitable for large-scale similarity joins, whereas the Hadoop distributed framework is one of the main ways of processing large-scale data. Therefore, this thesis studies how to process string similarity joins efficiently and in parallel on the Hadoop distributed framework.

The main contributions of the thesis are as follows:

(1) A similarity join model named SSJ-Model is proposed, which uses multiple filtering strategies and can process string similarity joins incrementally.

(2) An algorithm named Hmrdp-join, based on the Hadoop distributed framework, is proposed by applying SSJ-Model and studying the operating principles of Hadoop.

(3) The Hmrdp-join algorithm is optimized. The optimized algorithm saves some temporary results in the MapReduce phase and avoids the time cost of disk copying. It divides the data more efficiently and balances the load of the map phase and the reduce phase. It also avoids repeated calculation in the similarity join by using existing information, and uses a grouping strategy to reduce the duplication of strings.

(4) Experiments were carried out on a real data set, and the experimental results were analyzed to demonstrate the efficiency of the optimized Hmrdp-join algorithm.
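
For readers unfamiliar with the Jaccard measure studied here, the following is a minimal single-machine sketch of a Jaccard similarity self-join. It is not the SSJ-Model or Hmrdp-join algorithm described in the abstract. It assumes strings are tokenized on whitespace, and the function names (`tokenize`, `jaccard`, `similarity_self_join`) are hypothetical. The only filter shown is the standard length filter, which relies on the fact that the Jaccard similarity of two sets can never exceed the ratio of the smaller set size to the larger one.

```python
from itertools import combinations


def tokenize(s):
    """Split a string into a set of lowercase whitespace-delimited tokens (an assumption)."""
    return frozenset(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def similarity_self_join(strings, threshold):
    """Return (i, j, sim) for all index pairs i < j with Jaccard >= threshold."""
    token_sets = [tokenize(s) for s in strings]
    results = []
    for i, j in combinations(range(len(strings)), 2):
        a, b = token_sets[i], token_sets[j]
        small, large = sorted((len(a), len(b)))
        # Length filter: Jaccard <= small / large, so skip pairs that cannot qualify.
        if large and small < threshold * large:
            continue
        sim = jaccard(a, b)
        if sim >= threshold:
            results.append((i, j, sim))
    return results


if __name__ == "__main__":
    data = ["big data join", "big data joins", "data join big", "hadoop mapreduce"]
    print(similarity_self_join(data, 0.6))
```

This sketch compares every pair of strings, which is exactly the quadratic cost that filtering strategies and distributed frameworks such as Hadoop aim to reduce on large data sets.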

Related items

1	The Research And Application Of Parallel Similarity Join For Large-scale Strings
2	Join Processing And Optimizing On Large Data Sets Based On Hadoop Framework
3	Research On String Edit Similarity Join
4	Research On Improvement Of Similarity Join In MapReduce
5	Optimizing Top-k String Similarity Join Algorithm
6	Research On String Similarity Join Algorithm
7	Research And Application On Distributed Parallel String Similarity Join
8	Design And Optimize Big-Data Join Algorithms Using MapReduce
9	Research On Complex Distance Measure Based MapReduce Similarity Join Techniques
10	Optimum Design Of Table Join Algorithm Based On MapReduce