
Research On String Similarity Join Method Based On Hadoop Platform

Posted on: 2018-04-08
Degree: Master
Type: Thesis
Country: China
Candidate: L L Xia
GTID: 2348330536452510
Subject: Software engineering
Abstract/Summary:
With the widespread use and rapid development of Internet technologies such as e-commerce, social networks and cloud computing, the amount of data has increased dramatically, and large-scale data processing has become a hot research issue. String similarity join is a basic operation of data processing, with a wide range of applications in text retrieval, biological monitoring, information processing, pattern recognition, data integration and cleaning, and other fields. There are many measures of string similarity, including edit distance, Jaccard similarity and cosine similarity. This thesis mainly studies Jaccard similarity.

There are two types of methods for string similarity join: traditional methods, such as ALL-pairs, Ed-join and Trie-tree, and methods based on distributed frameworks, such as MRSimJoin, MR_DSJ and Fuzzy-Join. The traditional methods are studied and analyzed here. They are limited by machine memory, external storage, CPU and other resources, so they are not suitable for similarity joins over large-scale data, whereas the Hadoop distributed framework is one of the main ways of processing large-scale data. Therefore, this thesis studies how to process string similarity joins efficiently and in parallel on the Hadoop distributed framework. The main contributions are as follows:

(1) A similarity join model named SSJ-Model is proposed, which uses multiple filtering strategies and can process string similarity joins incrementally.
(2) An algorithm named Hmrdp-join, based on the Hadoop distributed framework, is proposed by applying SSJ-Model and studying the operating principles of the framework.
(3) The Hmrdp-join algorithm is optimized. The optimized algorithm saves some temporary results in the MapReduce phase and avoids the time cost of disk copying. It partitions the data more efficiently and balances the load of the map phase and the reduce phase. It also avoids repeated calculation in the similarity join by reusing existing information, and uses a grouping strategy to reduce the duplication of strings.
(4) Experiments were carried out on a real data set, and the experimental results were analyzed to demonstrate the efficiency of the optimized Hmrdp-join algorithm.
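To make the Jaccard-based join concrete, the following is a minimal single-machine sketch in Python. It is not the SSJ-Model or Hmrdp-join described in the abstract, whose details are not given here. The function names, the whitespace tokenization and the choice of a length filter are all assumptions made for illustration. The length filter relies on the fact that the Jaccard similarity of two sets can never exceed the ratio of the smaller set size to the larger one, so pairs that fail this bound can be skipped without computing the intersection.

```python
from itertools import combinations


def tokens(s):
    # Assumption: strings are compared as sets of lowercase, whitespace-separated words.
    return set(s.lower().split())


def jaccard(a, b):
    # Jaccard similarity |a ∩ b| / |a ∪ b|; two empty sets are treated as identical.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_join(strings, threshold):
    # Hypothetical helper: return all pairs whose Jaccard similarity is >= threshold.
    token_sets = [tokens(s) for s in strings]
    results = []
    for i, j in combinations(range(len(strings)), 2):
        a, b = token_sets[i], token_sets[j]
        small, large = sorted((len(a), len(b)))
        # Length filter: J(a, b) <= small / large, so prune when that bound is below the threshold.
        if small < threshold * large:
            continue
        sim = jaccard(a, b)
        if sim >= threshold:
            results.append((strings[i], strings[j], sim))
    return results


sample = ["big data join", "big data similarity join", "pattern recognition"]
pairs = similarity_join(sample, 0.7)
print(pairs)
```

This sketch compares every pair of strings, which is exactly the cost that filtering strategies and a distributed framework such as Hadoop are meant to reduce on large data sets. In a MapReduce setting, the candidate generation would typically be spread across map and reduce tasks rather than done in a single loop.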
Keywords/Search Tags:string similarity join, Hadoop, optimization, large-scale data, MapReduce