Font Size: a A A

The Research Of Similarity Join Based On MapReduce

Posted on:2015-11-18Degree:MasterType:Thesis
Country:ChinaCandidate:Y Y XuFull Text:PDF
GTID:2298330422493080Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the wide application and the development of information technology, such as the socialnetwork, mobile applications and online services, which cause the data increasing explosively, theanalysis of huge data needs powerful computing ability. As a basic operation of data analysis,similarity join can be used to improve the computing speed and computational efficiency greatlyon similarity search and data mining. Similarity join is similar to database connection; thedifference is that for different data types, similarity join adopts the corresponding measurefunctions and different threshold limited, through the measure function and then the correspondingjoinoperation.The processing capacity of a single computer and traditional technology architecture has beendifficult to meet the calculation requirements of massive data; the appearance of mapreducebrings light for the similarity join. Now, Similarity join technology in mapreduce has achievedgood results, but there are some problems: the processing speed is not fast enough; the processingdata type is relatively single, dealing with the dynamic data less effectively, and so on. Aiming atthe problem of the processing speed, this paper proposes the improved partition-based algorithmand the improved prefix-filtering-based algorithm that can improve the computation efficiency ofsimilarityjoin. Thefollowingisthemain researchcontent:1. Adopting the strategy of divide and conquer, taking example by QuickJoin algorithm, thispaper proposes the improved partition-based algorithm (MR_MinPrefix algorithm), whichdecomposes the massive data into several smaller data set, and spreads them to the mapreducedistribution cluster, thendoes the corresponding similarity join operations.Themain contentare:(1)before the partition operation, sampling the original data set, calculating the effective clustercenters (or central), then partitioning the original data set by making use of the effective centralforming the partition(its size is not over the block size that a single mode can calculate). And inorder to use the data from the data’s calculation process effectively, avoiding the repeatedcalculation. In this paper, it saves the intermediate data by indexing technology, which is thatestablishing the K-D tree index for the partition whose size meets the requirement, then gets all similarity pairs. The experiments show that this method can reduce the numbers of partition thedata, and decrease the frequency of verification, comparing to the previous algorithm its runningefficiency has obvious improvement.(2) In real world, there is dynamicdata widely, treating part ofthe data as the incremental data on original data, aiming at the new data’s similarity operation, thispaper sets the corresponding allocation principle to find the corresponding partition for every newdata,eventually obtainsnewdata’s similaritypairs.2. In investigating near-duplicated web pages, blocking malicious AD, recommending similarusers, and so on, set similarity join technology is widely used. Typically set similarity jointechnology used the filter-verify computing framework, through the prefix filtering technology toshorten the candidate set list, but under the mapreduce platform these algorithms generate a largenumber of candidate set, and increase the time of verifying. This paper proposes aprefix-filtering-based similarity join algorithm in mapreduce(MRSJ_PDS algorithm), which usesthe min-prefix filtering technology to prune the index of tokens effectively, then saves the relativeinformation of records to the specified file which is useful for the coming data. For the incrementaldata’s similarity join operation, it adopts the strategy of propagation delay, delays to update theglobal token frequency, index list and other relevant information, and finally obtains all similaritypairs.
Keywords/Search Tags:Similarity Join, MassiveData, MapReduce
PDF Full Text Request
Related items