Research And Application On Distributed Parallel String Similarity Join

Posted on: 2018-11-02
Degree: Master
Type: Thesis
Country: China
Candidate: W J Ruan
GTID: 2348330536952507
Subject: Computer Science and Technology
Abstract/Summary:
Similarity join is one of the basic operations in data mining and analysis, and it is widely applied in fields such as data cleaning, bioinformatics and information integration. Similarity join can be applied to data types such as strings, sets, vectors and graphs. Several measures exist for the similarity between objects of these data types, such as Jaccard distance, cosine distance and edit distance. This thesis focuses on similarity join over strings under edit distance, that is, finding all string pairs in the query string set whose edit distance is not greater than a given threshold. At present, most string similarity join algorithms are single-machine, in-memory algorithms, which require a lot of time when processing massive string data. The rise of distributed computing platforms makes it possible to solve massive string joins efficiently. Building on the good scalability and fault tolerance of the Spark parallel computing framework, this thesis moves the standalone computation to cluster mode and studies parallel string join.

First, based on a study of traditional string join techniques, a parallel processing framework for string similarity join is designed and implemented on the distributed computing framework Spark, and the parallelization process is presented and analyzed. On top of the data partitioning, the frequency vector information of each string is used to filter out strings that do not satisfy the similarity condition, which avoids a large number of useless computations. Experiments show that data parallelization and parallel computing can effectively improve the processing efficiency of massive string similarity joins.

Second, the string similarity join is optimized, covering both the parallel algorithm and the platform. For the parallel algorithm, the frequency vectors of the strings being joined are broadcast to reduce the amount of data transferred during the join. Because Spark computes in memory, data transmission in the cluster environment is the bottleneck of the Spark computing platform. This thesis therefore optimizes data locality in the task scheduling policy and reduces the communication overhead caused by data partitioning.

Finally, problems that can be solved with similarity join are analyzed, and the parallel join algorithm proposed in this thesis is applied to different practical applications to fully exploit the value of the data.
Keywords/Search Tags: string similarity join, edit distance, parallel computing, Spark