The Research And Application Of Parallel Similarity Join For Large-scale Strings

Posted on: 2017-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: J Wang
GTID: 2308330503953783
Subject: Software engineering
Abstract/Summary:
With the development of the Internet, information exchange has become more frequent, and the sheer volume of information leaves people feeling helpless. How to obtain the most needed content quickly and accurately from massive amounts of information is a serious problem. String similarity join technology is the most direct solution. String similarity join has profound practical significance, with wide application in text retrieval, bioinformatics, signal processing, intrusion detection and other fields.

This paper focuses on the problem of similarity join over large-scale strings and proposes two parallel solutions to it. First, the paper studies string similarity techniques in depth, surveying the different methods for measuring similarity. The methods are divided into two classes according to the objects they process. Then the advantages and disadvantages of these algorithms are discussed, and more efficient parallel string join methods are proposed. The main contributions of this paper are as follows:

(1) We study string similarity join techniques. Existing methods are very inefficient when dealing with large-scale strings, and they often run into out-of-memory errors.

(2) We propose Para-Join, a new in-memory parallel join method. First, the data set is partitioned into several disjoint subsets according to the interval-vector of each string. To perform joins within a single subset and between two different subsets, this paper also proposes two algorithms, Para-RR and Para-RS, based on the partition framework. Para-Join guarantees that the results are complete and introduces no redundancy. It implements the string similarity join with multi-threaded programming, which improves the efficiency of the join.

(3) To address the problem of insufficient memory in Para-Join, this paper builds on Para-Join to propose Spss-Join, a parallel join algorithm based on the Spark framework that makes up for this deficiency. Spss-Join acquires the token set automatically and does not require the number of threads to be specified explicitly, which makes it more flexible and adaptable to more applications and environments. Spss-Join can deal effectively with large-scale data.

(4) A system prototype based on the Spark framework is designed, combining the advantages of Para-Join and Spss-Join.

Theoretical analysis and experimental results show that Para-Join is a more efficient algorithm, and that Spss-Join not only inherits the efficiency of Para-Join but also makes it possible to deal with large-scale strings...
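The partition-then-join idea in the abstract can be illustrated with a small Python sketch. It is not the thesis's Para-Join: the abstract does not define the interval-vector, so the sketch swaps in a simpler and well-known partitioning key, the size of each string's token set, together with the standard length filter for Jaccard similarity. The similarity measure (Jaccard over whitespace tokens) and every name here (`tokens`, `jaccard`, `partition`, `join_within`, `join_between`, `parallel_self_join`, `width`, `workers`) are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations, product


def tokens(s):
    """Assumed tokenization: lower-cased whitespace-separated words."""
    return frozenset(s.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (two empty sets count as equal)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def partition(toks, width):
    """Split record ids into disjoint buckets keyed by token-set size // width."""
    buckets = {}
    for i, t in enumerate(toks):
        buckets.setdefault(len(t) // width, []).append(i)
    return buckets


def join_within(ids, toks, threshold):
    """Join inside a single bucket: every unordered pair is checked once."""
    return [(i, j) for i, j in combinations(ids, 2)
            if jaccard(toks[i], toks[j]) >= threshold]


def join_between(ids_a, ids_b, toks, threshold):
    """Join across two different buckets."""
    return [(min(i, j), max(i, j)) for i, j in product(ids_a, ids_b)
            if jaccard(toks[i], toks[j]) >= threshold]


def parallel_self_join(strings, threshold, width=2, workers=4):
    """Return sorted id pairs whose Jaccard similarity is >= threshold (> 0)."""
    toks = [tokens(s) for s in strings]
    buckets = partition(toks, width)
    keys = sorted(buckets)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(join_within, buckets[k], toks, threshold)
                   for k in keys]
        for a, b in combinations(keys, 2):  # a < b, so bucket a holds smaller sets
            largest_in_a = a * width + width - 1
            smallest_in_b = b * width
            # Length filter: Jaccard(A, B) <= |A| / |B| when |A| <= |B|,
            # so bucket pairs that cannot reach the threshold are skipped.
            if largest_in_a >= threshold * smallest_in_b:
                futures.append(pool.submit(join_between, buckets[a],
                                           buckets[b], toks, threshold))
        pairs = [p for f in futures for p in f.result()]
    return sorted(pairs)


if __name__ == "__main__":
    data = [
        "apple banana cherry",
        "apple banana cherry date",
        "banana cherry apple",
        "x y",
        "completely different words here now",
    ]
    print(parallel_self_join(data, 0.7))  # [(0, 1), (0, 2), (1, 2)]
```

Because the buckets are disjoint and each bucket, and each pair of buckets, is handled by exactly one task, every candidate pair is compared at most once, and only pairs ruled out by the length filter are skipped. This mirrors the completeness and non-redundancy property the abstract claims, under the stated assumptions. Note that in CPython the global interpreter lock limits the speed-up that threads give for CPU-bound work like this, so the sketch shows the task structure rather than realistic performance.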
Keywords/Search Tags: string, similarity join, parallel, multi-threading, Spark