Research And Implementation Of Similarity Join For Big Data

Posted on:2016-02-16

Degree:Master

Type:Thesis

Country:China

Candidate:S Z Deng

GTID:2428330542954607

Subject:Computer software and theory

Abstract/Summary:

The rapid development of computer science and technology has driven explosive data growth, which creates heavy storage pressure but also offers great research and commercial value. To meet the needs of big data integration, researchers have proposed distributed file systems to store vast amounts of data, along with a variety of computation frameworks. Among these frameworks, Spark, developed at UC Berkeley, is currently considered the best. Similarity join plays an important role in data integration and entity identification. Informally, a similarity join takes two or more data sources or relations as input and finds all record pairs whose similarity, as measured by a similarity function, exceeds a given threshold. However, because of the explosive growth of datasets, the data usually does not fit in the main memory of a single machine. To perform similarity joins over such volumes of data, we must use clusters in a distributed computing environment. In this thesis we analyze existing similarity join algorithms on the real datasets used in the thesis. First, we discuss index-based prefix filtering and implement a distributed similarity join on Spark. Based on the characteristics of distributed computing models, we propose an efficient solution named O-T to reduce the candidate size. We also describe a distributed RS Join based on Spark, adapt AdapJoin from a single-machine model to a distributed model, and run experiments to demonstrate its efficiency. Second, we analyze the characteristics of the position filter and put forward PSJoin and PSJoin+, both of which use PS (short for prefix and suffix) information. To estimate an upper bound on similarity, PS uses the positions of the common tokens that appear in the prefix of one record and the suffix of the other. Experiments show that PSJoin and PSJoin+ perform well on Spark. Last, we study the token-weight filter in more depth. After analyzing in detail the characteristics of token-weight computation and the relation between the weight filter and the prefix filter, we propose three filtering methods based on token weight, namely WTBFilter, WTFilter and WTPFilter, which can also be used in distributed computing environments and on Spark. In particular, WTPFilter, which relies on the weight values, reduces the candidate size efficiently. Both WTJoin and WTPJoin perform well in the experiments. Overall, we make full use of Spark to implement all of the algorithms based on prefix filtering and demonstrate their efficiency experimentally.
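
To make the prefix-filtering idea mentioned in the abstract concrete, the following is a minimal single-machine sketch in Python. It is not the thesis's Spark implementation. It assumes Jaccard similarity over token sets (the abstract does not name the similarity function) and the standard prefix length |r| - ceil(t * |r|) + 1 for threshold t. The names jaccard and prefix_filter_self_join are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix_filter_self_join(records, threshold):
    """Return {(i, j): similarity} for all i < j with Jaccard >= threshold.

    Assumption: records are lists of tokens. Empty records are skipped.
    """
    # Global token order: rarest tokens first, ties broken alphabetically.
    # Rare tokens in the prefix keep the inverted lists short.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records with that token in their prefix
    results = {}
    for rid, rec in enumerate(ordered):
        if not rec:
            continue
        # Two records with Jaccard >= t must share a token in their prefixes.
        # The small epsilon guards against floating-point round-up in t * |r|.
        prefix_len = len(rec) - ceil(threshold * len(rec) - 1e-9) + 1
        candidates = set()
        for tok in rec[:prefix_len]:
            candidates.update(index[tok])  # earlier records sharing this prefix token
            index[tok].append(rid)
        # Verification step: compute the exact similarity for each candidate.
        for cid in candidates:
            sim = jaccard(rec, ordered[cid])
            if sim >= threshold:
                results[(cid, rid)] = sim
    return results


if __name__ == "__main__":
    docs = [
        ["spark", "similarity", "join", "big", "data"],
        ["spark", "similarity", "join", "big", "dataset"],
        ["token", "weight", "filter"],
        ["similarity", "join", "big", "data", "spark"],
    ]
    print(sorted(prefix_filter_self_join(docs, 0.6)))
    # [(0, 1), (0, 3), (1, 3)]
```

In this sketch the third record is never compared with any other, because none of its prefix tokens appears in another record's prefix. Pruning pairs before verification in this way is what reduces the candidate size. A distributed version would typically group records by prefix token across workers, which this sketch does not attempt.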

Related items

1	DV-Join:A Novel Method Based On Dynamic Indexing And VMC-Filtering For Similarity Join
2	Implementation And Evaluation Of Big Data Parallel Join Algorithms
3	Research And Application On Distributed Parallel String Similarity Join
4	Optimization Scheme And Implementation Of Join Operation In Spark Computing Engine
5	The Research And Application Of Parallel Similarity Join For Large-scale Strings
6	A Study On Spark-based Distributed Collaborative Filtering And Its Tools
7	Research On Query Analysis And Optimization Based On Spark System
8	Research And Implementation Of Multi-Way Join Query Processing Algorithms Over Big Spatial Data In Cloud Environment
9	Implementation And Optimization For Join Operation In Spark
10	Research On Optimizing Top-k Join Queries Based On Spark