
Research And Implementation Of Similarity Join For Big Data

Posted on: 2016-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: S Z Deng
Full Text: PDF
GTID: 2428330542954607
Subject: Computer software and theory
Abstract/Summary:
The rapid development of computer science and technology has driven the growth of data, which brings not only heavy storage pressure but also great research and commercial value. To meet the needs of big data integration, researchers have proposed distributed file systems to store vast amounts of data, along with various computation frameworks. Among these, Spark, developed at UC Berkeley, currently stands out. Similarity joins play an important role in data integration and entity identification. Informally, a similarity join takes two or more data sources or relations as input and finds all record pairs whose similarity, under a given similarity function, exceeds a threshold. However, because of the explosive growth of datasets, the data usually does not fit in the main memory of one machine. To perform similarity joins on data of this scale, we must make use of clusters in distributed computing environments.

In this thesis we analyze existing similarity join algorithms on the real datasets of the thesis. First, we discuss index-based prefix filtering and implement a distributed similarity join on Spark. Based on the characteristics of distributed computing models, we propose an efficient solution named O-T to reduce the candidate size. We also describe a distributed RS Join based on Spark, transform AdapJoin from a single-machine model to a distributed one, and perform experiments to demonstrate its efficiency.

Second, we analyze the characteristics of the position filter and put forward PSJoin and PSJoin+, both of which use PS (short for prefix and suffix) information. To estimate the upper bound of similarity, PS uses the positions of the common tokens that appear in both the prefix of one record and the suffix of another. Experiments show that PSJoin and PSJoin+ perform well on Spark.

Last, we study the token-weight filter in depth. After analyzing in detail the characteristics of token-weight computation and the relation between the weight filter and the prefix filter, we propose three filtering methods based on token weight, namely WTBFilter, WTFilter, and WTPFilter, which can also be used in distributed computing environments and on Spark. In particular, WTPFilter, which relies on the weight values, reduces the candidate size efficiently. Both WTJoin and WTPJoin perform well in the experiments. In summary, we use Spark to implement all of the prefix-filtering-based algorithms and demonstrate their efficiency experimentally.
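
To make the prefix-filtering idea concrete, the following is a minimal single-machine sketch in plain Python. It is not the thesis's Spark implementation and does not reproduce O-T, PSJoin, or the weight-based filters. Several details are assumptions made only for illustration: Jaccard is used as the similarity function, pairs are kept when similarity is at least the threshold, tokens are ordered globally by ascending frequency, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two token sets (assumed similarity function)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def prefix_filter_self_join(records, threshold):
    """Return sorted (i, j, sim) with i < j and Jaccard(records[i], records[j]) >= threshold.

    Hypothetical helper for illustration; records is a list of token collections.
    """
    # Global token order: rarest tokens first, ties broken alphabetically.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of earlier records whose prefix contains it
    results = []
    for rid, tokens in enumerate(ordered):
        if not tokens:
            continue
        # Jaccard prefix length: |x| - ceil(t * |x|) + 1.
        # The small epsilon guards against float rounding making the prefix too short.
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens) - 1e-9) + 1
        prefix = tokens[:prefix_len]

        # Candidates are earlier records that share at least one prefix token.
        candidates = set()
        for tok in prefix:
            candidates.update(index[tok])
        for tok in prefix:
            index[tok].append(rid)

        # Verify each candidate with the exact similarity.
        for cid in candidates:
            sim = jaccard(set(tokens), set(ordered[cid]))
            if sim >= threshold:
                results.append((cid, rid, sim))
    return sorted(results)


records = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"x", "y", "z"},
    {"a", "b", "c", "d"},
]
pairs = [(i, j) for i, j, _ in prefix_filter_self_join(records, 0.5)]
```

Only records that share a token within their prefixes are ever compared, so the third record above is never verified against the others. In a distributed setting the same inverted index could be built by grouping records on their prefix tokens, but that mapping is an assumption here rather than a description of the thesis's design.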
Keywords/Search Tags:Similarity Join, Spark, Distributed Computing, Filtering, Weight