Research And Implementation Of Similarity Connection Algorithm For High-dimensional Data Based On Spark

Posted on: 2020-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: X H Cheng
GTID: 2438330572987384
Subject: Computer technology
Abstract/Summary:
A similarity join over high-dimensional data finds all vector pairs in a given high-dimensional data set whose distance, computed with a given distance formula, is less than a given threshold. It is widely used in image similarity matching, text clustering, and friend recommendation. However, with the development of information technology, the amount of data has increased dramatically, and the similarity join of high-dimensional data faces many challenges. Studying similarity joins of high-dimensional data therefore helps to improve the efficiency of related applications. Through research on existing high-dimensional similarity join algorithms, we find that many of them suffer from redundant data, repeated calculations, and high resource usage, and that their experimental results are not ideal. To solve these problems, we present the SAVD algorithm in this paper. It combines piecewise aggregate approximation (PAA), symbolic aggregate approximation (SAX), and vertical data decomposition. The idea is to first represent the standardized data with PAA and SAX, then decompose it vertically and process each vertical partition with the filtering method proposed in this paper to obtain candidate sets. Finally, we aggregate the candidate sets of all partitions to find every result pair that satisfies the distance requirement. The method addresses the problems of the existing work and improves the execution efficiency of the algorithm. We also optimize the algorithm by using the triangle inequality to filter out unnecessary inter-vector distance calculations, which can greatly improve execution efficiency and reduce the complexity of the algorithm. To verify the efficiency of the proposed algorithm, we implemented it in both the MapReduce and Spark frameworks and compared it with existing algorithms on a published dataset. The experimental results show that the proposed method is more efficient than the existing methods.
In addition, to handle the growing data found in existing application scenarios, we extend the SAVD algorithm to incremental high-dimensional data sets. First, for the original data, we apply SAX dimensionality reduction and vertical decomposition and store the resulting data set at a specified location. Then we apply the same dimensionality reduction and vertical decomposition to the incremental data set and aggregate the result with the intermediate output of the original data. Finally, we compute the self-join of the incremental data set and its similarity join with the original data set. The experimental results show that the proposed incremental method performs better than running the similarity join algorithm directly.
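The building blocks named in the abstract can be illustrated with a short, single-machine Python sketch. It is not the SAVD algorithm itself: the abstract does not specify the vertical decomposition or the proposed filtering method, so neither is reproduced here. The sketch only shows textbook PAA, textbook SAX (with equal-probability breakpoints under a standard normal distribution), and a generic pivot-based triangle inequality filter for a Euclidean self-join. All function names, the choice of pivot, and the assumption that the vector length is divisible by the number of segments are hypothetical choices made for illustration.

```python
import bisect
import math
from statistics import NormalDist


def z_normalize(v):
    """Standardize a vector to zero mean and unit (population) variance."""
    mean = sum(v) / len(v)
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / len(v))
    if std == 0:
        return [0.0] * len(v)
    return [(x - mean) / std for x in v]


def paa(v, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment.

    Assumption for this sketch: len(v) is divisible by `segments`.
    """
    if len(v) % segments != 0:
        raise ValueError("vector length must be divisible by segments")
    width = len(v) // segments
    return [sum(v[i * width:(i + 1) * width]) / width for i in range(segments)]


def sax(paa_values, alphabet_size):
    """Map PAA values to integer symbols 0..alphabet_size-1.

    Breakpoints split the standard normal distribution into
    equal-probability regions, as in textbook SAX.
    """
    nd = NormalDist()
    breakpoints = [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]
    return [bisect.bisect_right(breakpoints, x) for x in paa_values]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def self_join_with_pivot(vectors, eps, pivot):
    """Return index pairs with distance < eps, plus the number of exact
    distance computations that were not pruned.

    By the triangle inequality, |d(x, p) - d(y, p)| <= d(x, y), so a pair
    whose pivot distances differ by at least eps cannot qualify.
    """
    pivot_dist = [euclidean(v, pivot) for v in vectors]
    results, computed = [], 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if abs(pivot_dist[i] - pivot_dist[j]) >= eps:
                continue  # pruned without computing d(x, y)
            computed += 1
            if euclidean(vectors[i], vectors[j]) < eps:
                results.append((i, j))
    return results, computed


if __name__ == "__main__":
    series = [1, 2, 3, 4, 5, 6, 7, 8]
    print(sax(paa(z_normalize(series), 4), 4))
    points = [[0, 0], [0.5, 0], [10, 0], [10.2, 0]]
    print(self_join_with_pivot(points, 1.0, [0, 0]))
```

The pivot filter trades one distance computation per vector (to the pivot) for the chance to skip many pairwise computations; how much it prunes depends on the pivot and the data distribution.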
Keywords/Search Tags:High-dimensional data, Similarity join, Piecewise aggregate approximation, Symbolic aggregate approximation, Vertically decomposed