Research And Implementation Of Similarity Connection Algorithm For High-dimensional Data Based On Spark

Posted on:2020-12-18

Degree:Master

Type:Thesis

Country:China

Candidate:X H Cheng

Full Text:PDF

GTID:2438330572987384

Subject:Computer technology

Abstract/Summary:

The similarity join of high-dimensional data finds all vector pairs in a given high-dimensional data set whose distance, calculated with a given distance formula, is less than a given threshold. It is widely used in image similarity matching, text clustering, and friend recommendation. However, with the development of information technology, the amount of data has increased dramatically, and the similarity join of high-dimensional data faces many challenges. Studying similarity joins of high-dimensional data therefore helps to improve the efficiency of related applications. Through research on existing high-dimensional similarity join algorithms, we find that many of them suffer from data redundancy, repeated calculations, and high resource consumption, and that their experimental results are not ideal. To solve these problems, we present the SAVD algorithm in this paper. It combines piecewise aggregate approximation (PAA), symbolic aggregate approximation (SAX), and vertical data decomposition. The idea is to first represent the standardized data with PAA and SAX, then decompose it vertically and process each vertical partition with the filtering method proposed in this paper to obtain the candidate sets. Finally, we aggregate the candidate sets of all partitions to find every result that satisfies the distance requirement. The method addresses the problems in existing work and improves the execution efficiency of the algorithm. We also optimize the algorithm by using the triangle inequality to filter out unnecessary inter-vector calculations, which can greatly improve execution efficiency and reduce the complexity of the algorithm. To verify the efficiency of the proposed algorithm, we implemented it on both the MapReduce and Spark frameworks and compared it with existing algorithms on published data. The experimental results show that the proposed method is more efficient than the existing methods.

In addition, because data keeps growing in existing application scenarios, we extend the SAVD algorithm to incremental high-dimensional data sets. First, for the original data, we apply SAX dimension reduction and vertical decomposition and store the resulting data set at a specified location. Then, we apply the same dimension reduction and vertical decomposition to the incremental data set, and aggregate the result set with the intermediate output of the original data. Finally, we calculate the self-join of the incremental data set and its similarity join with the original data set. The experimental results show that the proposed incremental calculation method performs better than the direct similarity join algorithm.
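The PAA and SAX representation named in the abstract can be illustrated with a minimal Python sketch. This is the textbook form of both techniques, not the SAVD implementation: the function names (`znormalize`, `paa`, `sax`), the equal-width segments, and the Gaussian breakpoints are assumptions made for illustration.

```python
import bisect
import math
from statistics import NormalDist


def znormalize(vec):
    """Standardize a vector to zero mean and unit variance."""
    n = len(vec)
    mean = sum(vec) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in vec) / n)
    if std == 0:
        return [0.0] * n  # constant vector: every value maps to zero
    return [(x - mean) / std for x in vec]


def paa(vec, segments):
    """Piecewise aggregate approximation: the mean of each equal-width segment."""
    n = len(vec)
    if n % segments != 0:
        raise ValueError("this sketch assumes len(vec) is divisible by segments")
    width = n // segments
    return [sum(vec[i * width:(i + 1) * width]) / width for i in range(segments)]


def sax_breakpoints(alphabet_size):
    """Breakpoints that cut the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]


def sax(vec, segments, alphabet_size):
    """SAX word: the integer symbol (0 .. alphabet_size - 1) of each PAA segment."""
    breakpoints = sax_breakpoints(alphabet_size)
    reduced = paa(znormalize(vec), segments)
    return [bisect.bisect_right(breakpoints, value) for value in reduced]
```

For example, `sax([1, 2, 3, 4, 5, 6, 7, 8], segments=4, alphabet_size=4)` reduces an 8-dimensional vector to a 4-symbol word, one symbol per segment.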
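Vertical decomposition works for distance joins because squared Euclidean distance is a sum over dimensions, so it can be accumulated partition by partition. The sketch below shows only that general property on a single machine. The abstract does not describe the thesis's own filtering method, so the early-exit rule here is a stand-in, and `vertical_partitions` and `join_by_partitions` are hypothetical names.

```python
def vertical_partitions(vectors, num_parts):
    """Split every vector into num_parts contiguous column groups."""
    dim = len(vectors[0])
    bounds = [k * dim // num_parts for k in range(num_parts + 1)]
    return [
        [v[bounds[k]:bounds[k + 1]] for v in vectors]
        for k in range(num_parts)
    ]


def partial_sq_dist(a, b):
    """Squared Euclidean distance restricted to one column group."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def join_by_partitions(vectors, eps, num_parts):
    """Self-join: index pairs (i, j), i < j, with Euclidean distance < eps."""
    parts = vertical_partitions(vectors, num_parts)
    limit = eps * eps
    results = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total = 0.0
            for part in parts:
                total += partial_sq_dist(part[i], part[j])
                if total >= limit:
                    break  # partial sum already too large: pair cannot qualify
            else:
                results.append((i, j))  # all partitions summed, still below limit
    return results
```

In a distributed setting each column group could be processed separately and the partial sums aggregated afterwards; this sketch keeps everything in one loop for readability.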
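The triangle-inequality optimization can be sketched in its generic pivot-based form. For any pivot p, |d(a, p) - d(b, p)| is a lower bound on d(a, b), so a pair whose bound already reaches the threshold can be skipped without computing its exact distance. Whether SAVD chooses pivots this way is not stated in the abstract; the single pivot, the Euclidean metric, and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_self_join(vectors, eps, pivot=None):
    """Return (pairs, skipped): index pairs with distance < eps, and the
    number of exact distance computations avoided by the pivot filter."""
    if not vectors:
        return [], 0
    if pivot is None:
        pivot = vectors[0]  # assumption: any fixed reference point is valid
    # One distance per vector to the pivot, computed once up front.
    pivot_dist = [euclidean(v, pivot) for v in vectors]
    pairs = []
    skipped = 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            # Lower bound on d(v_i, v_j) from the triangle inequality.
            if abs(pivot_dist[i] - pivot_dist[j]) >= eps:
                skipped += 1
                continue
            if euclidean(vectors[i], vectors[j]) < eps:
                pairs.append((i, j))
    return pairs, skipped
```

The filter never discards a true result, because it only skips pairs whose lower bound is already at or above the threshold. How many computations it saves depends on the choice of pivot and on the data distribution.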

Keywords/Search Tags:

High-dimensional data, Similarity join, Piecewise aggregate approximation, Symbolic aggregate approximation, Vertically decomposed

Related items

1	Study Of Symbolic Aggregate Approximation For Time Series Classification
2	Aggregate Queries On Constrained Probabilistic Similarity Join Pairs
3	Provably Secure Aggregate Signature And Application
4	Data Stream Query Operator Algorithm
5	Design And Analysis Of Aggregate Signature Scheme Resistant To Collusion Attack
6	Time Series Similarity, Aggregate Top-k Query Algorithms And Applications
7	Research On Provably Secure Aggregate Signature Schemes And Their Applications
8	Research And Implementation Of The Aggregate-Join Query Optimization Approach Based On MapReduce
9	Research On Vector Approximation Method In High-dimensional Index Technology
10	Research On Key Techniques Of High Performance Spatial Query Processing For Large Scale Spatial Data