
Research On Distributed Spatial Join Algorithms For Large Scale Data

Posted on: 2022-06-09  Degree: Master  Type: Thesis
Country: China  Candidate: R B Wang  Full Text: PDF
GTID: 2518306740462514  Subject: Computer Science and Technology
Abstract/Summary:
With the spread of mobile devices and the development of satellite positioning systems, massive volumes of spatial data are being produced. Large-scale spatial data is rich in value, which makes spatial data analysis and mining important work. Spatial join is a basic operator of spatial data analysis with a wide range of application scenarios. However, existing distributed implementations of this operation are imperfect. Distributed spatial join follows a divide-and-conquer approach: the whole spatial extent is first divided into several small spatial partitions, and the data in each partition is then processed in parallel on a distributed cluster using a single-machine spatial join algorithm. Existing techniques have several weaknesses. The spatial extent they choose is too large, which results in many wasted computations. The spatial partitioning does not take the spatial distributions of both datasets into account, which causes load imbalance. Many implementation details of the parallel computation can still be optimized. In addition, support for the different types of spatial join and of spatial data is incomplete. To address these problems, this thesis presents a comprehensive and detailed study of distributed spatial join:

(1) A distributed spatial distance join algorithm is proposed. First, the extent of the whole spatial area is narrowed, and invalid data that cannot contribute to the final result is filtered out efficiently. Second, the global domain is partitioned separately according to each of the two datasets, and the two resulting partitionings are combined into a single set of spatial partitions that reflects the spatial distributions of both datasets, so as to achieve load balancing in distributed computation. In addition, a dedicated optimization for spatial distance self-join is introduced. Finally, comparative experiments are carried out on global spatial data, and the results show that the proposed spatial distance join outperforms existing techniques.

(2) A distributed spatial k-nearest-neighbor join algorithm is presented. First, a two-round computation scheme for k-nearest-neighbor join is given: the first round obtains the minimum expansion distance of each spatial object, and the second round obtains the exact join result. The shortcomings of the two-round computation are then analyzed, and an optimization strategy is given that greatly reduces network data transfer and unnecessary computation. Finally, comparative experiments are carried out on global spatial data. The results show that the proposed spatial k-nearest-neighbor join algorithm outperforms existing techniques and that the proposed optimization strategy has a clear effect. The proposed k-nearest-neighbor join supports all types of spatial data and is therefore highly general.

(3) The proposed algorithms are implemented on the Spark distributed computing framework and packaged as an API. First, the proposed distributed spatial join algorithms are implemented using the API provided by Spark. The implementation is then encapsulated as an API for third-party use, including an RDD-based interface built on Spark Core and an SQL-statement interface built on Spark SQL.
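
The partition-then-join idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis's algorithm: a uniform grid stands in for the combined partitioning of the two datasets, a plain loop stands in for parallel Spark tasks, and the names `distance_join`, `r_points`, `s_points`, `d`, and `cell` are hypothetical. Each point of the first dataset is assigned to one grid cell, each point of the second dataset is copied to every cell that its distance-expanded box overlaps, and each cell is then joined locally.

```python
from collections import defaultdict
from math import floor, hypot


def distance_join(r_points, s_points, d, cell):
    """Return all (r, s) pairs whose Euclidean distance is at most d.

    Illustrative sketch only: a uniform grid with side length `cell`
    replaces the data-aware partitioning described in the abstract.
    """
    # Assign every R point to exactly one grid cell, so no pair is reported twice.
    r_cells = defaultdict(list)
    for r in r_points:
        r_cells[(floor(r[0] / cell), floor(r[1] / cell))].append(r)

    # Copy every S point to each cell overlapped by its d-expanded bounding box.
    # Cells that hold no R point are skipped: they cannot produce a result.
    s_cells = defaultdict(list)
    for s in s_points:
        x_lo, x_hi = floor((s[0] - d) / cell), floor((s[0] + d) / cell)
        y_lo, y_hi = floor((s[1] - d) / cell), floor((s[1] + d) / cell)
        for cx in range(x_lo, x_hi + 1):
            for cy in range(y_lo, y_hi + 1):
                if (cx, cy) in r_cells:
                    s_cells[(cx, cy)].append(s)

    # Local join per cell; on a cluster each cell would be an independent task.
    result = []
    for key, rs in r_cells.items():
        for r in rs:
            for s in s_cells.get(key, ()):
                if hypot(r[0] - s[0], r[1] - s[1]) <= d:
                    result.append((r, s))
    return result
```

Skipping cells that contain no point of the first dataset is a simple form of the filtering step mentioned in point (1): data that cannot contribute to the result is never sent to a partition. A uniform grid does nothing for skewed data, which is the load-balancing problem the thesis addresses with its combined partitioning.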
Keywords/Search Tags: Distributed Computing, Spatial Join, Spatial Partition, Spatial Data, k Nearest Neighbors