Font Size: a A A

Research And Implementation Of The Big Spatial Data Join Query Processing Algorithms In Cloud Environment

Posted on:2015-01-26Degree:MasterType:Thesis
Country:ChinaCandidate:Y ChenFull Text:PDF
GTID:2308330482957274Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the rapid development of earth observation and location-based service technology, the volume of spatial data has greatly increased, which has become one kind of big data, and how to use cloud computing to effectively process spatial join query has been one of the hotspots in the field of spatial data management. Existing MapReduce-based spatial join query processing algorithm, such as PPBSM, has the problem of weak filtration and excessive replication and also needs duplication avoidance, which brings extra CPU and I/O costs. For this, according to the existing problem in the spatial join query, this article performes in-depth research and do work in the following aspects.Firstly, for the existing excessive replication and other issues in the MapReduce-based spatial join query algorithm, this article proposes two spatial join query optimization algorithm, namely filter based spatial join query optimization algorithm (FBSJ) and grid expansion based spatial join query optimization algorithm (GEBSJ). FBSJ algorithm uses the MBR of an internal shrunken grid to filter useless object which crosses multiple grid units, and so reduces the computing cost in the Reduce phrase. GEBSJ algorithm uses R-tree based grid index to filter useless join object, which also reduces some part of data replication. This algorithm uses a class of data preprocessing to expand the original grid and then build the grid index, and uses another class of data to search the grid index to replicate the data. So it can reduce the volume of data replication and need no duplication avoidance, optimizing the algorithm from the perspective of CPU and I/O separately.Secondly, this article proposes a kind of distributed R*-tree based spatial join query processing optimization algorithm. This article introduces distributed R*-tree and adapt it. It firstly followes the Hilbert space filling curve data coding and make spatial data uniform division, and then constructes local R*-tree index in each partition, finally it buildes a global index information file with the information of each local index. When executing join operation, if two types of data sets are indexed, it can generate the task queue through their global information file, and then the R*-tree based join query processing can be performed; if there is only one type of data index, then it can take advantage of the global index information to help to assign those data with no index to the Reducer where local indexes lie, and then perform R*-tree based join query processing. Because of the distributed index, compared with algorithm PPBSM which processes data with no index, the algorithm can improve the query performance by 38.5%.Finally, with a large number of experiments on real data sets and synthetic data sets, we test and analyze the proposed two types of spatial join query processing algorithm. Experiment results show that the proposed two join query optimization algorithms with no data index perform much better than PPBSM algorithm; The use of distributed index in spatial join query processing algorithm can greatly speed up spatial join query processing and could achieve good performance and adaptability.
Keywords/Search Tags:Spatial Join Query, Distrubuted R~*-tree, Replication Optimization, MapReduce, Cloud Computing
PDF Full Text Request
Related items