
Research on Efficient Storage, Query and Cluster Analysis of Massive Spatiotemporal Data

Posted on: 2022-05-25
Degree: Master
Type: Thesis
Country: China
Candidate: S K Guo
Full Text: PDF
GTID: 2518306605967619
Subject: Master of Engineering
Abstract/Summary:
In recent years, with the development of big data technology, the volume of collected spatio-temporal data has grown rapidly. Although spatio-temporal big data management platforms have begun to address the problems caused by this scale, shortcomings remain. In terms of storage and query, most indexes used for spatio-temporal data are extended R-tree structures, whose index updates become slow when the write volume is huge, which directly affects write and query efficiency. In terms of cluster analysis, the traditional DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm runs serially on a single machine, so clustering a huge amount of data is inefficient or even infeasible. To address these problems, the work of this thesis is summarized as follows.

Firstly, to support ID-temporal, ID-spatial, and ID-spatio-temporal queries over spatio-temporal data, this thesis combines the S2 spatial dimensionality-reduction algorithm with the HBase database. It first designs an overall load-balancing scheme and then proposes three storage models and query schemes that differ from the R-tree structure. Scheme one relies on a carefully designed row key and HBase scans to speed up reading and writing of spatio-temporal data, and efficiently serves ID-temporal queries. Schemes two and three combine the S2 algorithm with the HBase row key design to efficiently serve ID-spatial and ID-spatio-temporal queries. Finally, each scheme was verified experimentally. For load balancing, the number of writes and the size of the generated storage files are essentially the same for each HBase Region, which meets the design goal. For correctness, the query results were plotted on AMap, and the rendering shows that the schemes return correct results. To evaluate efficiency, a MySQL-based alternative was implemented for each scheme, and write and query efficiency were compared. The results show that every scheme outperforms its MySQL counterpart in both writing and querying. For writes, the average write efficiency per 100,000 records of schemes 1, 2, and 3 is 3.3, 23.2, and 15.4 times that of the MySQL scheme, respectively. For queries under the same conditions, the query efficiency of schemes 1, 2, and 3 is 3.2, 5.2, and 3.4 times that of the MySQL scheme, respectively.

Secondly, on this spatio-temporal management platform, this thesis addresses the clustering of massive data by using the Spark computing engine to parallelize the DBSCAN algorithm in a distributed manner. This removes the bottleneck that a single machine cannot complete the clustering of massive data, and it also improves computational efficiency. The algorithm's performance is evaluated in a real distributed cluster environment. Experiments show that, in a single-machine environment with 32 CPUs and 2.56 million records, the efficiency of the Spark DBSCAN algorithm is improved by 160.48 times compared with the DBSCAN algorithm. In a distributed cluster of 3 servers, compared with the pseudo-distributed mode, the Spark DBSCAN algorithm can complete the clustering of 4 million records, and its efficiency is improved by 5.02 times when the input is 2.56 million records. In addition, this thesis analyzes the complexity of the algorithm and gives specific, experimentally supported suggestions for parameter tuning.

Thirdly, the distributed, parallelized DBSCAN algorithm is used to cluster and analyze 457,776 boarding records of Chengdu cabs, and the clustering results are shown on a map. The experimental results show that: 1) the cluster covering the widest spatial range and containing the most data is near Chunxi Road; 2) clusters form in densely populated places such as airports, hospitals, schools, and shopping malls, whereas few or no clusters appear in places with relatively little traffic, such as residential areas. The experimental findings are consistent with expectations, indicating the effectiveness and feasibility of the scheme. In addition, compared with the empirical judgment of cab drivers, this method can accurately output boarding-point data over a period of time, including the location and number of boarding points, which can better assist cab operations.
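
The abstract does not give the actual row key layout, so the following is only a minimal sketch of the general idea behind scheme one: a salt prefix spreads writes across regions, and an ID followed by a fixed-width timestamp keeps one object's records contiguous so that an ID-temporal query becomes a single range scan. The layout, the `NUM_BUCKETS` value, the `|` delimiter, and the in-memory `SortedStore` (a stand-in for a table sorted by row key, not the HBase API) are all assumptions made for illustration. IDs are assumed not to contain the delimiter.

```python
import bisect
import hashlib

NUM_BUCKETS = 8  # hypothetical number of pre-split regions


def salt_for(object_id: str) -> int:
    """Stable bucket for an ID, so all rows of one object share a prefix."""
    digest = hashlib.md5(object_id.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def row_key(object_id: str, ts_seconds: int) -> bytes:
    """Hypothetical layout: 2-digit salt | ID | zero-padded timestamp."""
    key = f"{salt_for(object_id):02d}|{object_id}|{ts_seconds:010d}"
    return key.encode("ascii")


def scan_bounds(object_id: str, start_ts: int, end_ts: int):
    """Start (inclusive) and stop (exclusive) keys for an ID-temporal scan."""
    return row_key(object_id, start_ts), row_key(object_id, end_ts + 1)


class SortedStore:
    """In-memory stand-in for a table whose rows are sorted by row key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key: bytes, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start: bytes, stop: bytes):
        """Return (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]
```

Because the start and stop keys share the same salt and ID prefix, every key between them belongs to the same object, so the scan never touches other objects' rows. A spatial variant would, under the same assumptions, place a cell identifier from a space-filling curve such as S2 in the key; that is not shown here.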
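
For readers unfamiliar with the serial DBSCAN algorithm mentioned in the abstract, here is a minimal single-machine sketch. It is not the Spark-parallelized version developed in the thesis. It assumes planar coordinates and Euclidean distance (real longitude/latitude data would need a geodesic distance), and the names `eps` and `min_pts` are the conventional DBSCAN parameters rather than values taken from the thesis. The brute-force neighbor search is quadratic in the number of points, which illustrates why a serial implementation struggles at large scale.

```python
from math import hypot

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if hypot(x - xi, y - yi) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels
```

With two tight groups of three points and one isolated point, `dbscan(points, eps=1.5, min_pts=3)` assigns one label per group and marks the isolated point as noise.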
Keywords/Search Tags: Spatio-temporal big data, HBase, DBSCAN, Spark