
Research On Spatial Parallel Computation And Adaptive Parameter Tuning Based On Spark

Posted on: 2020-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Peng
Full Text: PDF
GTID: 2480305897967529
Subject: Cartography and Geographic Information System
Abstract/Summary:
Big Data has been a popular research topic in recent years and is widely applied across many disciplines. In the GIS field, the focus is on spatio-temporal Big Data, and an essential prerequisite for mining and utilizing it is efficient spatial Big Data computation. The mainstream Big Data frameworks, such as Hadoop, Spark, and Storm, do not natively support the processing of spatial data, while systems built on top of Hadoop or Spark, such as GeoSpark, GeoMesa, SpatialHadoop, and Hadoop-GIS, either support only an incomplete set of spatial computation methods or perform spatial computation inefficiently. We therefore explore the systematization of spatial computation based on Spark and GeoSpark, study the tuning of Spark parameters, optimize the implementation of spatial computation in GeoSpark, and, given the intricacy of manual parameter tuning, propose a solution for Spark adaptive parameter tuning. The main contributions are as follows:

1) We discuss the tuning of several important parameters, including memory size, number of cores, number of partitions, and data persistence level, and evaluate their influence on execution time. We also analyze and optimize the implementation of spatial computation in GeoSpark, especially the spatial join, and compare execution times when the spatial partitioner and spatial index are built on different datasets. In addition, for joins in which one dataset is very small and the other is very large, we propose a broadcast join that broadcasts the small dataset, avoiding shuffle operations and improving efficiency.

2) We put forward a solution for Spark adaptive parameter tuning. Given the intricacy of manual tuning, we examine the internal structure of Spark application execution and decompose the total application execution time into the execution times of its constituent tasks. On this basis, we analyze the factors that may influence task execution time and build a regression model between them. Using this model to predict task execution time, and hence total application execution time, under various parameter combinations, we can select a suitable combination as the recommendation, thereby achieving adaptive parameter tuning.

3) We implement and systematize spatial parallel computation and its optimizations. Following the requirements of the City Traffic Management Bureau, we summarize and implement a set of basic computation operators and describe the surrounding system and the scheduling of operators. We also detail the implementation of adaptive parameter tuning, focusing on its overall architecture and its interaction with the surrounding system to demonstrate its practicality.
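The four tuned parameters in contribution 1 correspond to standard Spark configuration keys. A minimal sketch of how such a parameter combination might be expressed (the values are placeholders for illustration, not the tuned values from this work; persistence level is set in code via `StorageLevel`, and is listed here only to group the four parameters together):

```python
# Illustrative Spark parameter combination covering the four tuned knobs:
# executor memory, executor cores, partition number, and persistence level.
# Values are placeholders, not recommendations from this thesis.
tuned_conf = {
    "spark.executor.memory": "4g",        # memory size per executor
    "spark.executor.cores": "2",          # number of cores per executor
    "spark.default.parallelism": "128",   # partition number for RDD operations
    # Chosen in code, e.g. rdd.persist(StorageLevel.MEMORY_AND_DISK):
    "persistence.level": "MEMORY_AND_DISK",
}

def to_submit_args(conf):
    """Render the genuine Spark keys as spark-submit --conf flags."""
    return [f"--conf {k}={v}" for k, v in conf.items() if k.startswith("spark.")]

print(to_submit_args(tuned_conf))
```

Enumerating candidate combinations of these keys is what makes the later adaptive-tuning search space concrete.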
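The broadcast join of contribution 1 replicates the small dataset so that the large dataset never needs to be shuffled. A single-machine sketch of the idea in plain Python (in Spark this would use `sc.broadcast` and a map-side hash lookup; the district/point data below is hypothetical):

```python
def broadcast_hash_join(small, large, key_small, key_large):
    """Inner-join a small dataset to a large one without shuffling either side.

    The small side is turned into an in-memory hash table (the analogue of
    broadcasting it to every executor); each large-side record is then
    matched locally by a hash lookup, so no repartitioning is required.
    """
    # Build phase: hash the small (broadcast) side.
    table = {}
    for rec in small:
        table.setdefault(key_small(rec), []).append(rec)
    # Probe phase: stream the large side record by record.
    for rec in large:
        for match in table.get(key_large(rec), []):
            yield (rec, match)

# Hypothetical example: match GPS points to a handful of districts by id.
districts = [("d1", "Downtown"), ("d2", "Harbor")]
points = [(1, "d1"), (2, "d2"), (3, "d1"), (4, "d9")]  # d9 has no match

joined = list(broadcast_hash_join(
    districts, points,
    key_small=lambda d: d[0],
    key_large=lambda p: p[1],
))
print(len(joined))  # the unmatched d9 point is dropped
```

This pays off exactly in the condition the abstract names: one side small enough to fit in each executor's memory, the other enormously large.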
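The regression model of contribution 2 can be sketched as fitting observed task execution time against a candidate factor, then summing predicted task times into an application-level estimate for each parameter combination. The sketch below uses a single hypothetical factor (partition count) with synthetic history data and a hand-rolled least-squares fit; the thesis model presumably uses more factors and real measurements:

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = a*x + b (single-factor sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Synthetic history: (partition count, observed mean task time in seconds).
history_x = [32, 64, 128, 256]
history_y = [9.8, 5.4, 3.1, 2.0]
a, b = fit_simple_ols(history_x, history_y)

def predict_app_time(partitions):
    """Predicted total time = predicted per-task time * number of tasks.

    Serial upper bound with one task per partition; a fuller model would
    account for the degree of parallelism across executors.
    """
    return (a * partitions + b) * partitions

# Recommend the candidate combination with the lowest predicted total time.
candidates = [64, 128, 256]
best = min(candidates, key=predict_app_time)
print(best)
```

This mirrors the abstract's pipeline: decompose application time into task times, model task time against influencing factors, predict under each parameter combination, and recommend the best one.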
Keywords/Search Tags: Spark, GeoSpark, Spatial Parallel Computation Optimization, Adaptive Parameter Tuning, Machine Learning, Regression