
Research On Spatial Parallel Computation And Adaptive Parameter Tuning Based On Spark

Posted on: 2020-07-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Peng
Full Text: PDF
GTID: 2480305897967529
Subject: Cartography and Geographic Information System
Abstract/Summary:
Big Data has been a popular research topic in recent years and is widely applied across many disciplines. In the GIS field, the focus is on spatio-temporal Big Data, and an essential prerequisite for mining and utilizing it is efficient spatial Big Data computation. The mainstream Big Data frameworks, such as Hadoop, Spark, and Storm, do not natively support the processing of spatial data, while systems built on top of Hadoop or Spark, such as GeoSpark, GeoMesa, SpatialHadoop, and Hadoop-GIS, either support only an incomplete set of spatial computation methods or perform spatial computation inefficiently. We therefore explore the systematization of spatial computation based on Spark and GeoSpark, study the tuning of Spark parameters, optimize the implementation of spatial computation in GeoSpark, and, given the intricacy of manual parameter tuning, propose a solution for Spark adaptive parameter tuning. The main contributions are as follows:

1) We discuss the tuning of several important parameters, including memory size, number of cores, number of partitions, and data persistence level, and evaluate their influence on execution time. We also analyze and optimize the implementation of spatial computation in GeoSpark, especially the spatial join, and compare execution times when the spatial partitioner and spatial index are built on different datasets. In addition, for joins in which one dataset is very small and the other is very large, we propose a broadcast join that broadcasts the small dataset, avoiding shuffle operations and improving efficiency.

2) We put forward a solution for Spark adaptive parameter tuning. Given the intricacy of manual tuning, we examine the internal structure of Spark application execution and decompose the total application execution time into the execution times of its constituent tasks. On this basis, we analyze the factors that may influence task execution time and build a regression model between them. Using this model to predict task execution time, and hence total application execution time, under various parameter combinations, we can select a suitable combination as the recommendation, thereby achieving adaptive parameter tuning.

3) We implement and systematize spatial parallel computation and its optimizations. Following the requirements of the City Traffic Management Bureau, we summarize and implement a set of basic computation operators and describe the surrounding system and the scheduling of operators. We also detail the implementation of adaptive parameter tuning, focusing on its overall architecture and its interaction with the surrounding system to demonstrate its practicality.
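The four tuned parameters in contribution 1 correspond to standard Spark configuration keys. A minimal sketch of how such a parameter combination might be expressed (the values are placeholders for illustration, not the tuned values from this work; persistence level is set in code via `StorageLevel`, and is listed here only to group the four parameters together):

```python
# Illustrative Spark parameter combination covering the four tuned knobs:
# executor memory, executor cores, partition number, and persistence level.
# Values are placeholders, not recommendations from this thesis.
tuned_conf = {
    "spark.executor.memory": "4g",        # memory size per executor
    "spark.executor.cores": "2",          # number of cores per executor
    "spark.default.parallelism": "128",   # partition number for RDD operations
    # Chosen in code, e.g. rdd.persist(StorageLevel.MEMORY_AND_DISK):
    "persistence.level": "MEMORY_AND_DISK",
}

def to_submit_args(conf):
    """Render the genuine Spark keys as spark-submit --conf flags."""
    return [f"--conf {k}={v}" for k, v in conf.items() if k.startswith("spark.")]

print(to_submit_args(tuned_conf))
```

Enumerating candidate combinations of these keys is what makes the later adaptive-tuning search space concrete.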
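The broadcast join of contribution 1 replicates the small dataset so that the large dataset never needs to be shuffled. A single-machine sketch of the idea in plain Python (in Spark this would use `sc.broadcast` and a map-side hash lookup; the district/point data below is hypothetical):

```python
def broadcast_hash_join(small, large, key_small, key_large):
    """Inner-join a small dataset to a large one without shuffling either side.

    The small side is turned into an in-memory hash table (the analogue of
    broadcasting it to every executor); each large-side record is then
    matched locally by a hash lookup, so no repartitioning is required.
    """
    # Build phase: hash the small (broadcast) side.
    table = {}
    for rec in small:
        table.setdefault(key_small(rec), []).append(rec)
    # Probe phase: stream the large side record by record.
    for rec in large:
        for match in table.get(key_large(rec), []):
            yield (rec, match)

# Hypothetical example: match GPS points to a handful of districts by id.
districts = [("d1", "Downtown"), ("d2", "Harbor")]
points = [(1, "d1"), (2, "d2"), (3, "d1"), (4, "d9")]  # d9 has no match

joined = list(broadcast_hash_join(
    districts, points,
    key_small=lambda d: d[0],
    key_large=lambda p: p[1],
))
print(len(joined))  # the unmatched d9 point is dropped
```

This pays off exactly in the condition the abstract names: one side small enough to fit in each executor's memory, the other enormously large.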
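The regression model of contribution 2 can be sketched as fitting observed task execution time against a candidate factor, then summing predicted task times into an application-level estimate for each parameter combination. The sketch below uses a single hypothetical factor (partition count) with synthetic history data and a hand-rolled least-squares fit; the thesis model presumably uses more factors and real measurements:

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = a*x + b (single-factor sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Synthetic history: (partition count, observed mean task time in seconds).
history_x = [32, 64, 128, 256]
history_y = [9.8, 5.4, 3.1, 2.0]
a, b = fit_simple_ols(history_x, history_y)

def predict_app_time(partitions):
    """Predicted total time = predicted per-task time * number of tasks.

    Serial upper bound with one task per partition; a fuller model would
    account for the degree of parallelism across executors.
    """
    return (a * partitions + b) * partitions

# Recommend the candidate combination with the lowest predicted total time.
candidates = [64, 128, 256]
best = min(candidates, key=predict_app_time)
print(best)
```

This mirrors the abstract's pipeline: decompose application time into task times, model task time against influencing factors, predict under each parameter combination, and recommend the best one.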
Keywords/Search Tags: Spark, GeoSpark, Spatial Parallel Computation Optimization, Adaptive Parameter Tuning, Machine Learning, Regression