
Research And Implementation Of Spark Performance Optimization For Police Data Processing

Posted on: 2019-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2348330569495573
Subject: Engineering
Abstract/Summary:
Nowadays, data is being generated at an alarming rate in every field, and the data collected by public security organizations grows day by day. This continuous flow of data poses challenges for those organizations: traditional databases find it difficult to increase processing capacity by horizontally scaling hardware. Against this backdrop, the big data processing engine Spark emerged. Spark makes full use of memory resources, adopts an advanced DAG scheduling model, and provides pipelined computation, which gives it strong advantages in big data processing. However, as a general-purpose computing framework, Spark still leaves considerable room for optimization in practice. This thesis starts from the performance issues Spark exposes in police data processing and carries out optimization research. The main work and results are as follows:

(1) The performance issues addressed in this thesis essentially stem from the difficulty of accurately estimating task execution time, so a task execution time prediction method is studied first. After a deep survey of performance prediction methods for existing distributed computing systems, this thesis builds the prediction model with an RBF neural network. Because traditional gradient-descent training of RBF networks easily falls into local extrema, the PSO algorithm is introduced to optimize the parameters, and the global optimization capability of PSO is further improved by applying Chebyshev chaotic maps. The model is finally trained by a combination of gradient descent and the improved PSO. Experiments show that this method achieves a smaller training error.

(2) Spark's default delay scheduling algorithm occasionally waits too long for a task's preferred location, resulting in low resource utilization. Based on the task execution time prediction model, a task whose performance loss from abandoning its preferred location is small is therefore scheduled onto currently idle resources. Experimental results show improved performance across different delay times.

(3) Spark's existing speculative execution algorithm uses a simple statistical method, which launches many unnecessary backup tasks in a heterogeneous environment and wastes cluster computing resources. This thesis therefore uses the task execution time prediction model to accurately identify straggler tasks before launching backup tasks. To further speed up job execution, a task migration acceleration strategy is added to the speculative execution algorithm.

(4) Spark does not cache data automatically, and the adaptive caching strategies proposed in prior work for the underlying RDDs do not apply to Spark SQL datasets. Building on those studies, this thesis proposes an execution plan cost model based on task execution time prediction, and on top of it an adaptive dataset caching strategy. To further improve performance, a union-and-push-down optimization is adopted to reduce the size of intermediate results.

(5) With the optimization algorithms applied to Spark, a police data processing platform based on Spark was designed and implemented. The platform features dynamic construction of query plans and fast data query.
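The chaos-improved PSO idea in (1) can be sketched in a few lines. The following is a minimal illustration, not the thesis's actual training procedure: the Chebyshev map order, swarm parameters, and objective function are all illustrative assumptions, and the real method combines this PSO with gradient descent over RBF network parameters.

```python
import math
import random

def chebyshev_sequence(x0, n, order=4):
    """Chebyshev chaotic map x_{k+1} = cos(order * arccos(x_k)) on [-1, 1].
    Chaotic sequences cover the interval more evenly than pseudo-random
    draws, which is why they are used to diversify the initial swarm."""
    seq, x = [], x0
    for _ in range(n):
        x = math.cos(order * math.acos(x))
        seq.append(x)
    return seq

def chaotic_pso(objective, dim, n_particles=20, iters=100, lo=-5.0, hi=5.0,
                w=0.7, c1=1.5, c2=1.5):
    """Standard PSO whose swarm positions are initialized from a Chebyshev
    chaotic sequence (mapped from [-1, 1] into [lo, hi]) instead of uniform
    random numbers.  Returns (best_position, best_value)."""
    chaos = chebyshev_sequence(0.3, n_particles * dim)
    pos = [[lo + (chaos[i * dim + d] + 1) / 2 * (hi - lo) for d in range(dim)]
           for i in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

In RBF training, the objective would be the network's training error as a function of its centers, widths, and output weights, rather than the toy function used here.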
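The scheduling decision in (2) reduces to comparing predicted execution times. This is a hypothetical sketch of that decision rule, not Spark's actual delay scheduler: the function name, the relative loss threshold, and the wait-time comparison are assumptions made for illustration.

```python
def should_abandon_locality(pred_local_s, pred_remote_s, expected_wait_s,
                            loss_threshold=0.1):
    """Decide whether to run a task now on an idle non-preferred node
    rather than keep waiting for its preferred (data-local) node.

    pred_local_s    -- predicted execution time on the preferred node
    pred_remote_s   -- predicted execution time on the idle node
    expected_wait_s -- expected wait before the preferred node frees up
    """
    # relative performance loss from giving up data locality
    loss = (pred_remote_s - pred_local_s) / pred_local_s
    # schedule remotely if the loss is small, or if running remotely now
    # still finishes before waiting and then running locally
    return loss <= loss_threshold or pred_remote_s < expected_wait_s + pred_local_s
```

For example, a task predicted to take 10 s locally and 10.5 s remotely (5% loss) would be dispatched immediately, while a task that doubles in cost off-node keeps waiting for its preferred location.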
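The straggler-identification step in (3) can be illustrated by projecting each running task's finish time from its progress and comparing it against the model's prediction. The task representation, field names, and slowdown factor below are illustrative assumptions, not the thesis's actual algorithm.

```python
def find_stragglers(running_tasks, slowdown=1.5):
    """Flag running tasks whose projected total run time exceeds the
    model-predicted execution time by more than `slowdown`.

    Each task is a dict with:
      'id'          -- task identifier
      'elapsed_s'   -- time the task has been running so far
      'progress'    -- fraction of work completed, in (0, 1]
      'predicted_s' -- execution time from the prediction model
    Only flagged tasks get backup copies, avoiding the unnecessary
    backups a purely statistical rule launches on heterogeneous nodes.
    """
    stragglers = []
    for t in running_tasks:
        projected = t["elapsed_s"] / max(t["progress"], 1e-6)
        if projected > slowdown * t["predicted_s"]:
            stragglers.append(t["id"])
    return stragglers
```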
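The cost-model caching decision in (4) can be sketched as a simple cost comparison. This is a minimal illustration under assumed cost terms (compute, cache write, cache read); the thesis's actual execution plan cost model is more detailed.

```python
def should_cache(pred_compute_s, expected_reuses, cache_write_s, cache_read_s):
    """Cache a dataset when recomputing it on every reuse is predicted to
    cost more than materializing it once and reading it back each time.

    pred_compute_s  -- predicted time to compute the dataset once
    expected_reuses -- how many more times the dataset will be read
    cache_write_s   -- one-time cost of writing it to the cache
    cache_read_s    -- cost of reading it from the cache per reuse
    """
    # compute once now, then recompute on every later reuse
    cost_recompute = pred_compute_s * (1 + expected_reuses)
    # compute once, pay the write cost, then read from cache on each reuse
    cost_cached = pred_compute_s + cache_write_s + cache_read_s * expected_reuses
    return cost_cached < cost_recompute
```

A dataset that is expensive to compute and reused several times is cached; a dataset read only once never is, since caching then adds pure overhead.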
Keywords: Spark, performance optimization, performance prediction, task scheduling, adaptive cache