Research On Dynamic Placement Of RDD Data For Interactive Spark Applications

Posted on: 2019-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: S F Cheng
GTID: 2428330593450432
Subject: Computer technology
Abstract/Summary:
Apache Spark is a distributed in-memory computing platform that represents recent advances in massive data processing. The RDD is Spark's abstraction for massive data sets. Interactive applications are a typical class of Spark application in which request arrivals are highly uncertain. Such applications are usually run with dynamic resource allocation: the resources occupied by the application are increased or decreased according to the intensity of incoming interactive requests, which improves the utilization of platform resources. However, the existing dynamic resource allocation mechanism cannot preserve RDDs cached in an executor once that executor is shut down. When the RDD data is reused, it must be recomputed, and the recomputation cost reduces the execution efficiency of the job. To address this problem, this thesis proposes a dynamic RDD placement strategy for interactive applications. The core idea is to migrate cached RDDs dynamically into the storage space of executors that are not shut down, in order to improve both the execution efficiency of interactive Spark applications and the memory utilization of Spark.

The main contributions are as follows:

1) A dynamic placement strategy of cached RDDs for interactive Spark applications is proposed. The strategy first quantifies the placement revenue of cached RDD partitions, working at partition granularity and taking cached RDD integrity, recomputation cost, access frequency, and migration overhead as the main factors. It then uses Particle Swarm Optimization to select a placement scheme for the cached RDD partitions, with the goals of maximizing the placement revenue and maximizing the CPU resources released, so that high-value RDD partitions stay cached and the memory space of the executors is fully used.

2) An inactive-period prediction algorithm for interactive applications, based on a Markov model, is proposed. Because inactive periods differ in length, data placement can “jitter”, which reduces placement efficiency. Using the durations of historical inactive periods, the algorithm applies a Markov model to predict the trend in inactive-period duration for interactive Spark applications. This prediction, combined with the time cost of historical data placements, determines the timing of data placement in subsequent inactive periods.

3) A Spark system prototype with dynamic RDD placement was implemented on the basis of these results. Under the Standalone deployment mode, the proposed method was tested and analyzed in an interactive Spark SQL environment using the TPC-H on Hive benchmark. The results show that, compared with Spark's existing dynamic resource allocation, dynamic RDD placement reduces the response time of interactive queries by up to 97.58% and by 42.61% on average.
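The abstract does not give the details of the Markov-based predictor, so the following is only a minimal sketch of the general idea, not the thesis's algorithm. It assumes a first-order Markov chain over bucketed inactive-period durations, and it assumes that placement is worthwhile only when the lower bound of the predicted bucket exceeds the average historical placement time. The bucket boundaries and all function names (`to_state`, `fit_transitions`, `predict_next_state`, `should_place`) are hypothetical.

```python
from collections import Counter, defaultdict


def to_state(duration_s, boundaries):
    """Map a duration in seconds to the index of its bucket (assumed discretization)."""
    for i, upper in enumerate(boundaries):
        if duration_s < upper:
            return i
    return len(boundaries)


def fit_transitions(states):
    """Count first-order transitions between consecutive states."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(states, states[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next_state(counts, current):
    """Return the most frequent successor of `current`; ties go to the lowest state."""
    row = counts.get(current)
    if not row:
        return current  # no history for this state: assume no change
    return max(sorted(row), key=lambda s: row[s])


def should_place(history_s, placement_costs_s, boundaries):
    """Decide whether to start data placement in the next inactive period.

    Assumption: place only if the lower bound of the predicted duration
    bucket is larger than the average historical placement time.
    """
    states = [to_state(d, boundaries) for d in history_s]
    counts = fit_transitions(states)
    predicted = predict_next_state(counts, states[-1])
    lower_bound = 0.0 if predicted == 0 else boundaries[predicted - 1]
    avg_cost = sum(placement_costs_s) / len(placement_costs_s)
    return lower_bound > avg_cost


if __name__ == "__main__":
    # Hypothetical history: short and long inactive periods alternate.
    history = [5, 40, 6, 50, 4, 45, 5]
    print(should_place(history, placement_costs_s=[8, 12], boundaries=[10, 30]))
```

Comparing against the lower bound of the predicted bucket is a deliberately conservative choice: it avoids starting a migration that a short inactive period would interrupt, which is one way to damp the "jitter" the abstract describes.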
Keywords/Search Tags: Data Placement, RDD, Prediction, Spark Dynamic Resource Allocation, Memory Calculation