
Dynamic Optimization Of Spark RDD Storage Solutions

Posted on: 2018-03-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y P Feng
Full Text: PDF
GTID: 2428330590477763
Subject: Software engineering
Abstract/Summary:
With the increasing demand for big data processing, distributed data processing frameworks such as Hadoop and Spark are enjoying growing popularity. Large-scale data processing is ubiquitous in both scientific research and enterprise infrastructure. Data analytics workloads, such as image processing and customer demand prediction, usually run on these frameworks. The frameworks provide both scalability and fault tolerance to ensure high availability. Furthermore, they simplify the development and implementation of diverse workloads. For example, Spark provides MLlib and GraphX, both built on Spark Core, for developing machine learning and graph computation programs. Performance is always a key concern in large-scale data processing, and improving it is an important goal in the development of these frameworks.

This thesis proposes a cost-based, dynamic optimization of Spark's storage solutions. It aims to relieve users of having to manage the details of Spark's storage solutions, and to avoid the performance degradation that occurs when users do not know the details of the hardware settings or the characteristics of their applications. Building on the optimization of RDD storage levels, which are the mechanism Spark uses to expose its data storage options, the optimization also aims to balance the usage of multiple resources and to make the best use of them.

The optimization is an offline method that consists of a data collection process and an optimization process. The data collection process gathers data that reflects the runtime characteristics of RDDs. Because these characteristics are application-aware and resource-aware, the optimization is dynamic. The collected data is also pre-processed so that it can serve as input to the optimization process. In the optimization process, the proposed method applies a heuristic algorithm to the RDD characteristics and searches for better RDD storage level settings. Based on the running time and memory usage of each RDD under different storage levels, it estimates the total running time of the application by simulating the whole application, and iteratively refines the storage level settings. The evaluation shows that the proposed cost-based optimization is both necessary and effective.
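The abstract does not specify the heuristic or the simulator, so the following is only a minimal sketch of the general idea: a greedy hill-climbing search over per-RDD storage levels, scored by a simulated total running time under a memory budget. The `RddProfile` class, the additive cost model in `simulate`, the memory budget, and all numbers are hypothetical and chosen for illustration; only the storage level names (`NONE`, `MEMORY_ONLY`, `DISK_ONLY`) correspond to real Spark storage levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RddProfile:
    """Hypothetical profile of one RDD, as a data collection step might produce."""
    name: str
    accesses: int   # number of times the application reads this RDD
    levels: dict    # storage level -> (seconds per access, memory footprint in MB)


def simulate(profiles, assignment, memory_budget_mb):
    """Estimate total running time for one assignment of storage levels.

    Assumed cost model: time is additive over RDDs. An assignment that
    exceeds the memory budget is treated as infeasible (infinite cost).
    """
    total_time = 0.0
    total_mem = 0.0
    for p in profiles:
        secs, mem = p.levels[assignment[p.name]]
        total_time += secs * p.accesses
        total_mem += mem
    return total_time if total_mem <= memory_budget_mb else float("inf")


def optimize(profiles, memory_budget_mb):
    """Greedy hill climbing: start with nothing persisted, then repeatedly
    apply the single storage-level change that most reduces simulated time."""
    assignment = {p.name: "NONE" for p in profiles}
    best = simulate(profiles, assignment, memory_budget_mb)
    while True:
        best_move = None
        for p in profiles:
            for level in p.levels:
                if level == assignment[p.name]:
                    continue
                candidate = {**assignment, p.name: level}
                cost = simulate(profiles, candidate, memory_budget_mb)
                if cost < best:
                    best, best_move = cost, (p.name, level)
        if best_move is None:
            return assignment, best  # no single change improves the estimate
        assignment[best_move[0]] = best_move[1]


# Illustrative input: two RDDs, 1000 MB of memory available for caching.
profiles = [
    RddProfile("A", 3, {"NONE": (10.0, 0), "MEMORY_ONLY": (1.0, 600), "DISK_ONLY": (4.0, 0)}),
    RddProfile("B", 5, {"NONE": (6.0, 0), "MEMORY_ONLY": (0.5, 500), "DISK_ONLY": (3.0, 0)}),
]
assignment, total = optimize(profiles, memory_budget_mb=1000)
print(assignment, total)
```

With these made-up numbers, both RDDs cannot fit in memory at once, so the search keeps the more frequently read RDD in memory and places the other on disk. A greedy search like this can stop at a local optimum; it stands in for whatever heuristic the thesis actually uses.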
Keywords/Search Tags: Apache Spark, RDD, Data Storage, Performance Optimization