Dynamic Optimization Of Spark RDD Storage Solutions

Posted on:2018-03-22

Degree:Master

Type:Thesis

Country:China

Candidate:Y P Feng

Full Text:PDF

GTID:2428330590477763

Subject:Software engineering

Abstract/Summary:

With the growing demand for big data processing, distributed data processing frameworks such as Hadoop and Spark are increasingly popular. Large-scale data processing is ubiquitous in both scientific research and enterprise infrastructure. Data analytics workloads, such as image processing and customer demand prediction, usually run on these frameworks. The frameworks provide both scalability and fault tolerance to ensure high availability, and they simplify the development and implementation of diverse workloads. For example, Spark provides MLlib and GraphX, both built on Spark Core, for developing machine learning and graph computation programs. Performance is always a key concern in large-scale data processing, and improving it is an important goal in the development of these frameworks.

This thesis proposes a cost-based, dynamic optimization of Spark's storage solutions, which aims to relieve users of having to deal with the details of Spark's storage mechanisms. It also avoids the performance degradation that arises when users do not know the details of the hardware settings or the characteristics of their applications. The optimization targets RDD storage levels, which are the way Spark exposes its data storage mechanisms, and also aims to balance the usage of multiple resources and make the best use of them.

The optimization is an offline method that consists of a data collection process and an optimization process. The data collection process gathers data that reflects the runtime characteristics of RDDs. Because these characteristics are application-aware and resource-aware, the optimization is dynamic. The collected data is also pre-processed so that it can serve as input to the optimization process. In the optimization process, the proposed method applies a heuristic algorithm to the RDD characteristics and searches for better RDD storage level settings. Based on the running time and memory usage of each RDD under different storage levels, it estimates the total running time of the application by simulating the whole application, and iteratively refines the storage level settings. The evaluation shows that the proposed cost-based optimization is both necessary and effective.
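
The abstract does not give the heuristic algorithm or the simulator, so the following is only a minimal sketch of what a cost-based search over RDD storage levels could look like. Everything in it is an assumption made for illustration: the `Cost` record, the `simulate` and `optimize` functions, the additive cost model, the hard memory budget, the hill-climbing strategy, and all numbers in `profile`. Only the storage level names are taken from Spark's `StorageLevel` constants.

```python
from dataclasses import dataclass

# Storage level names mirror Spark's StorageLevel constants; "NONE" means the
# RDD is not persisted and is recomputed when needed.
LEVELS = ("NONE", "MEMORY_ONLY", "MEMORY_ONLY_SER", "DISK_ONLY")


@dataclass(frozen=True)
class Cost:
    """Hypothetical per-RDD measurement for one storage level."""
    seconds: float    # running time attributed to the RDD under this level
    memory_mb: float  # memory the RDD occupies under this level


def simulate(profile, assignment, memory_budget_mb):
    """Estimate total running time (assumed additive across RDDs).

    Assignments that exceed the memory budget are treated as infeasible.
    """
    total_time = 0.0
    total_mem = 0.0
    for rdd, level in assignment.items():
        cost = profile[rdd][level]
        total_time += cost.seconds
        total_mem += cost.memory_mb
    return total_time if total_mem <= memory_budget_mb else float("inf")


def optimize(profile, memory_budget_mb):
    """Hill climbing: change one RDD's level at a time while the estimate improves."""
    assignment = {rdd: "NONE" for rdd in profile}
    best = simulate(profile, assignment, memory_budget_mb)
    improved = True
    while improved:
        improved = False
        for rdd in profile:
            for level in LEVELS:
                if level == assignment[rdd]:
                    continue
                candidate = {**assignment, rdd: level}
                estimate = simulate(profile, candidate, memory_budget_mb)
                if estimate < best:
                    assignment, best, improved = candidate, estimate, True
    return assignment, best


# Made-up profile standing in for the output of the data collection process.
profile = {
    "rdd_a": {"NONE": Cost(40, 0), "MEMORY_ONLY": Cost(5, 800),
              "MEMORY_ONLY_SER": Cost(9, 300), "DISK_ONLY": Cost(15, 0)},
    "rdd_b": {"NONE": Cost(20, 0), "MEMORY_ONLY": Cost(2, 600),
              "MEMORY_ONLY_SER": Cost(4, 250), "DISK_ONLY": Cost(10, 0)},
}
baseline = simulate(profile, {rdd: "NONE" for rdd in profile}, 1000)
assignment, estimate = optimize(profile, memory_budget_mb=1000)
print(assignment, estimate, baseline)
```

With these made-up numbers the search improves on the unpersisted baseline but stops at a local optimum: a cheaper feasible assignment exists (serializing `rdd_a` in memory frees enough room to keep `rdd_b` in memory). A simple heuristic of this kind trades optimality for search cost, and a real optimizer would also need a simulator that accounts for dependencies between RDDs.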

Keywords/Search Tags:

Apache Spark, RDD, Data Storage, Performance Optimization

Related items

1	Design And Implementation Of A Performance Modeling System On Apache Spark
2	OCTWAS - Online Check-pointer for Workflows on Apache Spark
3	Performance Prediction And Optimization For Apache Spark Platform
4	Bigdata Job Performance Prediction Based On Apache Spark
5	Cost-based Configuration Optimization Analysis For Apache Spark
6	Performance Prediction And Optimization Of Apache Spark Based On SRFRP Model
7	Research On Apache Spark Distributed Parallel Computing Framework Optimization Technology
8	Research On Association Mining Optimization Based On Spark Distributed And Application Of Comprehensive Decision
9	Study On The Analysis And Optimization Of Column Storage Performance Based On Hive On Spark
10	A System For Distributed MD Data Analysis Based On Spark