
Research On Serialization Storage Mechanism Based On Spark Cluster

Posted on: 2018-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: F F Yang
Full Text: PDF
GTID: 2348330569986475
Subject: Computer technology
Abstract/Summary:
Distributed computing frameworks provide a computing platform for promoting big data applications across many fields. MapReduce offers an effective approach to distributed batch processing, but its heavy I/O access cost reduces data-processing efficiency and cannot meet the demands of efficient, real-time data processing. The Spark computing framework emerged to meet these needs. Spark is a fast, general-purpose parallel computing framework designed for large-scale data processing, characterized by in-memory computing, high fault tolerance, and high scalability. The framework handles data through RDDs (Resilient Distributed Datasets); because data processing is performed mainly through in-memory iteration, it can greatly improve computational efficiency.

Addressing the problems in Spark's RDD-based in-memory computing model, this thesis makes the following contributions:

1. To address the low computational efficiency caused by serialized storage during in-memory iterative computation, this thesis proposes a serialized storage strategy based on the type of operator, the size of the dataset, the efficiency of the RDD, and other factors. A normalized weighting model of RDDs is then established to obtain the set of RDDs for serialized storage, so that the most valuable RDDs are kept in memory when memory is scarce. Compared with Spark's default serialization storage mechanism, experimental results show that this strategy improves the computational efficiency of tasks and raises memory utilization.

2. Because a single node cannot store all the data, the cluster as a whole suffers low efficiency. This thesis therefore further proposes a global serialization strategy based on the design of Tachyon (a distributed cache system). A cache-layer interface for RDDs is designed so that, when memory is insufficient, RDDs are stored in Tachyon, maximizing in-memory RDD computation and enhancing the processing capacity of the computing framework. Experimental results verify the effectiveness of the global serialization storage strategy.

In conclusion, the serialized in-memory storage scheme is a key factor affecting Spark's processing capacity, and its optimization can effectively improve the overall performance of the Spark parallel computing framework.
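The abstract does not give the form of the normalized weighting model in contribution 1. A minimal illustrative sketch of such a scoring scheme, assuming hypothetical features (operator recomputation cost, dataset size, reuse count) and assumed weights that stand in for the thesis's actual factors, might look like:

```python
# Illustrative sketch of a normalized weighting model for choosing which
# RDDs to keep (serialized) in memory when memory is scarce. Feature
# names and weights are assumptions for illustration only; the thesis's
# actual model is not specified in the abstract.

def normalize(values):
    """Scale a list of raw feature values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_rdds(rdds, weights=(0.4, 0.3, 0.3)):
    """Score candidate RDDs by weighted, normalized features and return
    them sorted from most to least valuable to cache.

    Each candidate is (name, operator_cost, size_mb, reuse_count):
    costly-to-recompute, small, frequently reused RDDs score highest.
    """
    costs = normalize([r[1] for r in rdds])
    # Smaller datasets are cheaper to keep resident, so invert the size score.
    sizes = [1.0 - s for s in normalize([r[2] for r in rdds])]
    reuse = normalize([r[3] for r in rdds])
    w_cost, w_size, w_reuse = weights
    scored = [
        (r[0], w_cost * c + w_size * s + w_reuse * u)
        for r, c, s, u in zip(rdds, costs, sizes, reuse)
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)

candidates = [
    ("rdd_join",    8.0, 512, 5),  # expensive join, reused often
    ("rdd_map",     1.0, 128, 1),  # cheap map, reused once
    ("rdd_groupBy", 5.0, 900, 3),
]
ranking = rank_rdds(candidates)
print(ranking[0][0])  # → rdd_join
```

The highest-ranked RDDs would be retained in memory (e.g. serialized via Spark's `persist`), while the rest are evicted or, under the global strategy of contribution 2, spilled to a distributed cache layer such as Tachyon.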
Keywords/Search Tags: Spark, memory, RDD, operator, serialization storage