
Research On Memory Management And Cache Replacement Policies In Spark

Posted on: 2017-08-12    Degree: Master    Type: Thesis
Country: China    Candidate: H T Meng    Full Text: PDF
GTID: 2428330569498547    Subject: Software engineering
Abstract/Summary:
The development of science and technology has brought us into the era of big data. How to turn large amounts of data into value is the central problem of big data processing. Because a traditional stand-alone computer can hardly store and process such volumes of data, many distributed storage and computing frameworks have been developed. A distributed system manages and coordinates a large number of ordinary computers to complete storage and computing jobs: to the outside it appears as one huge computer, while each node in the cluster stores and processes part of the data. Since memory access latency is measured in nanoseconds while disk access latency is measured in microseconds to milliseconds, making full use of memory to accelerate distributed storage and processing is a central research issue today.

Spark is a distributed in-memory computing system based on the Map-Reduce programming model. Spark proposes a new abstraction, the Resilient Distributed Dataset (RDD), which provides fault tolerance while parallelizing data processing. Spark can keep the intermediate data of the Map-Reduce process in memory and cache the important RDDs of a Spark application, improving both the performance and the memory utilization of the Spark system. This thesis studies the implementation mechanism and resource management of the Spark system, studies and tests the characteristics of Spark memory, and designs and implements two novel cache policies for Spark. The main contributions and innovations of this thesis are as follows:

(1) We study the implementation mechanism, operating mechanism, and resource management of Spark as a distributed in-memory computing system, together with its memory management and usage. Using the BigDataBench big data benchmark suite, we analyze through experiments how Shuffle memory and Storage memory are managed and used.

(2) We design and implement a cache strategy for Spark: the distributed weight replacement policy (DWRP). The main idea of DWRP is as follows: first, select candidate RDD partitions according to the distributed layout of the RDD; then compute a weight for each selected partition from characteristics such as its access frequency; finally, evict the RDD partition with the smallest weight. The DWRP strategy is suitable for Spark clusters running multiple Spark applications.

(3) We design and implement a second cache strategy for Spark: the double execution replacement policy (DERP). The main idea of DERP is as follows: DERP consists of two executions. The first execution uses only a small part of the input data and records the DAG information of the Spark application. The second execution runs the application formally on the full input; guided by the DAG information obtained from the first execution, DERP evicts useless RDD partitions and proactively caches the more valuable ones. The DERP strategy is suitable for Spark clusters that usually run a single Spark application.

By studying Spark as a representative distributed in-memory computing system, our work provides technical support for further improving the performance of Spark, and suggests a way to optimize the memory utilization of other distributed systems as well.
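The eviction step of the DWRP strategy in (2) can be sketched in plain Python. This is a minimal illustration, not the thesis's implementation: the class names are invented for this example, and the weight formula combining access frequency, recency, and partition size is a hypothetical stand-in for the weight function defined in the thesis body.

```python
import time

class Partition:
    """A cached RDD partition with the statistics a weight function might use."""
    def __init__(self, rdd_id, index, size_bytes):
        self.rdd_id = rdd_id
        self.index = index
        self.size_bytes = size_bytes
        self.access_count = 0
        self.last_access = time.time()

    def touch(self):
        """Record one access to this partition."""
        self.access_count += 1
        self.last_access = time.time()

class WeightedCache:
    """Evicts the partition with the smallest weight when capacity is exceeded."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.partitions = {}  # (rdd_id, index) -> Partition

    def weight(self, p, now):
        # Hypothetical weight: partitions that are accessed often, accessed
        # recently, and cheap to keep (small) get a higher weight.
        recency = 1.0 / (1.0 + (now - p.last_access))
        return p.access_count * recency / p.size_bytes

    def put(self, p):
        now = time.time()
        # Evict lowest-weight partitions until the new one fits.
        while self.used + p.size_bytes > self.capacity and self.partitions:
            victim = min(self.partitions.values(),
                         key=lambda q: self.weight(q, now))
            self.used -= victim.size_bytes
            del self.partitions[(victim.rdd_id, victim.index)]
        self.partitions[(p.rdd_id, p.index)] = p
        self.used += p.size_bytes
```

Under such a policy, a partition that has never been re-accessed has weight zero and is evicted before one that has, regardless of insertion order.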
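The two-execution idea behind DERP in (3) can likewise be sketched. The sketch below assumes the DAG recorded by the first (sampled) execution is given as a parent-list mapping, and uses a simple reuse count as the caching criterion (an RDD consumed by more than one descendant is worth caching proactively); the actual criteria in the thesis may differ.

```python
def profile_dag(dag):
    """First (sampled) execution: count how many downstream RDDs consume each
    RDD. `dag` maps each RDD id to the list of RDD ids it is computed from."""
    reuse = {}
    for rdd, parents in dag.items():
        for p in parents:
            reuse[p] = reuse.get(p, 0) + 1
    return reuse

def plan_cache(dag):
    """Second (full) execution plan: proactively cache RDDs consumed by more
    than one descendant; single-use RDDs are not worth keeping in memory."""
    reuse = profile_dag(dag)
    return {rdd for rdd, n in reuse.items() if n > 1}
```

For a diamond-shaped DAG where B and C are both derived from A and joined into D, only A is selected for proactive caching, since B and C are each consumed exactly once.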
Keywords/Search Tags: distributed computing, Spark, RDD, memory management, cache replacement policy