
Research on Adaptive Memory Management Based on In-Memory Computing Characteristics in Spark

Posted on: 2017-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhao
Full Text: PDF
GTID: 2428330590988901
Subject: Software engineering
Abstract/Summary:
With the advent of the information age comes an enormous volume of data. In the 21st century, often called the era of "big data", how to process such data has become a pressing problem. Distributed computing is currently the mainstream approach: by building a distributed cluster, a system can obtain computing power comparable to a supercomputer together with huge storage capacity, and both can be scaled further by expanding the cluster. However, limited memory capacity remains a major factor restricting the performance of distributed systems. In recent years, alongside improvements in memory manufacturing, Spark, a new distributed framework based on in-memory computation, has emerged. For iterative machine-learning workloads and interactive queries, Spark outperforms other distributed frameworks. Nevertheless, memory capacity is usually far smaller than the data volume, and in that case Spark encounters performance bottlenecks. Making better use of memory is therefore a key issue in improving Spark performance.

To address this problem, we design an adaptive memory tuning strategy consisting of three adaptive tuning algorithms.

The first is the serialization adaptive algorithm. Data serialization is a common optimization in distributed systems, and choosing the right serializer has a large influence on performance: serialization saves storage space, reduces garbage-collection pressure, and is in any case required for data transmitted between cluster nodes. The serialization adaptive algorithm selects an appropriate serializer based on the system's resource consumption.

The second is the compression adaptive algorithm. Compression can shrink data to a fraction, even a tenth or less, of its original size, saving far more storage space than serialization alone. Different compression algorithms, however, perform differently, so the compression adaptive algorithm chooses an appropriate codec according to run-time system information.

The third is the garbage collection adaptive algorithm. Spark runs on the JVM, so the performance of the JVM directly affects the performance of the entire system. The garbage collection adaptive algorithm tunes garbage collection by collecting and analyzing information about the current system.

In the design part, by adding to and modifying Spark source code, we implement the SATS (Spark Adaptive Tuning Strategy) subsystem on top of Spark. The subsystem is divided into three modules: a run-time data collection module, an adaptive decision module, and a parameter optimization module. The implementation part describes these three modules in detail. In the experimental part, we analyze the experimental results in detail and verify the effectiveness of the adaptive tuning strategy.
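The abstract does not spell out how the serialization adaptive algorithm makes its choice, so the following is only a minimal sketch of the kind of measurement such a decision could rest on: comparing the serialized size of a representative record under Spark's two standard serializers (plain Java serialization and Kryo) and setting the real spark.serializer option accordingly. The object and method names here are ours, not SATS's.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Hypothetical helper (names ours): compare the serialized size of a sample
// record under Java serialization and Kryo, then return the matching value
// for Spark's spark.serializer setting.
object SerializerChooser {

  private def javaSize(obj: java.io.Serializable): Int = {
    val bytes = new ByteArrayOutputStream()
    val out   = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  private def kryoSize(obj: AnyRef): Int = {
    val kryo = new Kryo()
    kryo.setRegistrationRequired(false) // accept unregistered classes in this sketch
    val bytes  = new ByteArrayOutputStream()
    val output = new Output(bytes)
    kryo.writeClassAndObject(output, obj)
    output.close()
    bytes.size()
  }

  // Pick whichever serializer produces the smaller payload for the sample.
  def choose(sample: java.io.Serializable): String =
    if (kryoSize(sample) < javaSize(sample))
      "org.apache.spark.serializer.KryoSerializer"
    else
      "org.apache.spark.serializer.JavaSerializer"
}
```

A tuner would feed the chosen class name into SparkConf.set("spark.serializer", ...) before the application starts, since the serializer cannot be swapped mid-job.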
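Likewise, the compression adaptive idea can be illustrated by benchmarking two codecs Spark actually bundles (snappy via snappy-java and lz4 via lz4-java) on a data sample, then picking a value for spark.io.compression.codec from the measured ratio and speed. This is a sketch under those assumptions; the thesis's actual decision inputs ("system information") are not given here, and the cpuBound flag below is our simplification.

```scala
import net.jpountz.lz4.LZ4Factory
import org.xerial.snappy.Snappy

// Illustrative benchmark (names and decision rule ours): compress a sample
// with two codecs Spark bundles, record compression ratio and time, and pick
// a codec name suitable for spark.io.compression.codec.
object CodecChooser {

  final case class Measurement(codec: String, ratio: Double, micros: Long)

  private val lz4 = LZ4Factory.fastestInstance().fastCompressor()

  private def timed[A](body: => A): (A, Long) = {
    val start  = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000)
  }

  def measure(sample: Array[Byte]): Seq[Measurement] = {
    val (snappyOut, snappyUs) = timed(Snappy.compress(sample))
    val (lz4Out, lz4Us)       = timed(lz4.compress(sample))
    Seq(
      Measurement("snappy", snappyOut.length.toDouble / sample.length, snappyUs),
      Measurement("lz4",    lz4Out.length.toDouble / sample.length,    lz4Us)
    )
  }

  // CPU-bound cluster: favour the faster codec.
  // Memory- or network-bound cluster: favour the smaller output.
  def choose(sample: Array[Byte], cpuBound: Boolean): String = {
    val results = measure(sample)
    (if (cpuBound) results.minBy(_.micros) else results.minBy(_.ratio)).codec
  }
}
```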
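For the garbage collection adaptive algorithm, the JVM already exposes the raw signal such a tuner needs: cumulative collection time per collector, readable through the standard GarbageCollectorMXBean API. The 10% threshold and the suggested flags below are illustrative assumptions, not the thesis's policy.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// Sketch of the raw signal a GC tuner can collect: cumulative GC time across
// all collectors, via the standard GarbageCollectorMXBean API. The threshold
// and flag suggestions are assumptions for illustration only.
object GcMonitor {

  def totalGcMillis: Long =
    ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(_.getCollectionTime) // -1 means the collector does not report time
      .filter(_ >= 0)
      .sum

  // Fraction of a wall-clock window spent in garbage collection.
  def gcOverhead(windowMillis: Long, gcMillisInWindow: Long): Double =
    gcMillisInWindow.toDouble / windowMillis

  def suggestion(overhead: Double): String =
    if (overhead > 0.10)
      "high GC pressure: consider -XX:+UseG1GC or a larger young generation"
    else
      "GC overhead acceptable: keep current settings"
}
```

In use, a monitor would sample totalGcMillis at the start and end of each window, difference the two readings, and pass the result to gcOverhead.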
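Finally, the three-module structure of SATS (run-time data collection, adaptive decision, parameter optimization) maps naturally onto a small pipeline of interfaces. The skeleton below is a hypothetical rendering of that architecture; all type names are ours, as the thesis's source is not reproduced here.

```scala
// Hypothetical skeleton (all names ours) of the three-module pipeline the
// abstract describes: collect run-time metrics, decide on a tuning action,
// then apply the new parameters.
trait MetricsCollector   { def collect(): Map[String, Double] }
final case class Decision(params: Map[String, String])
trait DecisionModule     { def decide(metrics: Map[String, Double]): Decision }
trait ParameterOptimizer { def apply(decision: Decision): Unit }

final class SatsPipeline(
    collector: MetricsCollector,
    decider:   DecisionModule,
    optimizer: ParameterOptimizer
) {
  // One tuning round: metrics in, updated Spark parameters out.
  def runOnce(): Unit = optimizer.apply(decider.decide(collector.collect()))
}
```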
Keywords/Search Tags: big data, distributed computer system, Spark, in-memory computation, adaptive