
Research on Adaptive Memory Management Based on In-Memory Computing Characteristics in Spark

Posted on: 2017-09-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Zhao
Full Text: PDF
GTID: 2428330590988901
Subject: Software engineering
Abstract/Summary:
With the advent of the information age comes an enormous volume of data. In the 21st century, often called the era of "big data", how to process such data has become a pressing problem. Distributed computing is currently the mainstream approach: by building a distributed cluster, a system can obtain computing power comparable to a supercomputer together with huge storage capacity, and both can be scaled further by expanding the cluster. However, limited memory capacity remains a major factor restricting the performance of distributed systems. In recent years, alongside improvements in memory manufacturing, Spark, a new distributed framework based on in-memory computation, has emerged. For iterative machine-learning workloads and interactive queries, Spark outperforms other distributed frameworks. Nevertheless, memory capacity is usually far smaller than the data volume, and in that case Spark encounters performance bottlenecks. Making better use of memory is therefore a key issue in improving Spark performance.

To address this problem, we design an adaptive memory tuning strategy consisting of three adaptive tuning algorithms.

The first is the serialization adaptive algorithm. Data serialization is a common optimization in distributed systems, and choosing the right serializer has a large influence on performance: serialization saves storage space, reduces garbage-collection pressure, and is in any case required for data transmitted between cluster nodes. The serialization adaptive algorithm selects an appropriate serializer based on the system's resource consumption.

The second is the compression adaptive algorithm. Compression can shrink data to a fraction, even a tenth or less, of its original size, saving far more storage space than serialization alone. Different compression algorithms, however, perform differently, so the compression adaptive algorithm chooses an appropriate codec according to run-time system information.

The third is the garbage collection adaptive algorithm. Spark runs on the JVM, so the performance of the JVM directly affects the performance of the entire system. The garbage collection adaptive algorithm tunes garbage collection by collecting and analyzing information about the current system.

In the design part, by adding to and modifying Spark source code, we implement the SATS (Spark Adaptive Tuning Strategy) subsystem on top of Spark. The subsystem is divided into three modules: a run-time data collection module, an adaptive decision module, and a parameter optimization module. The implementation part describes these three modules in detail. In the experimental part, we analyze the experimental results in detail and verify the effectiveness of the adaptive tuning strategy.
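The abstract does not spell out how the serialization adaptive algorithm makes its choice, so the following is only a minimal sketch of the kind of measurement such a decision could rest on: comparing the serialized size of a representative record under Spark's two standard serializers (plain Java serialization and Kryo) and setting the real spark.serializer option accordingly. The object and method names here are ours, not SATS's.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.io.Output

// Hypothetical helper (names ours): compare the serialized size of a sample
// record under Java serialization and Kryo, then return the matching value
// for Spark's spark.serializer setting.
object SerializerChooser {

  private def javaSize(obj: java.io.Serializable): Int = {
    val bytes = new ByteArrayOutputStream()
    val out   = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  private def kryoSize(obj: AnyRef): Int = {
    val kryo = new Kryo()
    kryo.setRegistrationRequired(false) // accept unregistered classes in this sketch
    val bytes  = new ByteArrayOutputStream()
    val output = new Output(bytes)
    kryo.writeClassAndObject(output, obj)
    output.close()
    bytes.size()
  }

  // Pick whichever serializer produces the smaller payload for the sample.
  def choose(sample: java.io.Serializable): String =
    if (kryoSize(sample) < javaSize(sample))
      "org.apache.spark.serializer.KryoSerializer"
    else
      "org.apache.spark.serializer.JavaSerializer"
}
```

A tuner would feed the chosen class name into SparkConf.set("spark.serializer", ...) before the application starts, since the serializer cannot be swapped mid-job.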
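Likewise, the compression adaptive idea can be illustrated by benchmarking two codecs Spark actually bundles (snappy via snappy-java and lz4 via lz4-java) on a data sample, then picking a value for spark.io.compression.codec from the measured ratio and speed. This is a sketch under those assumptions; the thesis's actual decision inputs ("system information") are not given here, and the cpuBound flag below is our simplification.

```scala
import net.jpountz.lz4.LZ4Factory
import org.xerial.snappy.Snappy

// Illustrative benchmark (names and decision rule ours): compress a sample
// with two codecs Spark bundles, record compression ratio and time, and pick
// a codec name suitable for spark.io.compression.codec.
object CodecChooser {

  final case class Measurement(codec: String, ratio: Double, micros: Long)

  private val lz4 = LZ4Factory.fastestInstance().fastCompressor()

  private def timed[A](body: => A): (A, Long) = {
    val start  = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000)
  }

  def measure(sample: Array[Byte]): Seq[Measurement] = {
    val (snappyOut, snappyUs) = timed(Snappy.compress(sample))
    val (lz4Out, lz4Us)       = timed(lz4.compress(sample))
    Seq(
      Measurement("snappy", snappyOut.length.toDouble / sample.length, snappyUs),
      Measurement("lz4",    lz4Out.length.toDouble / sample.length,    lz4Us)
    )
  }

  // CPU-bound cluster: favour the faster codec.
  // Memory- or network-bound cluster: favour the smaller output.
  def choose(sample: Array[Byte], cpuBound: Boolean): String = {
    val results = measure(sample)
    (if (cpuBound) results.minBy(_.micros) else results.minBy(_.ratio)).codec
  }
}
```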
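For the garbage collection adaptive algorithm, the JVM already exposes the raw signal such a tuner needs: cumulative collection time per collector, readable through the standard GarbageCollectorMXBean API. The 10% threshold and the suggested flags below are illustrative assumptions, not the thesis's policy.

```scala
import java.lang.management.ManagementFactory
import scala.jdk.CollectionConverters._

// Sketch of the raw signal a GC tuner can collect: cumulative GC time across
// all collectors, via the standard GarbageCollectorMXBean API. The threshold
// and flag suggestions are assumptions for illustration only.
object GcMonitor {

  def totalGcMillis: Long =
    ManagementFactory.getGarbageCollectorMXBeans.asScala
      .map(_.getCollectionTime) // -1 means the collector does not report time
      .filter(_ >= 0)
      .sum

  // Fraction of a wall-clock window spent in garbage collection.
  def gcOverhead(windowMillis: Long, gcMillisInWindow: Long): Double =
    gcMillisInWindow.toDouble / windowMillis

  def suggestion(overhead: Double): String =
    if (overhead > 0.10)
      "high GC pressure: consider -XX:+UseG1GC or a larger young generation"
    else
      "GC overhead acceptable: keep current settings"
}
```

In use, a monitor would sample totalGcMillis at the start and end of each window, difference the two readings, and pass the result to gcOverhead.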
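Finally, the three-module structure of SATS (run-time data collection, adaptive decision, parameter optimization) maps naturally onto a small pipeline of interfaces. The skeleton below is a hypothetical rendering of that architecture; all type names are ours, as the thesis's source is not reproduced here.

```scala
// Hypothetical skeleton (all names ours) of the three-module pipeline the
// abstract describes: collect run-time metrics, decide on a tuning action,
// then apply the new parameters.
trait MetricsCollector   { def collect(): Map[String, Double] }
final case class Decision(params: Map[String, String])
trait DecisionModule     { def decide(metrics: Map[String, Double]): Decision }
trait ParameterOptimizer { def apply(decision: Decision): Unit }

final class SatsPipeline(
    collector: MetricsCollector,
    decider:   DecisionModule,
    optimizer: ParameterOptimizer
) {
  // One tuning round: metrics in, updated Spark parameters out.
  def runOnce(): Unit = optimizer.apply(decider.decide(collector.collect()))
}
```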
Keywords/Search Tags: big data, distributed computer system, Spark, in-memory computation, adaptive