
Research On Serialization Storage Mechanism Based On Spark Cluster

Posted on: 2018-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: F F Yang
Full Text: PDF
GTID: 2348330569986475
Subject: Computer technology
Abstract/Summary:
Distributed computing frameworks provide a computing platform for promoting big data applications across many fields. MapReduce offers an effective approach to distributed batch processing, but its heavy I/O access cost reduces data-processing efficiency and cannot meet the demands of efficient, real-time data processing. The Spark computing framework emerged to meet these needs. Spark is a fast, general-purpose parallel computing framework designed for large-scale data processing, characterized by in-memory computing, high fault tolerance, and high scalability. The framework handles data through RDDs (Resilient Distributed Datasets); because data processing is performed mainly through in-memory iteration, it can greatly improve computational efficiency.

Addressing the problems in Spark's RDD-based in-memory computing model, this thesis makes the following contributions:

1. To address the low computational efficiency caused by serialized storage during in-memory iterative computation, this thesis proposes a serialized storage strategy based on the type of operator, the size of the dataset, the efficiency of the RDD, and other factors. A normalized weighting model of RDDs is then established to obtain the set of RDDs for serialized storage, so that the most valuable RDDs are kept in memory when memory is scarce. Compared with Spark's default serialization storage mechanism, experimental results show that this strategy improves the computational efficiency of tasks and raises memory utilization.

2. Because a single node cannot store all the data, the cluster as a whole suffers low efficiency. This thesis therefore further proposes a global serialization strategy based on the design of Tachyon (a distributed cache system). A cache-layer interface for RDDs is designed so that, when memory is insufficient, RDDs are stored in Tachyon, maximizing in-memory RDD computation and enhancing the processing capacity of the computing framework. Experimental results verify the effectiveness of the global serialization storage strategy.

In conclusion, the serialized in-memory storage scheme is a key factor affecting Spark's processing capacity, and its optimization can effectively improve the overall performance of the Spark parallel computing framework.
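The abstract does not give the form of the normalized weighting model in contribution 1. A minimal illustrative sketch of such a scoring scheme, assuming hypothetical features (operator recomputation cost, dataset size, reuse count) and assumed weights that stand in for the thesis's actual factors, might look like:

```python
# Illustrative sketch of a normalized weighting model for choosing which
# RDDs to keep (serialized) in memory when memory is scarce. Feature
# names and weights are assumptions for illustration only; the thesis's
# actual model is not specified in the abstract.

def normalize(values):
    """Scale a list of raw feature values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_rdds(rdds, weights=(0.4, 0.3, 0.3)):
    """Score candidate RDDs by weighted, normalized features and return
    them sorted from most to least valuable to cache.

    Each candidate is (name, operator_cost, size_mb, reuse_count):
    costly-to-recompute, small, frequently reused RDDs score highest.
    """
    costs = normalize([r[1] for r in rdds])
    # Smaller datasets are cheaper to keep resident, so invert the size score.
    sizes = [1.0 - s for s in normalize([r[2] for r in rdds])]
    reuse = normalize([r[3] for r in rdds])
    w_cost, w_size, w_reuse = weights
    scored = [
        (r[0], w_cost * c + w_size * s + w_reuse * u)
        for r, c, s, u in zip(rdds, costs, sizes, reuse)
    ]
    return sorted(scored, key=lambda t: t[1], reverse=True)

candidates = [
    ("rdd_join",    8.0, 512, 5),  # expensive join, reused often
    ("rdd_map",     1.0, 128, 1),  # cheap map, reused once
    ("rdd_groupBy", 5.0, 900, 3),
]
ranking = rank_rdds(candidates)
print(ranking[0][0])  # → rdd_join
```

The highest-ranked RDDs would be retained in memory (e.g. serialized via Spark's `persist`), while the rest are evicted or, under the global strategy of contribution 2, spilled to a distributed cache layer such as Tachyon.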
Keywords/Search Tags: Spark, memory, RDD, operator, serialization storage