
Research On The In-Memory Data Management Technology On Spark Data Processing Framework

Posted on: 2017-09-27
Degree: Master
Type: Thesis
Country: China
Candidate: H H Wang
GTID: 2348330503992921
Subject: Computer technology
Abstract/Summary:
Spark, an in-memory computing framework, is a cutting-edge platform for massive data processing. Spark introduces the RDD (Resilient Distributed Dataset) as an abstraction of a massive distributed data set and stores and processes RDD data in memory, which speeds up the execution of big data applications and makes big data manipulation easier.

In-memory data space management is one of the essential features of Spark. Higher management efficiency means that more RDD data can be cached in memory, which in turn improves the service quality of the Spark platform. On the original Spark platform, in-memory data space management is based on the JVM heap memory of each task execution container, which is shared between cached RDD data and the temporary data produced during computation. In addition, the memory space allocated to each task execution container within a Spark application is set statically and symmetrically. When the memory requirements of the task execution containers fluctuate and are asymmetric, this management mode causes RDD data and temporary data to interfere with each other in storage, which leads to frequent RDD cache misses and heavy data recomputation overhead.

To solve this problem, we propose a shared memory-based distributed RDD data space management mechanism and a dynamic space allocation strategy built on it. The mechanism and strategy rest on two key points. The first is isolating the in-memory data space of RDD data from that of temporary data by providing a distributed shared memory space that accommodates the RDD data of all execution containers. The second is dynamically adjusting the division of data space between RDD data and temporary data. The proposed data space management balances the asymmetric memory requirements among execution containers and, at the same time, maximizes the probability that RDD data is cached in memory, so as to improve the performance of big data applications.

The main contributions of this thesis are as follows:
1) The data organization and management model of the shared memory-based RDD data space. A distributed cooperative shared memory organization model is introduced to maximize the use of the distributed memory space. A Master/Slave management model is adopted to enable RDD data sharing among Spark applications.
2) The data migration strategy and mechanism for the shared memory-based RDD data space. Considering the tight coupling of computation and data, the differing generation costs of cached RDD data, and the data sharing needs between applications, quantification models of the data migration cost and of the affinity of the migration destination node are designed. Based on these models, a data migration strategy is presented that aims to minimize data recomputation overhead and maximize the probability of data-locality-aware processing.
3) The dynamic in-memory space allocation strategy for RDD data and temporary data. The dynamic allocation is based on accurate prediction of the in-memory space requirement of the temporary data in each task execution container. The strategy first uses the autocorrelation function to detect periodicity in the memory space requirement of temporary data. For periodic requirements, the prediction exploits the similarity between periods; for non-periodic requirements, the prediction uses a discrete Markov chain.
4) A shared memory-based RDD data space dynamic management prototype system, called SMSpark, is proposed. SMSpark is built on Spark, which is an open source platform.

Experimental results with typical Spark workloads show that SMSpark outperforms Spark, improving cluster memory utilization by 44.16% on average and average application turnaround time by 19.89%.
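
The two-branch prediction outlined in contribution 3) can be illustrated with a minimal Python sketch. This is not the SMSpark implementation. The function names, the 0.5 autocorrelation threshold, the 256 MB bucket size, the "repeat the value one period ago" rule for the periodic case, and the first-order chain are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter, defaultdict
from typing import Hashable, Optional, Sequence


def autocorrelation(xs: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of xs at the given lag (biased estimator)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0  # a constant series carries no periodic signal
    cov = sum((xs[i] - mean) * (xs[i + lag] - mean) for i in range(n - lag))
    return cov / var


def detect_period(xs: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the lag with the highest autocorrelation above threshold, else None.

    The threshold is an illustrative assumption, not a value from the thesis.
    """
    best_lag, best_r = None, threshold
    for lag in range(2, len(xs) // 2 + 1):
        r = autocorrelation(xs, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag


def markov_predict(states: Sequence[Hashable]) -> Hashable:
    """Predict the next state with a first-order discrete Markov chain.

    Transition counts are estimated from the history; the most frequent
    successor of the current state is returned.
    """
    transitions = defaultdict(Counter)
    for cur, nxt in zip(states, states[1:]):
        transitions[cur][nxt] += 1
    last = states[-1]
    successors = transitions.get(last)
    if not successors:
        return last  # state never seen with a successor: assume no change
    return successors.most_common(1)[0][0]


def predict_next(usage_mb: Sequence[float], bucket_mb: int = 256) -> float:
    """Predict the next memory requirement (MB) of one execution container."""
    period = detect_period(usage_mb)
    if period is not None:
        # Periodic case (assumed rule): reuse the value one period earlier.
        return float(usage_mb[-period])
    # Non-periodic case: quantize usage into buckets and apply the Markov chain.
    states = [int(x // bucket_mb) for x in usage_mb]
    return float(markov_predict(states) * bucket_mb)
```

For a history such as `[1, 2, 4, 2] * 4`, the sketch detects a period of 4 and predicts that the next value repeats the sample four steps back. A non-periodic history falls through to the quantized Markov prediction, which returns the lower bound of the predicted bucket.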
Keywords/Search Tags: Big Data, Distributed in-memory computing framework, Spark, In-memory data space management, Dynamic space allocation