
Research On The Key Technology Of Spark Data Caching For Time-window-based Data Analysis

Posted on: 2020-11-16
Degree: Master
Type: Thesis
Country: China
Candidate: J D Chen
GTID: 2428330623456755
Subject: Engineering
Abstract/Summary:
Spark is a representative in-memory computing system for big data; it accelerates iterative and interactive big data applications by caching data in memory. Time-window-based data analysis is a typical big data application: a massive data set is processed locally, window by window, in the order in which the data was generated, and the local results are then aggregated globally into the final analysis result. Targeting the data access pattern of this class of applications, this thesis studies data caching in the Spark system. By designing and implementing a data cache programming interface, a time-window data prefetching mechanism, and a cache placement strategy for local results, it improves the data read efficiency of time-window-based data analysis applications on Spark and thereby speeds up their execution. The main contributions are as follows:

1) An RDD dynamic update mechanism for Spark time-window-based data analysis applications is proposed, and the programming interface is extended. Following the temporal order in which such applications process data, the mechanism modifies RDD generation so that RDD data is updated dynamically on a time-sharing basis, ensuring that the data of the current time window hits the cache. A corresponding extended programming interface lets users express the time-sharing processing requirements of time-window data, which reduces the difficulty of application development.

2) A pipeline-based prefetching mechanism for cached RDD data is proposed. The mechanism sets the timing and scale of RDD prefetching by pre-evaluating how much the result data produced by time-window processing will expand. It decides where to place cached data by jointly considering data locality and the remaining cache capacity (cache margin) of each task executor, which improves the application's cache hit rate and balances load across task executors.

3) An intermediate result data migration strategy for Spark time-window-based data analysis applications is proposed. Migration of local result data is triggered by the generation of prefetched data and of local result data. The set of data partitions to migrate is determined from the size of the prefetched data and the size of the generated local result data. A genetic algorithm then optimizes the migration plan, with minimum migration cost and the best match to computing power as objectives, so that memory space is fully used while the application executes.

4) Building on the above, the existing Spark is extended, and the TW-Spark system is designed and implemented. The proposed methods are tested and analyzed on real data sets. The evaluation results show that, compared with the existing Spark platform, the prefetching mechanism and the intermediate result data migration strategy reduce the execution time of Spark time-window-based data analysis applications by up to 95.34% and by 77.72% on average.
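
The processing pattern described in the abstract (local analysis per time window, in chronological order, followed by global aggregation of the local results) can be illustrated with a minimal sketch in plain Python. This is not TW-Spark or Spark code. The function names, the fixed window length, and the choice of a per-window sum and count as the "local analysis" are all assumptions made for illustration only.

```python
from collections import defaultdict


def window_index(timestamp, window_size):
    """Map a timestamp to the index of the fixed-length window containing it."""
    return timestamp // window_size


def analyze_by_window(records, window_size):
    """Hypothetical helper: process (timestamp, value) records window by window.

    Returns the per-window local results and one globally aggregated mean.
    """
    # Group records by the time window they fall into.
    windows = defaultdict(list)
    for timestamp, value in records:
        windows[window_index(timestamp, window_size)].append(value)

    # Local analysis: handle windows in chronological order.
    # The local result here is an assumed (sum, count) pair per window.
    local_results = {}
    for index in sorted(windows):
        values = windows[index]
        local_results[index] = (sum(values), len(values))

    # Global aggregation: combine the local results into the final result.
    total = sum(s for s, _ in local_results.values())
    count = sum(c for _, c in local_results.values())
    global_mean = (total / count) if count else None
    return local_results, global_mean


records = [(0, 1.0), (5, 3.0), (12, 5.0), (27, 7.0)]
local_results, global_mean = analyze_by_window(records, window_size=10)
```

In this sketch only the data of the window currently being processed is needed in memory at any one time, while the local results accumulate until the final aggregation. That access pattern is what the caching, prefetching, and local-result placement mechanisms described above are aimed at.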
Keywords/Search Tags:Spark, Time-Window-based data analysis, Data prefetching, Data migration