Research On The Key Technology Of Spark Data Caching For Time-window-based Data Analysis

Posted on:2020-11-16

Degree:Master

Type:Thesis

Country:China

Candidate:J D Chen

Full Text:PDF

GTID:2428330623456755

Subject:Engineering

Abstract/Summary:


Spark is a representative in-memory computing system for big data; it accelerates iterative and interactive big data applications by caching data in memory. Time-window-based data analysis is a typical big data application: a massive data set is analyzed locally, window by window, in the order in which the data was generated, and the local results are then aggregated globally to form the final analysis result. This thesis studies Spark data caching technology in light of the data access pattern of this kind of application. Through the design and implementation of a data cache programming interface, a time-window data prefetching mechanism, and a cache placement strategy for local results, it improves the data reading efficiency of time-window-based data analysis applications on Spark and thereby speeds up their execution. The main contributions are as follows:

1) An RDD dynamic update mechanism for Spark time-window-based data analysis applications is proposed, and the programming interface is extended. Following the temporal order in which such applications process data, the mechanism modifies RDD generation so that RDD data is updated dynamically on a time-shared basis, which ensures cache hits for the data of the current time window. A corresponding extended programming interface lets users express the time-shared processing requirements of time-window data, reducing the difficulty of application development.

2) A pipeline-based prefetching mechanism for cached RDD data is proposed. The mechanism decides when and how much RDD data to prefetch by estimating in advance how much the result data produced by time-window processing will expand. It also decides where to place cached data by jointly considering data locality and the cache margin (remaining cache space) of each task executor, which improves the application's cache hit rate and balances load among the task executors.

3) An intermediate result data migration strategy for Spark time-window-based data analysis applications is proposed. The strategy uses the generation of prefetched data and of local result data as the trigger for migrating local result data, determines the set of data partitions to migrate from the size of the prefetched data and the size of the generated local result data, and then uses a genetic algorithm to optimize the migration plan, aiming for minimum migration cost and the best match with computing power. This makes full use of memory space while the application executes.

4) Building on the above, the existing Spark is extended, and the TW-Spark system is designed and implemented. The proposed methods are tested and analyzed on real data sets. The performance evaluation shows that, compared with the existing Spark platform, the proposed prefetching mechanism and intermediate result data migration strategy reduce the execution time of Spark time-window-based data analysis applications by up to 95.34% and by 77.72% on average.
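The access pattern described in the abstract (local analysis per time window, one-window-ahead prefetching, then global aggregation) can be illustrated with a minimal Python sketch. This is not the TW-Spark implementation and does not use the Spark API. The function names, the fixed window size, the in-memory `load_window` stand-in, and the sum/count analysis are all assumptions made for illustration only.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_windows(records, window_size):
    """Group (timestamp, value) records into consecutive fixed-size time windows."""
    windows = defaultdict(list)
    for timestamp, value in records:
        windows[timestamp // window_size].append(value)
    # Return the windows in time order, as the abstract describes.
    return [windows[key] for key in sorted(windows)]


def load_window(window):
    """Hypothetical stand-in for reading one window's data into the cache."""
    return list(window)


def analyze_window(values):
    """Local analysis of one window; here simply a (sum, count) pair."""
    return sum(values), len(values)


def run_pipeline(records, window_size):
    """Process windows in order while prefetching the next window in the background."""
    windows = split_into_windows(records, window_size)
    local_results = []
    if not windows:
        return local_results, None

    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_window, windows[0])
        for index in range(len(windows)):
            cached = pending.result()  # wait until the current window is loaded
            if index + 1 < len(windows):
                # Start loading the next window before analyzing the current one.
                pending = pool.submit(load_window, windows[index + 1])
            local_results.append(analyze_window(cached))

    # Global aggregation of the local results: overall mean of all values.
    total = sum(window_sum for window_sum, _ in local_results)
    count = sum(window_count for _, window_count in local_results)
    return local_results, total / count
```

Calling `run_pipeline([(0, 1.0), (1, 3.0), (5, 2.0), (6, 4.0), (12, 10.0)], 5)` splits the records into three windows and returns the per-window `(sum, count)` pairs together with the overall mean. The sketch always prefetches exactly one window ahead; the thesis instead sizes and times prefetching from an estimate of result-data growth, which is not modeled here.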

Keywords/Search Tags:

Spark, Time-Window-based data analysis, Data prefetching, Data migration


Related items

1	The Research Of Big Data Manipulating Technology Based On Spark
2	Application Research Of Real-time Data Analysis Based On Spark Computing
3	Data Migration Technology Applied Research In The Crbt System
4	Research And Implementation Of Data Acquisition And Data Migration Technology Based On MES
5	Application Migration And Big Data System Deployment On Cloud
6	The Design And Implementation Of Network Data Analysis System Based On Spark Platform
7	Design And Implementation Of Tobacco Big Data Analysis System Based On Spark
8	Design And Implementation Of Data Real-time Analysis And Processing System Based On Spark
9	Design And Implementation Of Telecom 4G Big Data Platform For Network Optimization Based On Spark
10	Effective compile-time analysis for data prefetching in Java