Research And Application Of Distributed ETL Based On Spark

Posted on:2018-07-07

Degree:Master

Type:Thesis

Country:China

Candidate:S L Xie

Full Text:PDF

GTID:2348330536452503

Subject:Computer Science and Technology

Abstract/Summary:

In the era of big data, more and more data must be processed and put to use. For enterprises, data have become the basis of survival, and whether an enterprise can make good use of its data is crucial to its future development. Data warehouse technology provides an effective way for enterprises to analyze massive data. In building a data warehouse, ETL is always the most time-consuming and complicated phase of the whole process. As the amount of data to be processed grows, higher performance is demanded of ETL technology, which poses a greater challenge. To handle massive data, the ETL process must be carried out with distributed parallel technologies.

At present, distributed ETL solutions based on the MapReduce paradigm can process massive data efficiently. However, the MapReduce programming model offers only two kinds of processing operations, Map and Reduce, and incurs high I/O overhead in multi-step processing. As a result, it has performance problems in the transformation stage of ETL, and there is considerable room for optimization in processing efficiency and processing speed.

To address the excessive volume of big data and the limitations of MapReduce-based distributed ETL, this paper draws on data warehouse theory and distributed processing technology to study distributed parallel ETL based on Spark, and proposes a distributed ETL solution. The solution focuses on the parallel implementation of transformation processing and puts forward methods suited to the different transformation types. For the early non-aggregation operations, such as data cleaning and data format standardization, a partition-based parallel pipeline processing algorithm is presented, which processes data in units of partitions and thereby improves the efficiency of the data transformation process. For aggregation operations, such as aggregating numerical data in the fact table, a pre-aggregation algorithm is provided to reduce the frequency of data transmission during aggregation. The experiments conducted show that the proposed methods can significantly accelerate the transformation of massive data and improve the performance and efficiency of distributed ETL.

This paper then studies performance optimization of the Spark-based data processing flow. The common data skew problems in Spark-based data processing are analyzed in detail, and corresponding parallel optimization strategies are given for the different data skew conditions. The experiments show the effectiveness of these optimization strategies. Finally, based on the development of a real decision support system, this paper describes the design and application of Spark-based distributed ETL, including a comparative analysis with the traditional ETL development scheme. The analysis results show the effectiveness and high scalability of the Spark-based distributed ETL scheme proposed in this paper.
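The two transformation ideas named in the abstract, partition-wise pipelining of non-aggregation steps and pre-aggregation before data is transmitted, can be illustrated with a minimal sketch in plain Python. This is not the thesis's implementation and it does not use Spark: the sample records, the function names, and the modeling of partitions as in-memory lists are all assumptions made for illustration. In Spark, the corresponding building blocks would most likely be `mapPartitions` for the per-partition pipeline and the map-side combine performed by `reduceByKey` or `aggregateByKey`.

```python
from collections import defaultdict

# Hypothetical sample data: each inner list stands in for one partition of raw
# (region, amount-as-text) records. None of these values come from the thesis.
partitions = [
    [(" north ", "10.5"), ("SOUTH", "3"), ("north", "bad"), ("North", "4.5")],
    [("south ", "7"), ("north", "1"), ("", "2"), ("South", "5")],
]

def clean(records):
    """Cleaning step: drop records with an empty key or a non-numeric amount."""
    for region, amount in records:
        try:
            value = float(amount)
        except ValueError:
            continue
        if region.strip():
            yield region, value

def standardize(records):
    """Format standardization step: trim and lowercase the key."""
    for region, value in records:
        yield region.strip().lower(), value

def transform_partition(records):
    """Chain the non-aggregation steps lazily, so each record of a partition
    passes through every step in a single pass with no intermediate dataset."""
    return standardize(clean(records))

def pre_aggregate(records):
    """Combine values per key inside one partition before anything is sent on."""
    local = defaultdict(float)
    for key, value in records:
        local[key] += value
    return list(local.items())

def merge(shuffled):
    """Final aggregation over the records that crossed partition boundaries."""
    totals = defaultdict(float)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals)

# Without pre-aggregation, every cleaned record is transmitted to the final step.
naive_shuffle = [rec for part in partitions for rec in transform_partition(part)]

# With pre-aggregation, each partition transmits at most one record per key.
combined_shuffle = [
    rec for part in partitions for rec in pre_aggregate(transform_partition(part))
]

naive_totals = merge(naive_shuffle)
combined_totals = merge(combined_shuffle)

print(len(naive_shuffle), len(combined_shuffle))
print(combined_totals)
```

Both paths yield the same totals, but the pre-aggregated path hands fewer records to the final merge step. The saving grows with the number of records per key in each partition, which is the case for numerical aggregation over a fact table.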

Keywords/Search Tags:

big data, distributed ETL, Spark, transformation processing, performance optimization

Related items

1	Research And Implementation Of Spark Performance Optimization For Police Data Processing
2	The Implementation Of Remote-Memory Management System And Performance Optimization In Spark
3	Real-time Mass Data Processing Analysis And Optimization Based On Spark
4	The Performances Of Distributed Big Data Processing Modes In High-speed Traffic Network
5	Design And Implementation Of Telecom 4G Big Data Platform For Network Optimization Based On Spark
6	Structured Data Processing And Performance Optimization Of Spark SQL
7	A System For Distributed MD Data Analysis Based On Spark
8	Performance Analysis And Optimization Of Cloud Based Big Data Processing Platforms
9	Bigdata Job Performance Prediction Based On Apache Spark
10	Design And Implementation Of A Distributed And Real Time Video Stream Data Processing Platform Based On Spark