| In the era of big data, ever more data must be processed and used. For enterprises, data have become a basis for survival, and whether an enterprise can make good use of its data is crucial to its future development. Data warehouse technology provides an effective way for enterprises to analyze massive data. In building a data warehouse, ETL is always the most time-consuming and complicated phase of the whole process. As the amount of data to be processed grows, higher performance is demanded of ETL technology, which poses a greater challenge. To handle massive data, the ETL process must be carried out with distributed parallel technologies. At present, distributed ETL solutions based on the MapReduce paradigm can process massive data efficiently. However, the MapReduce programming model offers only two kinds of processing operations, Map and Reduce, and incurs high I/O overhead in multi-step processing. These limitations cause performance problems in the transformation stage of ETL and leave considerable room for optimization in processing efficiency and speed. To address the sheer volume of big data and the limitations of MapReduce-based distributed ETL, this paper draws on data warehouse theory and distributed processing technology to study distributed parallel ETL based on Spark and proposes a distributed ETL solution. The solution focuses on the parallel implementation of the transformation stage and puts forward a method suited to each type of transformation. For early non-aggregation operations, such as data cleaning and data format standardization, a partition-based parallel pipeline processing algorithm is presented that processes data in units of partitions, thereby improving the efficiency of the transformation process. For aggregation operations, such as aggregating numerical data in the fact table, a pre-aggregation algorithm is provided to reduce the frequency of data transmission during aggregation. Experiments show that the proposed methods can significantly accelerate the transformation of massive data and improve the performance and efficiency of distributed ETL. This paper then studies performance optimization of the Spark-based data processing flow. The common data skew problems in Spark-based data processing are analyzed in detail, and a corresponding parallel optimization strategy is given for each data skew condition. Experiments show that these optimization strategies are effective. Finally, based on the development of a real decision support system, this paper describes the design and application of Spark-based distributed ETL, including a comparative analysis with the traditional ETL development scheme. The analysis results show the effectiveness and high scalability of the Spark-based distributed ETL scheme proposed in this paper. |
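
The two transformation ideas named in the abstract, partition-wise pipelining of non-aggregation steps and pre-aggregation before data is exchanged, can be illustrated with a minimal sketch. It is plain Python that simulates partitions as lists and does not reproduce the paper's algorithms. The record layout (a region key and a raw amount string) and every function name here are assumptions made for illustration. In Spark itself, the per-partition step would roughly correspond to `RDD.mapPartitions`, and the local combine to the map-side combining that `RDD.reduceByKey` performs.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

RawRecord = Tuple[str, str]      # hypothetical layout: (region, amount as text)
CleanRecord = Tuple[str, float]  # (standardized region, numeric amount)


def clean(records: Iterable[RawRecord]) -> Iterator[RawRecord]:
    """Data cleaning: drop records with a blank key or a blank amount."""
    for key, amount in records:
        if key.strip() and amount.strip():
            yield key, amount


def standardize(records: Iterable[RawRecord]) -> Iterator[CleanRecord]:
    """Format standardization: trim and upper-case keys, parse amounts."""
    for key, amount in records:
        yield key.strip().upper(), float(amount)


def pipeline_partition(partition: Iterable[RawRecord]) -> Iterator[CleanRecord]:
    """Chain the non-aggregation steps so one partition is processed in a
    single pass, with no intermediate collection between steps."""
    return standardize(clean(partition))


def pre_aggregate(partition: Iterable[CleanRecord]) -> Dict[str, float]:
    """Sum values per key inside one partition, before anything is exchanged."""
    local: Dict[str, float] = defaultdict(float)
    for key, value in partition:
        local[key] += value
    return dict(local)


def merge(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Combine the per-partition partial sums into the final totals."""
    total: Dict[str, float] = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Two simulated partitions of made-up raw fact-table rows.
partitions: List[List[RawRecord]] = [
    [(" east ", "10"), ("west", "5"), ("east", "2.5"), ("", "7")],
    [("West", "1"), ("east", " "), ("EAST", "4")],
]

partials = [pre_aggregate(pipeline_partition(p)) for p in partitions]
totals = merge(partials)

# Records that would cross partition boundaries without and with pre-aggregation.
records_without = sum(1 for p in partitions for _ in pipeline_partition(p))
records_with = sum(len(partial) for partial in partials)
```

With pre-aggregation, each partition forwards at most one record per distinct key, so the saving grows with the number of rows that share a key within a partition.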