
Research And Optimization Of Distributed ETL Based On MapReduce

Posted on: 2018-06-18
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Zhang
Full Text: PDF
GTID: 2348330536452518
Subject: Software engineering
Abstract/Summary:
With the continuous development of information technology, the volume of data generated in modern society keeps growing. To respond to changing business conditions in the era of big data, many enterprises must process and analyze their business data and extract commercially valuable information from it, so that decision-makers can make business decisions quickly and correctly. In a data warehouse, the ETL process accounts for more than 70% of the workload, so research on and optimization of ETL is of considerable significance. With the explosive growth of data, traditional ETL faces serious challenges: it is expensive, and its centralized architecture yields low processing efficiency. To process massive data more efficiently, distributed ETL based on the MapReduce framework has been widely adopted in enterprises. MapReduce offers high scalability and high fault tolerance, processes data in parallel, and is easy to operate, which can greatly improve the efficiency of data processing. ETL mainly processes fact data and dimension data, and dimension data is an important part of data analysis. Among dimension data, gradient dimension data (that is, slowly changing dimensions) is the more complex to handle. Drawing on big data distributed ETL processing technology and MapReduce job scheduling algorithms, this thesis focuses on optimizing the parallel processing of dimension data in distributed ETL and on optimizing MapReduce job scheduling. The main research contents are as follows:

(1) The thesis reviews distributed ETL technology and the development of MapReduce job scheduling algorithms, and explains the significance of the study. It analyzes in depth the MapReduce programming model, ETL technology, data partitioning strategies, update strategies for gradient dimension data, and MapReduce job scheduling algorithms.

(2) The thesis studies the parallel processing of gradient dimension data. The MOAP algorithm is proposed to reduce the heavy network transmission overhead incurred during map/reduce processing. It runs reduce tasks locally by placing data with the same business key on the same compute node. To facilitate subsequent incremental loads, the processed data remains placed according to its business key. The experiments use Type-1 and Type-2 gradient dimension data and compare, for both initial loading and incremental loading, the time needed to process each type with and without the MOAP algorithm. The results show that with the algorithm the processing time for both types of gradient dimension data is clearly reduced, and ETL efficiency improves because network transmission overhead is lower.

(3) The thesis studies scheduling algorithms for reduce tasks. The SBOTA algorithm is proposed to address reduce task starvation and low resource utilization. It is embedded in the existing fair scheduler and schedules big tasks and small tasks alternately. Experiments comparing the fair scheduling algorithm with the SBOTA scheduling algorithm show that SBOTA not only reduces the processing time of both big and small tasks but also improves the utilization of CPU and memory resources.

(4) The thesis applies distributed ETL based on MapReduce. To address the shortcomings of an existing data analysis and decision system, a new data analysis and decision system is designed, and its system architecture, fact tables, and dimension tables are analyzed. Finally, the proposed algorithms are applied to the new system to accelerate intermediate data processing.
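
The co-location idea in item (2) and the two dimension update types can be illustrated with a small single-process sketch. This is not the MOAP algorithm itself, whose details the abstract does not give. It only assumes that rows are routed to a node by a deterministic hash of the business key, so that later incremental loads for the same key land on the node that already holds its history. The names `DimRow`, `partition_for`, `apply_type1`, `apply_type2`, and `load` are hypothetical, and integer batch numbers stand in for effective dates.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimRow:
    business_key: str
    attribute: str
    valid_from: int                 # load-batch number (stand-in for a date)
    valid_to: Optional[int] = None  # None marks the current version


def partition_for(business_key: str, num_nodes: int) -> int:
    # Deterministic hash (assumption): the same key always maps to the
    # same node, for the initial load and for every incremental load.
    return sum(business_key.encode("utf-8")) % num_nodes


def apply_type1(history, key, attribute, batch):
    # Type-1: overwrite the attribute in place; no history is kept.
    if history:
        history[-1].attribute = attribute
    else:
        history.append(DimRow(key, attribute, batch))


def apply_type2(history, key, attribute, batch):
    # Type-2: when the attribute changes, close the current version
    # and append a new current version.
    if history and history[-1].attribute == attribute:
        return
    if history:
        history[-1].valid_to = batch
    history.append(DimRow(key, attribute, batch))


def load(nodes, records, batch, apply):
    # nodes: one dict per simulated node, mapping business key -> versions.
    # Each record is handled entirely on the node that owns its key, so no
    # row history has to move between nodes during the update.
    for key, attribute in records:
        node = nodes[partition_for(key, len(nodes))]
        apply(node[key], key, attribute, batch)


# Usage: an initial load followed by an incremental load.
nodes = [defaultdict(list) for _ in range(3)]
load(nodes, [("c1", "Beijing"), ("c2", "Shanghai")], 1, apply_type2)
load(nodes, [("c1", "Shenzhen")], 2, apply_type2)
```

After the second load, the node that owns `c1` holds two versions of that row under Type-2 handling, whereas the same sequence with `apply_type1` would leave a single overwritten row.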
Keywords/Search Tags: MapReduce, Distributed ETL, Gradient dimension, Job scheduling