Research And Implementation Of Parallel ETL Tools’ Extensible Technology

Posted on:2015-10-17

Degree:Master

Type:Thesis

Country:China

Candidate:J Huang

Full Text:PDF

GTID:2298330467963112

Subject:Computer Science and Technology

Abstract/Summary:

ETL tools are the foundation of data mining and online analytical processing: they extract data from distributed, heterogeneous data sources and, after cleaning and transformation, load the results into a data mart or data warehouse. ETL tools usually provide a set of basic operations, such as association (join) and summarization. However, because application scenarios are diverse and operation logic is complex, these common operations often cannot satisfy users' needs, so ETL tools must be extensible enough to meet specialized requirements. At the same time, in the era of big data, ETL tools handle huge volumes of data by integrating cloud computing technology. Traditional ETL tools compensate for their limited large-scale data processing capability by integrating parallel ETL tools such as Hive and Pig, but commercial tools are expensive and open-source tools are not integrated well enough. How to better integrate Hive and Pig in order to extend functionality is therefore an important problem. In addition, an ETL workflow is a logical plan that must be optimized according to a series of optimization rules while it is parsed into a physical plan. Because the optimization rules are not fixed and new rules are discovered as the ETL tool is used, the optimization rules themselves must be highly extensible. In this paper, we propose a parallel ETL system based on Hadoop and the B/S (browser/server) mode and study how to extend it.
The main work of this paper is as follows. First, by analyzing the implementation details of the MapReduce parallel computing framework, we design and implement two solutions that extend the existing tool to handle complex requirements by embedding custom MapReduce code in it. Second, based on an analysis and summary of the grammar of Hive and Pig scripts, combined with actual application requirements, we select a set of basic operations and design functional components for them. By analyzing the dependencies between these operations, we then design and implement a workflow parsing module, which parses a workflow into a script with the same logic as a manually written one. This way of integrating Hive and Pig extends the functionality of the parallel ETL tool while ensuring that the system still provides a unified graphical user interface. Third, by analyzing how Hive and Pig implement their optimization mechanisms, we design and implement our own. A rule is defined as a matching pattern together with the corresponding operation, and the mechanism that matches rules and walks the plan is isolated and abstracted. With this design, optimization rules can be extended easily.
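
The rule design described above, a matching pattern paired with an operation and kept separate from the code that walks the plan, might look roughly like the following Python sketch. It is only an illustration under stated assumptions, not the thesis's implementation: the `Node`, `Rule`, and `walk` names, the plan structure, and the filter-merging rule are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One operator in a logical plan (hypothetical structure)."""
    op: str
    args: dict = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


@dataclass
class Rule:
    """A rule pairs a matching pattern with the operation applied on a match."""
    name: str
    pattern: Callable[[Node], bool]
    action: Callable[[Node], Node]


def walk(node: Node, rules: List[Rule]) -> Node:
    """Generic bottom-up plan walker; it knows nothing about specific rules."""
    node.children = [walk(child, rules) for child in node.children]
    changed = True
    while changed:  # re-check the rules until none of them matches this node
        changed = False
        for rule in rules:
            if rule.pattern(node):
                node = rule.action(node)
                changed = True
    return node


# Hypothetical example rule: merge two directly stacked filters into one.
def is_stacked_filter(n: Node) -> bool:
    return (n.op == "filter" and len(n.children) == 1
            and n.children[0].op == "filter")


def merge_filters(n: Node) -> Node:
    child = n.children[0]
    conds = child.args["conds"] + n.args["conds"]
    return Node("filter", {"conds": conds}, child.children)


RULES = [Rule("merge_filters", is_stacked_filter, merge_filters)]

# A tiny plan: filter(age) over filter(country) over load(users).
plan = Node("filter", {"conds": ["age > 18"]}, [
    Node("filter", {"conds": ["country = 'CN'"]}, [
        Node("load", {"table": "users"}),
    ]),
])
optimized = walk(plan, RULES)
```

In this sketch, adding a new optimization only means appending another `Rule` to `RULES`; the `walk` function stays unchanged, which is the kind of extensibility the abstract is aiming for. Each rule's action is assumed to shrink or simplify the plan so that the loop terminates.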

Keywords/Search Tags:

ETL, extensibility, MapReduce, Hive, optimization rule

Related items

1	Innovative Research On Hive Mind With Internet
2	The Research And Implementation Of MapReduce-based Distributed Rule Matching System
3	Design And Implementation Of Hive-based Purchase And Sale Data Warehouse System
4	The Research And Practice Of Performance Optimization Based On Hive
5	Research On Hive Query Optimization Base On Parquet Format
6	An Implementation Of Rule Engine Suitable For Salary Computing
7	The Implementation And Optimization Of Log Analysis System Based On Hive
8	Research And Optimization Of Distributed Spatial Database Based On Hive
9	Research On The Related Issues Of Network Service Extensibility
10	Design And Implementation Of An Alert Correlation System Based On Rule Engine