
The Research Of Optimizing ETL Execution Process

Posted on: 2007-01-07  Degree: Master  Type: Thesis
Country: China  Candidate: Y H Wu  Full Text: PDF
GTID: 2178360212965614  Subject: Computer application technology
Abstract/Summary:
ETL (extract, transform, load) tools are responsible for loading data into a data warehouse and maintaining it. Because the volume of data to be processed is very large, shortening the execution time efficiently is a major challenge. So far, leading commercial tools allow the design of ETL workflows but do not apply any optimization technique of their own: the designed workflows are passed to the DBMS for execution, so the DBMS undertakes the task of optimization. An ETL process, however, cannot be treated as one big query. Its activities are interrelated, so the workflow must be optimized as a whole.
In this paper, we study the logical optimization of ETL processes. First, we give a formal definition of the constituents of an ETL workflow. Then we define a set of transitions that can be applied to states, and we detail how states are generated, the conditions under which transitions are allowed, and the rules for determining whether two workflows are equivalent. This sets up the theoretical framework: the problem is modeled as a state-space search problem in which each state is a graph representing a particular design of the workflow, equivalent workflows are produced by state transitions, the state space is built from a set of correct state transitions, and the best state is the one that minimizes the execution cost of the ETL workflow.
We then provide algorithms for minimizing the execution cost of an ETL workflow. First, we use an exhaustive algorithm that explores the entire search space to find the optimal ETL workflow. Then we introduce greedy and heuristic search algorithms that reduce the part of the search space explored, and we demonstrate the efficiency of the approach through a set of experimental results. Finally, we describe how to implement the optimizer's statistics-based cost model: we provide cost estimation measures for the different activities, the statistics those computations need, and the script that collects them. We also simplify the management of the statistics data by exploiting the characteristics of the predicate attributes of activities, which yields a shortcut method for managing the statistics.
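The state-space formulation summarized above can be illustrated with a minimal sketch. It is not the thesis's own model or algorithms: it assumes a linear chain of activities instead of a general workflow graph, a single "swap adjacent activities" transition, a simple rows-times-unit-cost estimate, and a swap condition based on hypothetical read/write attribute sets. All names (`Activity`, `can_swap`, `exhaustive_search`, `greedy_search`) and all figures are invented for the illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    cost_per_row: float      # assumed unit cost of processing one input row
    selectivity: float       # assumed fraction of rows passed downstream
    reads: frozenset         # attributes the activity inspects
    writes: frozenset        # attributes the activity creates or changes


def cost(workflow, input_rows):
    """Estimated cost of a state: each activity pays for the rows reaching it."""
    total, rows = 0.0, float(input_rows)
    for act in workflow:
        total += rows * act.cost_per_row
        rows *= act.selectivity
    return total


def can_swap(a, b):
    """Assumed transition condition: neither activity touches what the other writes."""
    return not (a.writes & b.reads or b.writes & a.reads or a.writes & b.writes)


def neighbors(workflow):
    """All states reachable by one allowed swap of adjacent activities."""
    for i in range(len(workflow) - 1):
        a, b = workflow[i], workflow[i + 1]
        if can_swap(a, b):
            yield workflow[:i] + (b, a) + workflow[i + 2:]


def exhaustive_search(start, input_rows):
    """Visit every reachable state and keep the cheapest one."""
    seen, queue, best = {start}, deque([start]), start
    while queue:
        state = queue.popleft()
        if cost(state, input_rows) < cost(best, input_rows):
            best = state
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return best, len(seen)


def greedy_search(start, input_rows):
    """Repeatedly take the cheapest neighbor; stop when no neighbor is cheaper."""
    current = start
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current
        best = min(candidates, key=lambda w: cost(w, input_rows))
        if cost(best, input_rows) >= cost(current, input_rows):
            return current
        current = best


# Hypothetical three-activity workflow: a conversion followed by two filters.
to_euro = Activity("to_euro", 2.0, 1.0, frozenset({"amount"}), frozenset({"amount_eur"}))
valid_id = Activity("valid_id", 1.0, 0.5, frozenset({"id"}), frozenset())
big_amount = Activity("big_amount", 1.0, 0.25, frozenset({"amount_eur"}), frozenset())

start = (to_euro, valid_id, big_amount)
optimal, explored = exhaustive_search(start, 1000)
greedy = greedy_search(start, 1000)
print([a.name for a in optimal], cost(optimal, 1000), explored)
```

In this toy case the filter on `id` can move ahead of the conversion, while the filter on `amount_eur` cannot, because it reads the attribute the conversion writes. The greedy variant examines only the neighbors of its current state, which is the kind of search-space reduction the abstract refers to, at the price of possibly stopping at a local minimum on larger workflows.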
Keywords/Search Tags: ETL, workflow, optimization, cost model, statistics-based optimizer