
A Number of Key Technologies of ETL

Posted on: 2007-05-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X F Zhang
Full Text: PDF
GTID: 1118360212984512
Subject: Computer software and theory
Abstract/Summary:
A data warehouse contains a large collection of data integrated from multiple distributed, autonomous databases and other information sources. The Extraction-Transformation-Loading (ETL) process handles the homogenization, cleaning, and loading problems that arise when building a data warehouse system. ETL processes are very important for building data warehouses, and some papers report that ETL processes account for 55% of the total runtime cost of a data warehouse. Reducing the cost of building, maintaining, and running ETL processes therefore reduces the budget of the data warehouse as well.

When the data sources change, the ETL processes must change too. A good model for describing ETL processes should make them easy to change.

Designing an incremental ETL process costs more than designing a full ETL process. A data warehouse can be seen as a set of materialized views defined over the remote source data, and the ETL processes are used to maintain those materialized views. Using existing methods for the incremental maintenance of materialized views, we can derive an incremental ETL process from the full ETL process. In ETL processes, however, data cleaning must be used to improve the quality of the data, and the relational model cannot describe data cleaning.

After errors are removed, the cleaned data should also replace the dirty data in the original sources, so that legacy applications receive the improved data too and the cleaning work does not have to be redone for future data extractions. This process is called the backflow of cleaned data. It is the last step of the ETL process, so reducing the cost of this step reduces the cost of running the ETL process.

The main contributions and research results of this thesis are the following:

1. An ETL expression logic is introduced to describe ETL processes. The logic covers data extraction, transformation, and loading as well as data cleaning. In the model, an ETL process tree describes the ETL process. The tree has two types of node: transform nodes and relation nodes. Because a data cleaning rule is always expressed as a restriction on a relation node, the rule can be changed without changing the ETL process tree. This reduces the cost of maintaining the ETL process.

2. Existing research focuses on the incremental maintenance of materialized views that involve the selection, projection, join, and aggregation operators, but excludes the difference operator. Since difference operators are used frequently in ETL processes, we discuss in detail the incremental maintenance of materialized views defined with difference operators.

3. Drawing on existing methods for the incremental maintenance of materialized views, we put forward an approach that automatically generates an incremental ETL process from the full ETL process described by an ETL tree.

4. An algorithm named DPTI is given to find the dirty data in the increment of the original sources. Because the dirty data found by the incremental ETL process lies only in the increment, the algorithm accesses only the increment of the original sources. It is therefore more efficient than data lineage tracing algorithms, which access the full data of the original sources.
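To illustrate the idea behind contribution 2, the following Python sketch incrementally refreshes a materialized view defined as the difference R - S. It is not the method from the thesis, which the abstract does not spell out; it is a minimal sketch under set semantics, and all names (`apply_delta`, `refresh_difference_view`, and the parameters) are hypothetical. It assumes net increments: inserted tuples were absent before, deleted tuples were present before, and no tuple is both inserted into and deleted from the same relation.

```python
# Illustrative sketch only: the names below are hypothetical, not from the thesis.

def apply_delta(relation, inserted, deleted):
    """Return the new state of a source relation after applying its increment."""
    return (relation - deleted) | inserted


def refresh_difference_view(view, r_new, s_new, r_ins, r_del, s_ins, s_del):
    """Incrementally refresh view = R - S under set semantics.

    Assumes net increments: inserted tuples were absent before, deleted
    tuples were present before, and no tuple is both inserted into and
    deleted from the same relation.
    """
    # Tuples leave the view when they are removed from R or newly appear in S.
    leaving = (r_del | s_ins) & view
    # Tuples may enter the view when they are added to R or removed from S,
    # but only if they are in the new R and absent from the new S.
    candidates = r_ins | s_del
    entering = {t for t in candidates if t in r_new and t not in s_new}
    return (view - leaving) | entering


# Usage: compare the incremental refresh with a full recomputation.
r = {1, 2, 3, 4}
s = {3, 4, 5}
view = r - s                      # {1, 2}
r_ins, r_del = {6, 7}, {1}
s_ins, s_del = {2, 7}, {4}
r_new = apply_delta(r, r_ins, r_del)
s_new = apply_delta(s, s_ins, s_del)
refreshed = refresh_difference_view(view, r_new, s_new, r_ins, r_del, s_ins, s_del)
assert refreshed == r_new - s_new
print(sorted(refreshed))          # [4, 6]
```

The sketch only inspects the changed tuples, but it still needs membership lookups against the new states of R and S. A difference view generally cannot be maintained from the increments alone, which is one reason the operator needs separate treatment.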
Keywords/Search Tags:ETL, data warehouse, incremental maintenance, materialized views, self-maintenance, data lineage tracing