The Research On Optimization Of ETL Process And Incremental Data Extraction

Posted on: 2012-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: Q Shu
Full Text: PDF
GTID: 2248330395985447
Subject: Computer technology
Abstract/Summary:
As a key component of the data warehouse, ETL (Extraction, Transformation, Loading) runs through the warehouse's entire life cycle and is responsible for integrating data. This thesis studies and optimizes two aspects of ETL.

The first aspect is data quality. The quality of the data that ETL delivers directly affects the analysis performed by the decision support systems built on top of it: only correct data can produce correct results, so ensuring ETL data quality is essential. By analyzing trigger-based incremental data integration, this thesis finds that, in the traditional method, concurrent operations on the snap table can invalidate its records, so that they no longer correctly reflect the update history of the data. The data at the destination then differs from the data at the source, which degrades ETL data quality. Borrowing the concept of synchronization from operating systems, this thesis synchronizes the concurrent operations by adding a lock field to the snap table. As a result, the records in the snap table are guaranteed to be correct, and the data quality of ETL under trigger-based incremental data integration improves.

The second aspect is efficiency and agility. A data warehouse needs ETL to be efficient in order to guarantee the timeliness of its data, and ETL must also be agile in both applicability and development in order to satisfy complicated and changing business requirements. After studying SEDA (Staged Event-Driven Architecture) and the ETL process in depth, this thesis builds the ETL process on SEDA. First, compared with multithreaded ETL, SEDA ETL is shown to be equally efficient while also providing mechanisms for fine-grained tuning, data priority support, and load control, which greatly improves the applicability of ETL. Second, SEDA ETL uses the stage, rather than the whole ETL process, as its basic execution unit. The modularity of stages makes complicated resource management transparent, so developers doing secondary development only need to concentrate on business logic. At the same time, the reusability of stages provides a basis for constructing ETL applications rapidly, which shortens development time and reduces development cost.
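The idea of using a stage as the basic execution unit can be illustrated with a minimal sketch. It is not the thesis's implementation: the `Stage` class, the `run_pipeline` function, the three handlers, and the sample row layout are all hypothetical names chosen for illustration. The sketch assumes only the general SEDA pattern, in which each stage owns an inbound event queue drained by its own small thread pool, and the handler holds the business logic.

```python
import queue
import threading

_STOP = object()  # sentinel telling a stage's worker threads to shut down


class Stage:
    """Hypothetical SEDA-style stage: a bounded event queue plus a thread pool."""

    def __init__(self, name, handler, workers=2, capacity=100):
        self.name = name
        self.handler = handler  # business logic: event -> event, or None to drop
        self.inbox = queue.Queue(maxsize=capacity)  # bounded queue gives back-pressure
        self.next_stage = None
        self._threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(workers)
        ]

    def start(self):
        for t in self._threads:
            t.start()

    def _run(self):
        while True:
            event = self.inbox.get()
            if event is _STOP:
                break
            result = self.handler(event)
            if result is not None and self.next_stage is not None:
                self.next_stage.inbox.put(result)

    def stop(self):
        # One sentinel per worker; queued events are handled first (FIFO).
        for _ in self._threads:
            self.inbox.put(_STOP)
        for t in self._threads:
            t.join()


def run_pipeline(rows):
    """Push rows through extract -> transform -> load stages and return loaded rows."""
    loaded = []
    loaded_lock = threading.Lock()

    def extract(row):
        # Drop rows without a key; pass the rest on unchanged.
        return row if row.get("id") is not None else None

    def transform(row):
        return {"id": row["id"], "name": row["name"].strip().upper()}

    def load(row):
        with loaded_lock:
            loaded.append(row)
        return None

    stages = [
        Stage("extract", extract),
        Stage("transform", transform),
        Stage("load", load, workers=1),
    ]
    for upstream, downstream in zip(stages, stages[1:]):
        upstream.next_stage = downstream
    for stage in stages:
        stage.start()
    for row in rows:
        stages[0].inbox.put(row)
    # Stop stages in pipeline order so each one drains before the next shuts down.
    for stage in stages:
        stage.stop()
    return sorted(loaded, key=lambda r: r["id"])
```

In this sketch the per-stage `workers` and `capacity` parameters stand in for the kind of per-stage tuning and load control the abstract mentions, and a handler can be reused in another pipeline without touching the queue or thread management.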
Keywords/Search Tags: Extraction Transformation Loading, Staged Event-Driven Architecture, Incremental Data Extraction