Study On The ETL Technology In Distributed Data Warehouse

Posted on: 2010-07-17
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2178360272985314
Subject: Computer application technology
Abstract/Summary:
As enterprises grow, their operations spread across many regions, and many adopt a distributed management structure. In addition, for historical, geographical, and economic reasons, an enterprise often runs many autonomous, mutually incompatible transaction processing systems. The data distributed across these systems must be integrated to give decision-makers a unified view, and the distributed data warehouse emerged to meet this need. Extracting, transforming, and loading (ETL) data is an important step in building a data warehouse, so ETL technology has long been a focus of distributed data warehouse research.

Firstly, the concepts of the data warehouse and ETL are introduced, and the importance of ETL in building a data warehouse is pointed out. The architecture, classification, and development strategies of distributed data warehouses are discussed, with emphasis on the contrast between traditional ETL and distributed ETL. The primary problems in distributed ETL are identified: maintaining data consistency and achieving efficient data transformation.

Secondly, in a distributed data warehouse every local data warehouse is an autonomous ETL node. Because data copies exist, distributed ETL has multiple load targets. If the traditional centralized ETL architecture is used in a distributed data warehouse, the data in the local data warehouses becomes inconsistent. After analyzing the deficiencies of the centralized ETL architecture, an improved distributed ETL architecture, ETLM, is proposed to better resolve the data consistency problem in distributed ETL.

Thirdly, a distributed data warehouse has many local ETL nodes, each of which processes a large volume of data. After transformation, the data must be loaded into multiple local data warehouses. In this situation, traditional data extraction and transformation techniques show many deficiencies, especially in response time and transformation efficiency. To meet the demands of OLAP and data mining (DM), a new optimization strategy for executing distributed ETL is proposed. The strategy combines data segmentation with load balancing, drawing on both ETL technology and distribution technology. It compensates for the low efficiency of distributed ETL execution and improves the overall efficiency of the distributed data warehouse.

Lastly, with the consistency maintenance problem in distributed ETL solved, the design scheme of a distributed ETL system is discussed, and the detailed design of the ETLM system and its main functional modules is described. To put the theory into practice, a simulation experiment was carried out on an application instance. The experiment shows that the ETLM method is feasible.
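
The abstract does not give the details of the optimization strategy, so the following is only a minimal sketch of the general idea of combining data segmentation with load balancing. It is not the thesis's ETLM algorithm. The function names, the hash-based partitioning, the greedy largest-segment-first assignment, and the sample rows and node names are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from zlib import crc32


def partition_rows(rows, key, num_segments):
    """Split rows into segments using a stable hash of one key column.

    Hypothetical helper: hash partitioning is an assumed segmentation scheme.
    """
    segments = defaultdict(list)
    for row in rows:
        seg_id = crc32(str(row[key]).encode("utf-8")) % num_segments
        segments[seg_id].append(row)
    return dict(segments)


def balance_segments(segments, node_names):
    """Assign segments to ETL nodes, largest segment first, each one going
    to the node with the smallest load so far (a greedy heuristic)."""
    heap = [(0, name) for name in node_names]  # (current load, node)
    heapq.heapify(heap)
    assignment = {name: [] for name in node_names}
    ordered = sorted(segments.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seg_id, seg_rows in ordered:
        load, name = heapq.heappop(heap)  # least-loaded node
        assignment[name].append(seg_id)
        heapq.heappush(heap, (load + len(seg_rows), name))
    return assignment


# Illustrative data only: 1000 synthetic rows, 8 segments, 3 made-up nodes.
rows = [{"order_id": i, "amount": i * 10} for i in range(1000)]
segments = partition_rows(rows, key="order_id", num_segments=8)
assignment = balance_segments(segments, ["node_a", "node_b", "node_c"])

for node, seg_ids in assignment.items():
    print(node, seg_ids, sum(len(segments[s]) for s in seg_ids))
```

In this sketch, row count stands in for processing cost. A real system would more plausibly weigh segments by estimated transformation time and account for each node's capacity.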
Keywords/Search Tags: Distributed ETL, Architecture, Data Consistency, Optimization Strategy, Data Partition, Load Balancing