
Theoretical Research Of ETL And OLAP Based On Data Warehouse

Posted on: 2009-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: L Peng
Full Text: PDF
GTID: 2178360245455206
Subject: Computer application technology
Abstract/Summary:
A data warehouse stores very large volumes of data. On the one hand, building a data warehouse requires an ETL (Extract/Transform/Load) process to obtain comprehensive, accurate, high-quality data that can support decision-making. On the other hand, accessing large amounts of warehouse data efficiently requires OLAP (On-Line Analytical Processing) tools that present the data comprehensively and flexibly.

On the ETL side, this paper focuses on optimizing the process and on improving methods for detecting approximate duplicate records. A new problem has emerged in data warehouses: approximate duplicate data, caused by the rapid growth in data volume, poses a serious hidden threat to the quality of a modern data warehouse. If the traditional ETL process is still used in this situation, problems arise such as unclear responsibilities for each phase, a great deal of duplicated work, and inferior data quality. To address this, the paper presents a framework for optimizing the ETL process, the EICLF (Extracting\Integrating\Cleaning\Loading\Feedback) process, which decomposes the transformation task into two phases, an integrating phase and a cleaning phase, to improve the quality of the data loaded into the data warehouse. Because the traditional ETL process has no mechanism for feeding erroneous data back to the data source, the paper also introduces a data feedback stage to complete the ETL process. In addition, the paper studies approximate duplicate records and analyzes several commonly used algorithms, such as Nested Loop (NL), Multi-Pass Sorted-Neighborhood (MPN), and the Position Coding Method (PCM). It then proposes an improved method, the Records Division Method, which selects the optimal field for a division sort, so that identical records are gathered together and different records are separated to some extent.
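As a rough illustration of the sorted-neighborhood idea that the abstract names, the following Python sketch sorts records on a key and compares each record only with its nearest neighbours in sorted order. It is a single-pass simplification, not the thesis's Records Division Method, whose details the abstract does not give. The function names, the sample records, the window size, the similarity threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two strings (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def sorted_neighborhood(records, key, window=3, threshold=0.85):
    """Return index pairs of records that look like approximate duplicates.

    records   -- list of strings (hypothetical flattened records)
    key       -- function that builds the sort key from a record
    window    -- how many consecutive sorted records form one comparison window
    threshold -- minimum similarity needed to report a pair (illustrative value)
    """
    # Sort record indices by key so that similar records end up close together.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        # Compare only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data containing two near-duplicate pairs.
records = [
    "John Smith, 12 Baker Street, London",
    "Maria Garcia, 5 Elm Road, Madrid",
    "Jon Smith, 12 Baker Street, London",
    "Maria Garcia, 5 Elm Rd, Madrid",
]
duplicates = sorted_neighborhood(records, key=lambda r: r.lower())
print(duplicates)
```

With this sample data the sketch reports the pairs (0, 2) and (1, 3). The choice of sort key matters: records that differ early in the key can sort far apart and fall outside the window, which is one reason multi-pass variants sort on several different keys.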
Experiments show that data processed through the EICLF process is of substantially higher quality.

On the OLAP side, this paper studies two index technologies commonly used in current data warehouses: the B-Tree index and the bitmap index. It points out their limitations, analyzes the bottlenecks encountered by the bitmap index, proposes an extended form of the bitmap index called the identifier index, and compares the performance of the bitmap index and the identifier index to demonstrate the latter's superiority. This work is intended to serve as a reference for building data warehouses and presenting their data.
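For readers unfamiliar with bitmap indexes, the following minimal Python sketch shows the basic technique: one bitmap per distinct column value, with multi-condition queries answered by bitwise operations. It does not attempt the identifier index proposed in the thesis, since the abstract does not describe its structure. The class name, column names, and sample values are hypothetical.

```python
class BitmapIndex:
    """Toy bitmap index over one column; each bitmap is a Python int."""

    def __init__(self, values):
        self.size = len(values)
        self.bitmaps = {}
        for row, value in enumerate(values):
            # Set bit `row` in the bitmap that belongs to this value.
            self.bitmaps[value] = self.bitmaps.get(value, 0) | (1 << row)

    def lookup(self, value):
        """Return the bitmap for a value (0 if the value never occurs)."""
        return self.bitmaps.get(value, 0)

    def rows(self, bitmap):
        """Decode a bitmap back into a list of row numbers."""
        return [row for row in range(self.size) if (bitmap >> row) & 1]


# Hypothetical fact-table columns with few distinct values each.
region = ["East", "West", "East", "North", "East"]
status = ["open", "open", "closed", "open", "open"]

region_index = BitmapIndex(region)
status_index = BitmapIndex(status)

# WHERE region = 'East' AND status = 'open' becomes a single bitwise AND.
combined = region_index.lookup("East") & status_index.lookup("open")
matching_rows = region_index.rows(combined)
print(matching_rows)
```

Here the query matches rows 0 and 4. Because the index keeps one bitmap per distinct value, its size grows with the number of distinct values in the column, which is a commonly cited limitation of bitmap indexes on high-cardinality columns.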
Keywords/Search Tags: Data Warehouse, Extract/Transform/Load process, On-Line Analytical Processing, Bitmap index, Identifier index