Research On Scientific Data De-duplication Model In Real-time Data Warehousing Environment

Posted on: 2009-10-10
Degree: Master
Type: Thesis
Country: China
Candidate: H Ding
Full Text: PDF
GTID: 2198360308479279
Subject: Computer application technology
Abstract/Summary:
In the information age, making the right decisions has become a powerful competitive weapon. To support strategic planning, business processes, and business decisions, companies and organizations are building data warehouses. However, a data warehouse draws large amounts of data from a variety of sources, and the probability that some of those sources contain "dirty" data is very high. Because data warehouses are used for decision-making, data quality is crucial to avoiding wrong decisions. Duplicate data is an important factor affecting data quality: it not only causes redundancy, increasing the data volume and the load on the data warehouse, but also seriously distorts analysis and decision-making. Data de-duplication is therefore an indispensable way to improve data quality in a data warehouse.
The real-time data warehouse (RTDWH) is a new research direction in data warehouse technology. In an RTDWH, any change to a data source is automatically and immediately propagated into the data warehouse. The development of RTDWH has also brought new challenges for guaranteeing data quality. Changes to the data sources are reflected in the data warehouse immediately through real-time ETL, which requires data quality to be guaranteed in real time to support query and analysis. However, most previous research on data quality is based on the traditional data warehouse. A new scheduling approach is therefore needed to ensure the credibility of the data in the data warehouse more accurately and efficiently.
This thesis presents a general de-duplication model for scientific data (SD2M). It introduces the characteristics of scientific data, describes the de-duplication model in detail, proves that the traditional "sort and merge" method is not suitable for scientific data, details the model's algorithm, and introduces the model's scheduling process and architecture. It then analyzes the difficulties and problems of data quality assurance in the real-time data warehouse; presents a de-duplication priority scheduling strategy, a real-time scheduling strategy, and an ETL priority scheduling strategy for the real-time environment; and analyzes each in turn. For the ETL priority scheduling strategy, the evaluation indexes De-duplication Busy Degree and Cumulative Delay are defined, and a time-based scheduling strategy and an event-based scheduling strategy are presented, which apply SD2M in the real-time data warehouse. Finally, experimental results show that the model is highly efficient and stable, and that the scheduling strategies proposed for the real-time environment are reasonable.
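For readers unfamiliar with the traditional "sort and merge" baseline that the abstract mentions, the following is a minimal sketch of the generic technique only: sort the records by a comparison key, then collapse adjacent records whose keys are equal. It is not the thesis's algorithm, which the abstract does not specify. The function name `sort_and_merge_dedup`, the `key` parameter, and the sample records are all hypothetical and chosen purely for illustration.

```python
from typing import Any, Callable, Iterable, List, TypeVar

T = TypeVar("T")


def sort_and_merge_dedup(records: Iterable[T], key: Callable[[T], Any]) -> List[T]:
    """Generic sort-and-merge de-duplication (illustrative baseline only).

    Assumption: two records are duplicates exactly when their keys are equal,
    and keys are mutually comparable so that the records can be sorted.
    """
    # Step 1 (sort): bring records with equal keys next to each other.
    ordered = sorted(records, key=key)

    # Step 2 (merge): scan once, keeping the first record of each run of equal keys.
    unique: List[T] = []
    for record in ordered:
        if unique and key(unique[-1]) == key(record):
            continue  # same key as the previously kept record: treat as duplicate
        unique.append(record)
    return unique


# Hypothetical sample: (sensor id, reading) pairs with one exact duplicate.
sample = [("s1", 3.2), ("s2", 1.0), ("s1", 3.2)]
deduplicated = sort_and_merge_dedup(sample, key=lambda r: r)
```

In this sketch the whole input must be sorted before any duplicate can be detected, so each run costs O(n log n) over the full batch. How that property interacts with scientific data and real-time loading is the subject of the thesis itself and is not derived here.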
Keywords/Search Tags:data de-duplication, scientific data, architecture, real-time data warehousing, scheduling strategy