Research On Scientific Data De-duplication Model In Real-time Data Warehousing Environment

Posted on:2009-10-10

Degree:Master

Type:Thesis

Country:China

Candidate:H Ding

Full Text:PDF

GTID:2198360308479279

Subject:Computer application technology

Abstract/Summary:

In the information age, making the right decision has become a powerful competitive weapon. Companies and organizations build data warehouses to support their strategic business plans, processes, and decisions. However, a data warehouse draws large amounts of data from a variety of data sources, and the probability that some of those sources contain "dirty" data is very high. Because data warehouses are used for decision-making, data quality is crucial to avoiding wrong decisions. Duplicate data is an important factor affecting data quality: it not only causes redundancy, increasing the volume of data and the load on the data warehouse, but also seriously affects analysis and decision-making. Data de-duplication is therefore an indispensable way to improve data quality in a data warehouse.

The real-time data warehouse (RTDWH) is a new research direction in data warehouse technology. In an RTDWH, any change in a data source is automatically and immediately propagated to the data warehouse. The development of RTDWH has also brought new challenges for guaranteeing data quality. Changes in the data sources are reflected in the warehouse immediately by real-time ETL, so data quality must be guaranteed in real time to support queries and analysis. Previous research on data quality, however, is mostly based on the traditional data warehouse. A new scheduling approach is therefore needed to ensure the credibility of the data in the data warehouse more accurately and efficiently.

This thesis presents a general de-duplication model for scientific data. It introduces the characteristics of scientific data, describes the de-duplication model in detail, proves that the traditional "sort and merge" method is not suitable for scientific data, details the model's algorithm, and introduces the model's scheduling process and architecture. It then analyses the difficulties and problems of data quality assurance in the real-time data warehouse, presents three scheduling strategies for the real-time environment (de-duplication priority scheduling, real-time scheduling, and ETL priority scheduling), and analyses each one. For the ETL priority scheduling strategy, evaluation indexes such as De-duplication Busy Degree and Cumulative Delay are defined, and a time-based scheduling strategy and an event-based scheduling strategy are presented, which apply the de-duplication model (SD2M) in the real-time data warehouse. Finally, experimental results show that the model is efficient and stable and that the scheduling strategies proposed for the real-time environment are reasonable.
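
For readers unfamiliar with the traditional "sort and merge" method named in the abstract, the following is a minimal Python sketch of the general idea: sort the records by a key, then collapse each run of adjacent records that share that key. It is not the thesis's model or algorithm. The function name `sort_and_merge`, the record layout, and the sample readings are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def sort_and_merge(records: Iterable[T], key: Callable[[T], object]) -> List[T]:
    """Sort records by `key`, then keep the first record of each run of equal keys."""
    deduped: List[T] = []
    last_key = object()  # sentinel that compares unequal to every real key
    for rec in sorted(records, key=key):
        k = key(rec)
        if k != last_key:
            # First record with this key: keep it and remember the key.
            deduped.append(rec)
            last_key = k
    return deduped


# Hypothetical readings: (station id, timestamp, value).
readings = [
    ("s2", 10, 3.5),
    ("s1", 10, 1.0),
    ("s2", 10, 3.5),  # duplicate of the first reading
    ("s1", 11, 1.2),
]

# Treat two readings as duplicates when station id and timestamp match.
unique = sort_and_merge(readings, key=lambda r: (r[0], r[1]))
print(unique)
```

The sketch relies on exact key equality after sorting, which is the usual assumption behind this family of methods; the abstract does not state why that approach fails for scientific data, so no reason is assumed here.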

Keywords/Search Tags:

data de-duplication, scientific data, architecture, real-time data warehousing, scheduling strategy

Related items

1	Research On Scientific Data De-duplication Model In Real-time Data Warehousing Environment
2	Design And Implementation Of Real-time Data Extraction Mechanism In Data Warehousing
3	Research On Data De-duplication Based Real-time Backup And Recovery System
4	Study On Key Techniques Of The Real-Time Data Warehouse Based On Mapreduce Architecture
5	Research On Key Techniques Of Data Warehousing And ETL For Multi-type Data Sources
6	Research And Implementation On Query And Update Scheduling In Real-Time Data Warehouses
7	Research On The Qos-based Scheduling Between Updates And Queries In Real-time Data Warehouse
8	Research On The QoS-Based Scheduling Between Updates And Queries In Real-Time Data Warehouse
9	A Study On Data Broadcast Strategy In Mobile Real Time Database System
10	Research On Real-time Data Broadcast Scheduling And Index Organizing Strategy