Domain-independent de-duplication in data warehouse cleaning

Posted on:2003-05-18

Degree:M.Sc

Type:Thesis

University:University of Windsor (Canada)

Candidate:Udechukwu, Ajumobi Okwuchukwu

GTID:2468390011489572

Subject:Computer Science

Abstract/Summary:

Many organizations collect large amounts of data to support their business and decision-making processes. The data collected originate from a variety of sources that may have inherent data quality problems. These problems become more pronounced when heterogeneous data sources are integrated to build data warehouses. Data warehouses, which integrate huge amounts of data from a number of heterogeneous data sources, are used to support decision-making and on-line analytical processing. The integrated databases inherit the data quality problems that were present in the source databases, and also have data quality problems arising from the integration process. The data in the integrated systems (especially data warehouses) need to be cleaned for reliable decision support querying.

A major problem that arises from integrating different databases is the existence of duplicates. The challenge of de-duplication is identifying “equivalent” records within the database. Most published research in de-duplication proposes techniques that rely heavily on domain knowledge. A few other studies propose solutions that are partially domain-independent. This thesis identifies two levels of domain-independence in de-duplication: domain-independence at the attribute level and domain-independence at the record level. The thesis then proposes a positional algorithm that achieves domain-independent de-duplication at the attribute level. The thesis also proposes a technique for field weighting by data profiling, which, when used with the positional algorithm, achieves domain-independent de-duplication at the record level. Experiments show that the positional algorithm achieves more accurate de-duplication than the existing algorithms. Experiments also show that the data profiling technique for field weighting effectively assigns field weights for de-duplication purposes.

Keywords/Search Tags:

De-duplication, Field weighting, Data quality problems, Data profiling, Positional algorithm, Heterogeneous data sources

Related items

1	Research And Implementation Of The Data Quality Control Methods In Integrating Heterogeneous Data Sources
2	The Design And Implementation Of Traffic Data Service System Based On Heterogeneous Data Source
3	Heterogeneous Data Sources Integration In Research And Application Of The Cleaning Strategy
4	Design And Realization Of SyncML-based Synchronization Method For Heterogeneous Data Sources In Mobile Computing Environment
5	Uniform and high-level intelligent access to heterogeneous information sources
6	Research On Data Organization For Data De-duplication System
7	Research On Duplicate Record Detection Algorithms In Heterogeneous Data Sources
8	Design & Realization Of The Data Collection System With Heterogeneous Data Sources In Analyzing Station
9	Research On Ontology Based Retrieval Technology Between Heterogeneous Data Sources
10	Design And Implementation Of Heterogeneous Data Sources Integration Middleware