
Study On Data Dependency-Based Data Quality Processing Techniques In Data Integration

Posted on: 2013-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: H X Miao
Full Text: PDF
GTID: 2298330467478142
Subject: Computer software and theory

Abstract/Summary:
Data is the carrier of information. With the continuous development of information technology, organizations in every field have accumulated large volumes of data. To make better use of data for analysis and decision making, the data must be correct and reliable, which creates an urgent demand for high-quality data. This thesis studies the data quality problems that arise in data integration and proposes data quality processing techniques based on data dependencies.

First, the thesis reviews the research status of data quality in detail, including the definition, evaluation, and classification of data quality as well as methods for improving it, with particular attention to the data quality problems in data integration. Among the quality factors that have been proposed, and in view of the characteristics of data quality problems in data integration, the thesis analyzes two main ones: data consistency and data uniqueness.

For data consistency, the thesis studies data quality processing techniques based on conditional dependencies. It analyzes the shortcomings of existing work on consistency processing, defines the semantics of inconsistent data with respect to conditional inclusion dependencies, and formulates the corresponding repair rules. On this basis, repair algorithms are proposed that handle both inconsistent data within a single data source and inconsistent data across data sources more effectively.

For data uniqueness, the thesis studies data quality processing techniques based on copy dependencies. Data integration usually produces many approximately duplicate records, and detecting and merging these records ensures the uniqueness of the data. So far, techniques for merging approximately duplicate records have received little study. Because copy dependencies commonly exist between data sources in data integration, the thesis determines the correct value among approximately duplicate records by exploiting a copy dependency graph, thereby ensuring the uniqueness of the data. Furthermore, the thesis studies the performance of the uniqueness processing algorithm and proposes an optimization method.

Finally, the thesis conducts comprehensive experimental verification and analysis of each of the above processing techniques. The results show that the proposed techniques play a significant role in improving data quality in data integration.
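To make the consistency checking described above concrete, the following Python sketch tests a conditional inclusion dependency (CIND) between two sources: every tuple of one relation that matches a pattern must have its projected values appear in the other relation among tuples matching a second pattern. The relation names, attributes, and patterns are illustrative assumptions; the thesis's actual repair rules and algorithms are not reproduced here.

```python
# Hypothetical sketch of checking a conditional inclusion dependency (CIND)
# R1[X; Xp] <= R2[Y; Yp]: every R1 tuple matching pattern Xp must have its
# X-values present among the Y-values of R2 tuples matching pattern Yp.

def cind_violations(r1, r2, x_attrs, y_attrs, x_pattern, y_pattern):
    """Return the R1 tuples that violate the CIND.

    r1, r2     : lists of dicts (one dict per record)
    x_attrs    : attributes of R1 whose values must be included in R2
    y_attrs    : corresponding attributes of R2
    x_pattern  : attribute -> required constant on the R1 side
    y_pattern  : attribute -> required constant on the R2 side
    """
    def matches(t, pattern):
        return all(t.get(a) == v for a, v in pattern.items())

    # Admissible Y-values: projections of R2 tuples matching the pattern.
    targets = {tuple(t[a] for a in y_attrs)
               for t in r2 if matches(t, y_pattern)}

    return [t for t in r1
            if matches(t, x_pattern)
            and tuple(t[a] for a in x_attrs) not in targets]


# Example: every order of type "book" must reference a book in stock.
orders = [{"item": "TCP/IP Illustrated", "type": "book"},
          {"item": "Unknown Title", "type": "book"}]
stock = [{"title": "TCP/IP Illustrated", "category": "book"}]
print(cind_violations(orders, stock,
                      ["item"], ["title"],
                      {"type": "book"}, {"category": "book"}))
# -> [{'item': 'Unknown Title', 'type': 'book'}]
```

A repair algorithm in the spirit of the thesis would then act on the returned violations, for example by correcting or removing the offending tuples according to the repair rules.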
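The uniqueness technique can likewise be illustrated with a small sketch: when approximately duplicate records disagree on a value, claims from sources that merely copy another source are discounted. The source names, copy graph, and simple majority-vote rule below are illustrative assumptions, not the thesis's exact value-determination algorithm.

```python
# Hypothetical sketch of determining one attribute's value across
# approximately duplicate records using a copy dependency graph.
# "copies_from[a] = b" means source a copies its data from source b,
# so a's claim adds no independent evidence.

from collections import Counter

def resolve_value(claims, copies_from):
    """Pick the value with the most support from independent sources.

    claims      : dict source -> value the source reports
    copies_from : dict copier -> source it copies from
    """
    # Only sources that copy from no one cast an independent vote;
    # copiers merely echo their origin and are discounted.
    independent = [value for source, value in claims.items()
                   if source not in copies_from]
    return Counter(independent).most_common(1)[0][0]


# Five duplicate records disagree on a phone number. s2 and s3 copy from
# s1, so "555-0100" has only one independent supporter and loses to
# "555-0199" despite occurring more often in the raw records.
claims = {"s1": "555-0100", "s2": "555-0100", "s3": "555-0100",
          "s4": "555-0199", "s5": "555-0199"}
copies_from = {"s2": "s1", "s3": "s1"}
print(resolve_value(claims, copies_from))  # -> 555-0199
```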
Keywords/Search Tags: Data quality, data consistency, data uniqueness, conditional dependencies, copy dependency