
Study On Data Dependency-Based Data Quality Processing Techniques In Data Integration

Posted on: 2013-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: H X Miao
Full Text: PDF
GTID: 2298330467478142
Subject: Computer software and theory

Abstract/Summary:
Data is the carrier of information. With the continuous development of information technology, organizations in every field have accumulated large volumes of data. To make better use of data for analysis and decision making, the data must be correct and reliable, which creates an urgent demand for high-quality data. This thesis studies the data quality problems that arise in data integration and proposes data quality processing techniques based on data dependencies.

First, the thesis reviews the research status of data quality in detail, including the definition, evaluation, and classification of data quality as well as methods for improving it, with particular attention to the data quality problems in data integration. Among the quality factors that have been proposed, and in view of the characteristics of data quality problems in data integration, the thesis analyzes two main ones: data consistency and data uniqueness.

For data consistency, the thesis studies data quality processing techniques based on conditional dependencies. It analyzes the shortcomings of existing work on consistency processing, defines the semantics of inconsistent data with respect to conditional inclusion dependencies, and formulates the corresponding repair rules. On this basis, repair algorithms are proposed that handle both inconsistent data within a single data source and inconsistent data across data sources more effectively.

For data uniqueness, the thesis studies data quality processing techniques based on copy dependencies. Data integration usually produces many approximately duplicate records, and detecting and merging these records ensures the uniqueness of the data. So far, techniques for merging approximately duplicate records have received little study. Because copy dependencies commonly exist between data sources in data integration, the thesis determines the correct value among approximately duplicate records by exploiting a copy dependency graph, thereby ensuring the uniqueness of the data. Furthermore, the thesis studies the performance of the uniqueness processing algorithm and proposes an optimization method.

Finally, the thesis conducts comprehensive experimental verification and analysis of each of the above processing techniques. The results show that the proposed techniques play a significant role in improving data quality in data integration.
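To make the consistency checking described above concrete, the following Python sketch tests a conditional inclusion dependency (CIND) between two sources: every tuple of one relation that matches a pattern must have its projected values appear in the other relation among tuples matching a second pattern. The relation names, attributes, and patterns are illustrative assumptions; the thesis's actual repair rules and algorithms are not reproduced here.

```python
# Hypothetical sketch of checking a conditional inclusion dependency (CIND)
# R1[X; Xp] <= R2[Y; Yp]: every R1 tuple matching pattern Xp must have its
# X-values present among the Y-values of R2 tuples matching pattern Yp.

def cind_violations(r1, r2, x_attrs, y_attrs, x_pattern, y_pattern):
    """Return the R1 tuples that violate the CIND.

    r1, r2     : lists of dicts (one dict per record)
    x_attrs    : attributes of R1 whose values must be included in R2
    y_attrs    : corresponding attributes of R2
    x_pattern  : attribute -> required constant on the R1 side
    y_pattern  : attribute -> required constant on the R2 side
    """
    def matches(t, pattern):
        return all(t.get(a) == v for a, v in pattern.items())

    # Admissible Y-values: projections of R2 tuples matching the pattern.
    targets = {tuple(t[a] for a in y_attrs)
               for t in r2 if matches(t, y_pattern)}

    return [t for t in r1
            if matches(t, x_pattern)
            and tuple(t[a] for a in x_attrs) not in targets]


# Example: every order of type "book" must reference a book in stock.
orders = [{"item": "TCP/IP Illustrated", "type": "book"},
          {"item": "Unknown Title", "type": "book"}]
stock = [{"title": "TCP/IP Illustrated", "category": "book"}]
print(cind_violations(orders, stock,
                      ["item"], ["title"],
                      {"type": "book"}, {"category": "book"}))
# -> [{'item': 'Unknown Title', 'type': 'book'}]
```

A repair algorithm in the spirit of the thesis would then act on the returned violations, for example by correcting or removing the offending tuples according to the repair rules.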
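The uniqueness technique can likewise be illustrated with a small sketch: when approximately duplicate records disagree on a value, claims from sources that merely copy another source are discounted. The source names, copy graph, and simple majority-vote rule below are illustrative assumptions, not the thesis's exact value-determination algorithm.

```python
# Hypothetical sketch of determining one attribute's value across
# approximately duplicate records using a copy dependency graph.
# "copies_from[a] = b" means source a copies its data from source b,
# so a's claim adds no independent evidence.

from collections import Counter

def resolve_value(claims, copies_from):
    """Pick the value with the most support from independent sources.

    claims      : dict source -> value the source reports
    copies_from : dict copier -> source it copies from
    """
    # Only sources that copy from no one cast an independent vote;
    # copiers merely echo their origin and are discounted.
    independent = [value for source, value in claims.items()
                   if source not in copies_from]
    return Counter(independent).most_common(1)[0][0]


# Five duplicate records disagree on a phone number. s2 and s3 copy from
# s1, so "555-0100" has only one independent supporter and loses to
# "555-0199" despite occurring more often in the raw records.
claims = {"s1": "555-0100", "s2": "555-0100", "s3": "555-0100",
          "s4": "555-0199", "s5": "555-0199"}
copies_from = {"s2": "s1", "s3": "s1"}
print(resolve_value(claims, copies_from))  # -> 555-0199
```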
Keywords/Search Tags: Data quality, data consistency, data uniqueness, conditional dependencies, copy dependency