
Data Cleaning In Data Integration

Posted on: 2011-04-12
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Liu
Full Text: PDF
GTID: 1118330332469252
Subject: Computer software and theory

Abstract/Summary:
Data integration collects data of varied sources, formats, and semantics, integrates it physically or logically, and provides a unified view for accessing it. Because of the large volumes of data and the increasing complexity of business intelligence requirements, it is hard to ensure the integrity, consistency, and accuracy of the data, and data quality problems make data integration projects error-prone and labor-intensive to develop. Integrity constraints give users a declarative way to define data dependencies that ensure consistency, and a sound theoretical basis exists for reasoning about their implication. Inducing and mining data quality rules on top of constraint theory is an active research area. This thesis targets this problem in the integration scenario and presents new methods to detect and clean data automatically and efficiently.

First, we present an original method that induces data quality constraints for the data sources from the constraints defined on the target database. The quality problems in a data source may go beyond what designers anticipated when the validation and transformation rules were specified, causing the load of the target database to fail due to constraint violations or allowing dirty data to flow into the target. Because of the large data volumes, and because data may have to be transferred between distributed servers, debugging a data integration flow (DIF) by executing it is costly. We design a general framework for this problem, called Backwards Constraint Propagation (BCP), which automatically analyzes a DIF, generates data quality rules from the constraints defined in the data warehouse (DW), and propagates them backwards from the target to the sources. The derived rules can be used to detect exceptional data in the sources and to help designers improve the DIFs. BCP supports most relational algebra operators and data transformation functions through a set of constraint propagation rules. Case studies and experiments demonstrate the correctness and efficiency of BCP.

Second, we present a method that automatically filters inconsistent attributes from data sources based on virtual repair by NULL. Although integrity constraints capture data semantics well, the actual data in a database often violates them. When a DIF can be transformed into a relational algebra query, consistent query answering (CQA) can be applied to obtain answers that hold in every minimal repair of the inconsistent database. However, for most constraints and queries CQA has been proved intractable when repairs are based on tuple deletions or insertions, and deleting tuples also loses information. We therefore propose a new repair semantics, repairing with nulls, which replaces inconsistent attribute values with nulls. To capture all inconsistent attribute values, we study the transitivity of nulls and give an algorithm that extends the original constraints. Under repairing with nulls there is exactly one repair, and CQA can be computed in PTIME through SQL query rewriting. We evaluate the performance of this approach with detailed experiments.
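To make the first contribution concrete, the following is a minimal sketch of backwards constraint propagation over a toy operator model; the operator classes, predicates, and column names are illustrative assumptions, not the actual BCP rule set.

```python
# Minimal sketch of backwards constraint propagation (BCP) over a toy
# operator model. Operator classes, predicates, and column names are
# illustrative assumptions, not the thesis's actual rule set.
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraint:
    column: str       # column the data quality rule applies to
    predicate: str    # e.g. "IS NOT NULL" or "> 0"

@dataclass(frozen=True)
class RenameOp:       # target column is a plain copy/rename of a source column
    target: str
    source: str

@dataclass(frozen=True)
class AddConstantOp:  # target = source + constant (simple transformation)
    target: str
    source: str
    constant: float

def propagate_backwards(constraint, op):
    """Rewrite a target-side constraint into an equivalent source-side rule."""
    if constraint.column != op.target:
        return constraint                       # operator does not touch this column
    if isinstance(op, RenameOp):
        return Constraint(op.source, constraint.predicate)
    if isinstance(op, AddConstantOp):
        if constraint.predicate == "IS NOT NULL":
            # NULL survives arithmetic, so NOT NULL maps straight back.
            return Constraint(op.source, "IS NOT NULL")
        if constraint.predicate.startswith(">"):
            bound = float(constraint.predicate[1:]) - op.constant
            return Constraint(op.source, f"> {bound}")
    return None                                 # no rule applies: flag for the designer

# DW constraints on the target column "revenue"; the DIF computes
# revenue = net_amount + 100, so both rules map back onto "net_amount".
dw_rules = [Constraint("revenue", "IS NOT NULL"), Constraint("revenue", "> 0")]
flow = [AddConstantOp(target="revenue", source="net_amount", constant=100.0)]

for rule in dw_rules:
    for op in reversed(flow):                   # walk the DIF from target to sources
        rule = propagate_backwards(rule, op)
        if rule is None:
            break
    print(rule)   # net_amount IS NOT NULL; net_amount > -100.0
```

The derived source-side rules can then be checked against the raw data before the DIF ever runs, which is the point of propagating backwards rather than validating only at load time.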
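The repair-with-nulls semantics of the second contribution can likewise be sketched with a single functional dependency; the table, the dependency, and the view-based rewriting below are hypothetical examples rather than the thesis's actual rewriting rules.

```python
# Sketch of "repairing with nulls" for one functional dependency
# emp: name -> dept. Attribute values involved in a violation are read
# as NULL, giving a single (virtual) repair that plain SQL can query.
# Table, data, and rewriting are assumptions made for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp(name TEXT, dept TEXT, salary INT);
    INSERT INTO emp VALUES
        ('alice', 'sales',   50),
        ('alice', 'finance', 50),   -- conflicts with the first tuple
        ('bob',   'it',      70);

    -- The unique repair: dept is nulled out wherever name -> dept is violated.
    CREATE VIEW emp_repaired AS
    SELECT name,
           CASE WHEN name IN (SELECT name FROM emp
                              GROUP BY name
                              HAVING COUNT(DISTINCT dept) > 1)
                THEN NULL ELSE dept END AS dept,
           salary
    FROM emp;
""")

# Queries rewritten against the repaired view return only certain values:
# alice's conflicting dept comes back as NULL, bob's dept is kept.
for row in conn.execute("SELECT DISTINCT name, dept FROM emp_repaired"):
    print(row)
```

Because there is exactly one repair, the rewritten query is an ordinary SQL query over the view, which is what keeps consistent query answering in PTIME under this semantics.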
Third, we enhance the performance of data cleaning processes by automatically refactoring the structure of their data flows. A set of operational-semantics features is first selected to annotate the operators in a data flow, and refactoring rules are defined to generate all candidate semantically equivalent data flows. A heuristic algorithm then searches accurately and quickly for the data flow with the minimal execution time by constructing a partially ordered set of data flows based on their cost estimates. To validate the framework, we apply it to mashups. Mashup tools usually let end users build complex mashups quickly and graphically, using pipes to connect web data sources into a data flow. Because end users have varying degrees of technical expertise, the data flows they design may be inefficient, which increases the response time of the mashups. A case study shows that the framework is applicable to general mashup data flows without complete knowledge of the operational semantics of their operators, and experiments demonstrate the efficiency improvement.

Finally, we investigate a model-driven development method for data integration processes and implement a development platform. The details of realizing the research work above in this system are discussed.
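Returning to the data flow refactoring framework of the third contribution, the sketch below conveys the idea of cost-based reordering; the operator annotations, the cost model, and the brute-force enumeration (which stands in for the heuristic search over a partially ordered set) are simplified assumptions.

```python
# Sketch of cost-based data flow refactoring: operators carry selectivity
# and per-row cost annotations, equivalent orderings are enumerated, and
# the cheapest one is kept. Annotations and cost model are assumptions.
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class Operator:
    name: str
    selectivity: float    # fraction of input rows passed downstream
    cost_per_row: float   # estimated processing cost per input row
    commutes: bool        # True if it may be reordered with other commuting ops

def flow_cost(flow, input_rows=100_000):
    """Estimate the total cost of running the operators in sequence."""
    rows, total = input_rows, 0.0
    for op in flow:
        total += rows * op.cost_per_row
        rows *= op.selectivity
    return total

def refactor(flow):
    """Generate equivalent orderings (only commuting operators move) and
    return the candidate with the lowest estimated cost."""
    movable = [op for op in flow if op.commutes]
    fixed = [(i, op) for i, op in enumerate(flow) if not op.commutes]
    best, best_cost = list(flow), flow_cost(flow)
    for perm in permutations(movable):
        candidate = list(perm)
        for i, op in fixed:
            candidate.insert(i, op)             # keep non-commuting ops in place
        cost = flow_cost(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost

# A mashup-style pipeline: pushing the cheap, selective filter ahead of the
# expensive deduplication and web enrichment steps lowers the estimated cost.
pipeline = [
    Operator("dedupe",        selectivity=0.9, cost_per_row=5.0,  commutes=True),
    Operator("filter_region", selectivity=0.1, cost_per_row=0.5,  commutes=True),
    Operator("web_enrich",    selectivity=1.0, cost_per_row=20.0, commutes=False),
]
best, cost = refactor(pipeline)
print([op.name for op in best], cost)   # filter first, much lower estimate
```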
Keywords/Search Tags: Data integration, data quality, integrity constraint, data warehouse, data cleaning, optimization