Font Size: a A A

Inconsistent Data Query Processing

Posted on:2011-10-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:A H WuFull Text:PDF
GTID:1118360305997328Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Data violating its integrity constratints are called inconsistent data. Although integrities are used to avoid inconsistency for long time, but because of this or that, inconsistent data still exist widely in real applications from data integration, data exchange, data mining, data extraction, science data management over relational database, to data integration and exchang on web by XML.Inconsistent data implies invalid information, and query answers over such data are also possible invalid. In this paper, we introduce a concept of Annotation Based Query Answer, and a method for its computation, which can answer queries on relational databases that may violate a set of functional dependencies.In this approach, inconsistency is viewed as a property of data and described with annotations. To be more precise, every piece of data in a relation can have zero or more annotations with it and annotations are propagated along with queries from the source to the output. To calculate query answers on an annotated database, we propose an algorithm to annotate the input tables, and define seven basic algebra operations so that annotations can be correctly propagated as the valid set of integrity constraint changes during query processing. We also prove the soundness and completeness of the whole annotation computing system. With annotations, inconsistent data in both input tables and query answers can be marked out down to attribute level, and preserved, instead of being filtered in most previous works. Thus this approach can avoid information loss.To apply it in real applications, we need a method to implement the approach of Annotation Based Query Answer which is compatible to traditional data management software and can be easily embedded into existing database applications. In this paper, we propose a method of query rewriting to compute Annotation Based Query Answer for any given SQL query, and optimize the algorithms in different way for different envioment. We do a series of experiments both on TPC-H database and synthesized database to test effectiveness and applicability of our approach, and developed a prototype of our system.XML is complex in data model and flexible in grammar, and W3C doesn't give standard on strict constraints. Addtionally, XML are often used for data exchange and data integration on web. Therefore, inconsistent XML happens more frequently than relational data does. We introduce a repair-based strategy to answer queries in this paper, whose key point is an effective algorithm to compute an optimum repair for inconsistent XML document. But getting an optimum repair is a NP complete problem, especially when XML documents violate both function dependency and key dependency. This paper proposes a cost-based heuristic algorithm, which can find a repair with lowest cost in polynomial time. It first scans the original XML document once to get the inconsistent data, computes general candidate repairs for each inconsistent data, and gets a whole document repair heuristically based on its cost. Our experimental evaluation show that even when XML documents is large, with dirty elements in high percent, and against many different constraints, the algorithm will run in less than O(n3) w.r.t. the size of inconsistent elements.
Keywords/Search Tags:inconsistent data, repair, XML data cleaning, integrity constraint, uncertain data, query processing
PDF Full Text Request
Related items