Data Quality Assessment Model And Quality Propagation For Relational Database

Posted on: 2008-01-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W D Chen
Full Text: PDF
GTID: 1118360278456528
Subject: Management Science and Engineering
Abstract/Summary:
In the information era, with the wide spread of hardware and software technologies such as databases and networks, information systems are used to support both the routine tasks of organizations and their strategy and decision making. Applications of information systems have extended from a single application within a single organization to multiple applications across different organizations, and even across countries. As more and more data are used, databases are created under different types of data models, integrated from many sources, and generated in different ways. Data are more complex and more important than ever before, and organizations find it increasingly difficult to know and control all of the data that arrive through such complex channels. Data quality is a growing problem for organizations, leading to erroneous decisions and added expense.

Data quality has long been a recognized problem, and it has not yet been solved well. The central question is how to assess data quality scientifically and efficiently. The paper adopts the widely accepted definition that data quality is the degree of fitness for use by the user; of overall data quality and the quality of the user's data view, the latter is the more direct measure of fitness for use.

Building on the work of prior researchers, the paper addresses both the objective and the subjective aspects of data quality assessment. On the objective side, a data quality assessment model at the data item level is presented, and accuracy and completeness metrics are defined. Under this model and these definitions, the propagation effects of the basic relational algebra operations Selection, Projection, and Cartesian product are studied, and quantitative propagation theorems for these operations are presented.
On the subjective side, the paper studies the influence of context on data quality assessment and presents a model and a method for assessing data quality.

The main contributions of the paper are summarized as follows:

(1) A data quality assessment model at the data item level is presented. Based on Parssian's data quality model, the author presents a data quality assessment model at data item granularity. After in-depth analysis, the quality of a tuple is classified into five types: accurate, fuzzy, incorrect, mismember, and lost tuple. These quality types form a closed set under the relational algebra operations. The paper also gives a formal description of the accuracy and completeness metrics, treating the key and non-key attributes of a relation separately, and compares the model with Parssian's model in aspects such as model style and data quality characteristics.

(2) Data quality propagation theorems for the accuracy and completeness metrics are proved for the Selection operation. An in-depth analysis of quality quantification for a relation identifies the difference between the error and null-value rates before and after quantification: the attribute error and null-value rates before quantification represent the distribution of data quality, while those after quantification represent the assessment of data quality. Then, under the assumption that correct and incorrect data items are instantiated randomly (Assumptions 3.1 to 3.6), the quantitative relationships between the attribute error and null-value rates before and after quantification are proved (Theorems 3.2, 3.4).
The paper also discusses how the Selection operation affects quality propagation on a relation in different situations. For Selection on the primary key, the paper proves the quantitative relationships between the error and null-value rates before and after quantification (Theorems 3.5, 3.7) and proves that accuracy and completeness remain unchanged under key attribute selection (Theorems 3.6, 3.8). For Selection on non-primary-key attributes, the paper proves the quantitative relationships between the attribute error and null-value rates before and after quantification (Theorems 3.9, 3.13, 3.18), proves the propagation effects of the selected attribute on the other, unselected attributes (Theorems 3.10, 3.14, 3.19), and finally proves the quantitative propagation effects of the Selection operation on the accuracy and completeness of a relational database (Theorems 3.9, 3.11, 3.12, 3.15, 3.16, 3.17, 3.20).

(3) Data quality propagation theorems for the accuracy and completeness metrics are proved for the Projection operation. Three situations are considered and analyzed in depth. For projections that contain all primary key attributes and for projections that select candidate primary keys, the quantitative relationships between the attribute error and null-value rates before and after quantification are proved (Theorems 4.1, 4.3, 4.4), and then the quantitative propagation relationships for accuracy and completeness are proved (Theorems 4.2, 4.5). For projections that contain only part of the key attributes, all possible cases are analyzed in depth, the resulting problems are treated, and suggestions are given on how to deal with them.

(4) Data quality propagation theorems for the accuracy and completeness metrics are proved for the Cartesian product operation. The propagation effects on accuracy and completeness are revealed, and the quantitative relationships between the attribute error and null-value rates before and after quantification are proved (Theorems 5.1, 5.2).
The data quality propagation results for the Cartesian product are also proved (Theorems 5.3, 5.4).

(5) The attribute-level data quality model and its conclusions are compared with Parssian's data quality model. Parssian's model is a strict model at tuple granularity. Detailed comparison and analysis show that, from the quantification point of view, Parssian's model can be viewed as a vector, and data quality propagation can be treated as relational algebra operations acting on that vector. The attribute granularity model measures the quality of non-key attributes as well as key attributes, and it considers the accuracy and completeness metrics jointly, including their influence on each other. The paper also proves the quantitative relationships between the metrics of the two models (Theorems 3.21, 3.22). When a relation contains only key attributes, the metric definitions of the two models are exactly the same, which indicates that in this situation Parssian's metrics are a special case of the attribute model. For the Selection operation, when the selection condition is on key attributes, the two models reach the same conclusions. For the Cartesian product operation, when the relation has only key attributes, the conclusions on accuracy and completeness propagation are the same.

(6) Data quality is assessed with the context factor taken into account. For the subjective aspect of data quality assessment, the paper focuses on context as a quality factor. Based on the preference model and preference structure of decision analysis, a data quality assessment model and an assessment algorithm are presented.

The research shows that studying quality propagation for relational databases at attribute granularity is meaningful and valuable. Compared with research at tuple granularity, research at attribute granularity is deeper and more fine-grained, and its conclusions are more complete.
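To make the idea of attribute-level metrics and their propagation through Selection concrete, here is a minimal Python sketch. It is not the dissertation's model: the abstract does not state its formulas, so the definitions used here are simplifying assumptions. Completeness is taken as the share of non-null data items in an attribute, and accuracy as the share of items that are non-null and equal to a reference value. The relation `stored`, the reference relation `truth`, and the helper names `completeness`, `accuracy`, and `select` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def completeness(rows: List[Row], attr: str) -> float:
    """Assumed metric: share of data items in `attr` that are not null."""
    if not rows:
        return 1.0
    return sum(r[attr] is not None for r in rows) / len(rows)


def accuracy(rows: List[Row], truth: List[Row], attr: str) -> float:
    """Assumed metric: share of items that are non-null and match the reference."""
    if not rows:
        return 1.0
    hits = sum(
        r[attr] is not None and r[attr] == t[attr] for r, t in zip(rows, truth)
    )
    return hits / len(rows)


def select(
    rows: List[Row], truth: List[Row], pred: Callable[[Row], bool]
) -> Tuple[List[Row], List[Row]]:
    """Relational Selection on the stored rows, keeping reference rows aligned."""
    kept = [(r, t) for r, t in zip(rows, truth) if pred(r)]
    return [r for r, _ in kept], [t for _, t in kept]


# Hypothetical stored relation: one null item and one incorrect item in `city`.
stored: List[Row] = [
    {"id": "1", "city": "Paris"},
    {"id": "2", "city": None},
    {"id": "3", "city": "Rome"},
    {"id": "4", "city": "Oslo"},
]
# Hypothetical reference ("real world") values for the same tuples.
truth: List[Row] = [
    {"id": "1", "city": "Paris"},
    {"id": "2", "city": "Lyon"},
    {"id": "3", "city": "Milan"},
    {"id": "4", "city": "Oslo"},
]

print("base:", completeness(stored, "city"), accuracy(stored, truth, "city"))

# Selection on a non-key attribute: keep tuples whose `city` is not null.
sel, sel_truth = select(stored, truth, lambda r: r["city"] is not None)
print("selected:", completeness(sel, "city"), accuracy(sel, sel_truth, "city"))
```

In this toy relation, the measured completeness and accuracy of `city` differ between the base relation and the selection result. A propagation theorem would characterize that kind of shift quantitatively, under explicit assumptions about how errors and nulls are distributed.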
Keywords/Search Tags:Data Quality, Data Quality Assessment, Data Quality Model, Data Quality Framework, Relation Algebra, Quality Propagation, Accuracy, Completeness, Context