Studies On Quality Evaluation And Lineage Of Uncertain Data

Posted on: 2017-01-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: L Wang
Full Text: PDF
GTID: 1368330512454961
Subject: Computer software and theory
Abstract/Summary:
With the popularity of the Web and of data acquisition equipment, data grows exponentially without any assurance of high quality. Traditional data management assumes that data is correct and credible, but this assumption is not reasonable. Medical accidents, wrong company decisions, and false monitoring data have led to serious consequences. Improving data quality has important social and economic value because it reduces accidents and economic loss. Improving data quality is a continuous process, so probabilistic databases are often adopted to manage low quality data. Probabilistic databases are an extension of relational databases that use probability to represent data confidence, which serves as a measure of data quality. Measuring data quality and tracing low quality sources back from user query results are the premise of continuous quality improvement. Hence, studies on uncertain data quality evaluation and derivation have great social and economic significance.

Focusing on low quality data, in this paper we study how to improve the precision of quality evaluation for unstructured data, how to define attribute level lineage for query result derivation and confidence computation, and how to model uncertain data with attribute level lineage. Specifically, this paper mainly includes the following four works.

(1) Fact statement oriented description granularity improvement during quality evaluation
As imprecise descriptions of unstructured data lead to ambiguity, which increases the difficulty of probability computation, this paper proposes a method to find missing terms to improve the description granularity of unstructured data. It first collects missing term candidates with a search engine. Then it divides the candidates into several groups according to the correlations among them and selects the group with the highest possibility as the missing terms. During this process, a clustering algorithm and a knowledge base are used to eliminate incorrect missing terms.
For the selected missing terms, it computes the occurrence probability at every position of the corresponding fact statements, based on related data extracted from the Web by the search engine, to predict the insertion position. This method can improve the description granularity of fact statements and further improve the precision of confidence computation.

(2) Uncertain data storage model with attribute level lineage support
As existing uncertain data models are limited in expressing uncertain data with attribute level lineage, this paper proposes a new model based on the object deputy model. The new model transforms an uncertain tuple into multiple probabilistic objects. Combining different probabilistic objects yields probabilistic deputy objects, which represent the corresponding possible tuples of the uncertain tuple. Probabilistic objects and probabilistic deputy objects are connected by bilateral links; through these links and the deputy rules, the values of the attributes that a probabilistic deputy object inherits from probabilistic objects can be computed. The model therefore never needs to store inherited attribute values and avoids a lot of redundant storage. In addition, through the bilateral links, updates to probabilistic objects are reflected on their probabilistic deputy objects in a timely manner, which reduces the maintenance cost. Based on this model, we define a variety of data operations and the attribute level lineages obtained by them.

(3) Result probability computation based on attribute level lineage
As tuple level lineage cannot accurately locate the sources of results generated by tuples containing multiple uncertain attributes, this paper defines attribute expressions and uses them to construct lineage expressions for attribute level derivation. When computing the probabilities of result tuples, it proposes a lineage transformation algorithm to guarantee correctness.
To accelerate probability computation, it proposes a share path table construction method after analyzing the factors that affect the efficiency of probability computation, and further precomputes the probabilities of atomic disjunctions.

(4) Result probability computation of uncertain data with dependencies
As current methods for computing the probabilities of result tuples never consider different kinds of data correlations and schema constraints, this paper discusses the possible correlations and constraints among uncertain data and analyzes their properties in the "Probabilistic or-set-? table" model. To guarantee the correctness of joint probability computation, it defines a constraint-correlation graph to model the different kinds of data correlations and schema constraints. Then it proposes references for combining different kinds of data correlations and schema constraints. Using these references, potential correlations can be obtained from the explicit correlations and constraints. When computing the joint probability of multiple objects, it first infers the possible correlations among the objects and then applies different elimination rules to eliminate some of the objects, which makes the joint probability computation feasible.
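
The idea behind work (3), computing a result probability from a lineage expression, can be illustrated with a minimal sketch. It is not the lineage transformation algorithm or the share path table method of the dissertation. It assumes that a lineage expression is given in disjunctive normal form over independent Boolean events and evaluates it by brute-force enumeration of possible worlds. The function name `lineage_probability`, the event names `a1`, `b1`, `b2`, and all probabilities are hypothetical and chosen only for illustration.

```python
from itertools import product


def lineage_probability(dnf, prob):
    """Exact probability that a DNF lineage expression is true.

    dnf  : list of conjunctions; each conjunction is a list of
           (event_name, required_truth_value) literals
    prob : dict mapping event_name -> probability that the event is true
           (events are ASSUMED to be mutually independent)
    """
    events = sorted({event for clause in dnf for event, _ in clause})
    total = 0.0
    # Enumerate every possible world, i.e. every truth assignment.
    for world in product([True, False], repeat=len(events)):
        assignment = dict(zip(events, world))
        # Weight of this world under the independence assumption.
        weight = 1.0
        for event in events:
            weight *= prob[event] if assignment[event] else 1.0 - prob[event]
        # The result tuple exists in this world if any conjunction holds.
        if any(all(assignment[e] == value for e, value in clause)
               for clause in dnf):
            total += weight
    return total


# Hypothetical lineage of one result tuple: (a1 AND b1) OR (a1 AND b2)
dnf = [[("a1", True), ("b1", True)], [("a1", True), ("b2", True)]]
prob = {"a1": 0.8, "b1": 0.5, "b2": 0.4}

exact = lineage_probability(dnf, prob)
# Treating the two conjunctions as independent ignores the shared event a1.
naive = 1 - (1 - 0.8 * 0.5) * (1 - 0.8 * 0.4)
print(exact, naive)
```

In this example the exact probability is about 0.56, while the naive combination gives about 0.592 because the shared event `a1` is counted twice. Enumeration is exponential in the number of events, which is why more efficient computation methods are needed for realistic lineage expressions.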
Keywords/Search Tags: missing term, attribute level lineage, uncertain data, result probability computation, data modeling