Font Size: a A A

Data Provenance Management And Similarity Query Over Uncertain Data

Posted on:2012-01-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:M GaoFull Text:PDF
GTID:1488303356471334Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Appearing widely in various fields, inclusive of economy, military, logistic, fi-nance and telecommunication, et al., uncertain data has many different styles, such as relational data, semi-structure data, streaming data, and moving objects. accord-ing to scenarios and data characteristics. tens of data models have been developed. stemming from the core possible world model that contains a huge number of the possible world instances with the sum of probabilities equal to 1. However, the num-ber of the possible world instances is far greater than the volume of the uncertain database, making it infeasible to combine intermediate results generated from all of possible world instances for the final query results. Thus, some heuristic techniques, such as ordering, pruning, must be used to reduce the computation cost for the high efficiency.In this thesis, a comprehensive survey on the techniques for data management in uncertainty, and data provenance. However, traditional techniques cannot be adopted in uncertain data management because of some challenges in uncertain data management. Focusing on the challenges of uncertain data management and weak point of traditional techniques for it, we track the origin and value of uncertainty of data, evaluate the similarity of uncertain set for un-structured data, and study expected ranking top-k (shorted in ER-topk) query over uncertain data stream. The contributions of this thesis are summarized as follows:?Focusing on the uncertainty of data, we track the origin and value of uncer-tainty of data by how provenance at first. We propose an approach, named PHP-tree, to approximately model how-provenance upon probabilistic databases. we also show how to evaluate probability based on a PHP-tree. Compared with Trio style lineage, our approach is independent of intermediate results and can calculate the probability both cases of restricted and complete propagation of data provenance.?Based on uncertain set, we propose the expected set similarity operator based on the possible worlds model, over several mainstream similarity metrics, in-cluding Jaccard, Dice and Cosine. Our first work is to define the expected set similarity over probabilistic sets. Then we design exact and approximate algo-rithms for calculating them. In detail, we employ the dynamic programming to exactly calculate the expected set similarity in polynomial time and space consumption. Due to large cost both in time and space of exact algorithms, we also provide approximate solutions that are both time-and space-efficient with high approximate precision by Monte-Carlo.?Since diversity of set similarity of probabilistic sets, based on the possible worlds model, we propose the probability threshold set similarity operator, over several mainstream similarity metrics, including Jaccard, Dice and Cosine, to evaluate similarity of probabilistic sets. We also design exact algorithms for calculating them. In detail, we employ the dynamic programming to calculate the probability threshold set similarity in polynomial time and space consump-tion exactly. Furthermore, we propose probabilistic threshold similarity query algorithm. We provide a pruning rule in linear time and space for Jaccard, Dice, and Cosine. Due to large cost both in time and space of exact algo-rithms, we also provide approximate solutions that are both time-and space-efficient with high approximate precision by Monte-Carlo.?Based on the landmark model, we give an exact solution for ER-topk query over uncertain data stream. In our solution, all arriving tuples in the data stream are divided into two groups. One group contains candidate top-k tu-ples, i.e, the tuples having chance to belong to the query result, and the other contains the rest. We construct and maintain two structures, namely domGraph and probTree, to describe the two groups for efficiency. Since the linearity of expectation, our solution can provide answer of query in sub-linear time.In this thesis, our work focus on the query processing over uncertain data. Mainly work consist of data provenance management and similarity query over un-certain data. Finally, we verify the efficiency and scalability of our proposed algo-rithms by lot of experiments.
Keywords/Search Tags:Uncertain Data, Data Provenance, Similarity Query, Expected Set Similarity, Probabilistic Threshold Set Similarity, Data Stream, Top-k Query, Similarity Join Query
PDF Full Text Request
Related items