Data Provenance Management And Similarity Query Over Uncertain Data

Posted on:2012-01-05

Degree:Doctor

Type:Dissertation

Country:China

Candidate:M Gao

Full Text:PDF

GTID:1488303356471334

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

Appearing widely in various fields, inclusive of economy, military, logistic, fi-nance and telecommunication, et al., uncertain data has many different styles, such as relational data, semi-structure data, streaming data, and moving objects. accord-ing to scenarios and data characteristics. tens of data models have been developed. stemming from the core possible world model that contains a huge number of the possible world instances with the sum of probabilities equal to 1. However, the num-ber of the possible world instances is far greater than the volume of the uncertain database, making it infeasible to combine intermediate results generated from all of possible world instances for the final query results. Thus, some heuristic techniques, such as ordering, pruning, must be used to reduce the computation cost for the high efficiency.In this thesis, a comprehensive survey on the techniques for data management in uncertainty, and data provenance. However, traditional techniques cannot be adopted in uncertain data management because of some challenges in uncertain data management. Focusing on the challenges of uncertain data management and weak point of traditional techniques for it, we track the origin and value of uncertainty of data, evaluate the similarity of uncertain set for un-structured data, and study expected ranking top-k (shorted in ER-topk) query over uncertain data stream. The contributions of this thesis are summarized as follows:?Focusing on the uncertainty of data, we track the origin and value of uncer-tainty of data by how provenance at first. We propose an approach, named PHP-tree, to approximately model how-provenance upon probabilistic databases. we also show how to evaluate probability based on a PHP-tree. Compared with Trio style lineage, our approach is independent of intermediate results and can calculate the probability both cases of restricted and complete propagation of data provenance.?Based on uncertain set, we propose the expected set similarity operator based on the possible worlds model, over several mainstream similarity metrics, in-cluding Jaccard, Dice and Cosine. Our first work is to define the expected set similarity over probabilistic sets. Then we design exact and approximate algo-rithms for calculating them. In detail, we employ the dynamic programming to exactly calculate the expected set similarity in polynomial time and space consumption. Due to large cost both in time and space of exact algorithms, we also provide approximate solutions that are both time-and space-efficient with high approximate precision by Monte-Carlo.?Since diversity of set similarity of probabilistic sets, based on the possible worlds model, we propose the probability threshold set similarity operator, over several mainstream similarity metrics, including Jaccard, Dice and Cosine, to evaluate similarity of probabilistic sets. We also design exact algorithms for calculating them. In detail, we employ the dynamic programming to calculate the probability threshold set similarity in polynomial time and space consump-tion exactly. Furthermore, we propose probabilistic threshold similarity query algorithm. We provide a pruning rule in linear time and space for Jaccard, Dice, and Cosine. Due to large cost both in time and space of exact algo-rithms, we also provide approximate solutions that are both time-and space-efficient with high approximate precision by Monte-Carlo.?Based on the landmark model, we give an exact solution for ER-topk query over uncertain data stream. In our solution, all arriving tuples in the data stream are divided into two groups. One group contains candidate top-k tu-ples, i.e, the tuples having chance to belong to the query result, and the other contains the rest. We construct and maintain two structures, namely domGraph and probTree, to describe the two groups for efficiency. Since the linearity of expectation, our solution can provide answer of query in sub-linear time.In this thesis, our work focus on the query processing over uncertain data. Mainly work consist of data provenance management and similarity query over un-certain data. Finally, we verify the efficiency and scalability of our proposed algo-rithms by lot of experiments.

Keywords/Search Tags:

Uncertain Data, Data Provenance, Similarity Query, Expected Set Similarity, Probabilistic Threshold Set Similarity, Data Stream, Top-k Query, Similarity Join Query

PDF Full Text Request

Related items

1	Research On Similarity Query Over Sequence Data
2	Trajectory Data Oriented Function Join And Similarity Search Algorithm
3	LSH-based Similarity Search Of Web Data
4	Aggregate Queries On Constrained Probabilistic Similarity Join Pairs
5	Efficient Similarity Join Over Probabilistic Data Streams Based On Earth Mover's Distance
6	Research On Complex Query Processing In Spatiotemporal Databases
7	Query Processing And Optimization Of Unstructured Data
8	Research On Similarity Query And Pattern Mining Algorithms Over Data Stream
9	Similarity Query And Optimization Over Data Streams
10	Research On Uncertain Time Series Similarity Matching