Font Size: a A A

Research Of Preference Queries On Probabilistic Database

Posted on:2012-07-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:X W WangFull Text:PDF
GTID:1268330392973812Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Uncertainty is inherent in the real world and the precision of the query and analysisis the major concern in many modern applications like business data management,finance analysis, sensor, RFID, and GIS, etc. However, Traditional databases techniqueswhich are based on deterministic theory are not prepared to handle imprecise data. Theprobabilistic database techniques which are based on probability theory can manageuncertain data directly and emerges as a hot field in the database community. In theapplication layer of databases preference queries are used to fulfill the diverserequerment of individual users, among which skyline and top-k query are the mostimportant ones.This dissertation focuses on efficient and general solutions for preference querieson probabilistic database, and addresses several challenging problems of probabilisticskyline and probabilstic top-k queries. The main contributions can be concluded asfollows:(1) Non-indexing probabilistic skyline algorithm. Existing probabilistic skylinealgorithms are generally based on R-tree indices, which are not general enough.Non-indexing algorithms, which require no preprocessing or index structures, have thewidest applicability and are necessary complements to the indexing algorithms. Twonon-indexing algorithms that based on in memory R-trees and ordered list respectivelyare proposed. The basic idea is to estimate the skyline probability of future objects fromthe objects accessed before. Experimental results show that both algorithms outperformthe naive algorithm, and are complementary to each other.(2) Distributed probabilistic skyline algorithm. In practical applications, data isoften distributed across multiple nodes, and applications use distributed queries toobtain the holistic results. The query delay mainly depends on the communicationoverhead, and it is improtant to design communication efficient distributed algorithms.A framework based on summary sharing is first proposed, and then two specificalgorithms namely VOS and GRID are proposed respectively. The VOS algorithm usesvirtual objects as the summary of data distribution. In order to reduce additionalcommunication cost caused by summary sharing, a distance based compression schemeis further proposed. GRID algorithm is proposed to overcome the shortage of datadependency on data distribution of the VOS algorithm. In the GRID algorithm, eachobject is mapped to a cell of a regular grid, and a compression scheme of grid isproposed to reduce the additional communication cost of grid sharing. Experimentalresults show that, VOS and GRID outperform the existing algorithm in communicationcost, and the former is more suitable for independent like or high dimentional datasets,while the latter is suitable for anti-correlated datasets. (3) The probabilistic top-k indexing scheme. There are a collect of semantics of theprobabilistic top-k query because it needs to rank tuples taking into account both theprobability and scoring function. Different query algorithms are designed for each querysemantic, and there is lack of indexing schemes applicable to a class of important querysemantics. Two layer indexes that can be used by a collect of important probabilistictop-k queries are proposed. The first one is based on the concept of skyline, the secondone is based on dominate frequency, and is proposed to achieve better robustness.Morever, an efficient indexing building algorithm is proposed. Experimental resultsshow that both indexing schemes reduce the objects accessed by probabilistic top-kqueries, and the latter is more robust.(4) Probabilistic reverse top-k query. Above research topics all consider the queryevaluation problem from the perspective of users. The reverse top-k query is importantin business analysis because it helps manufactures learn preferences from the data oftop-k preference. Existing reverse top-k queries assume the underlying data is certain,however, in the real-world applications data is often imprecise because of dataintegration. A reasonable definition of probabilistic reverse top-k query is first proposed,and then an efficient query algorithm based on materialized views is presented. Thealgorithm partitions the data space to a grid, and stores the pre-computing query resultsof grid vertices. Subsequent queries use the materialized set preferences to avoid fullscan. Experimental results show that the query algorithm reduces the number ofpreferences need to calculated and scales well in cardinality of preference set.In summary, this dissertation focuses on the most important preference queries ofprobabilistic database, and provides efficient and general solutions. These works haveacademic and practical value for advancing the theory and practicability of probabilisticdatabase.
Keywords/Search Tags:Uncertain data, probabilistic database, preference query, top-k, skyline
PDF Full Text Request
Related items