Font Size: a A A

Research Of Top-k Query Processing On Uncertain Data

Posted on:2015-05-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:W F LiFull Text:PDF
GTID:1318330467482944Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
With the development of data capture and process technology, uncertain datasets along with much noise, missing values, errors and inconsistencies are more and more. Traditional data management technologies are faced with unprecedented challenges. Uncertainty permitting becomes one of changes from small data to big data. Therefore, How to manage effectively uncertain datasets is becoming an urgent research task. Top-k query on uncertain dataset is one of most important query types. Even there are many researches about query semantic and query processing, there also many problem when applying. For example, there are too many diverse semantics, the efficiency of queries are too low and there are many difficulties when in-service use.About top-k queries on uncertain datasets, this paper is focuses on the problems how to capture data uncertainties, how to query unifily with various semantics and how to extend to top-p%queries. Related background and related work are reviewed in chapter1, and in chapter2to chapter5, each research problem is discussed in detail.1. Probability distribution capture in high volume data streamNowadays, the reduction of high volume timestamp-based data stream is lack of measurement of accuracy loss. We proposed a multi-treap parelly priority sampling method (MMP) with accuracy control, realized stochastic sampling reduction and capturing the approximation probability distribution with accuracy control. Form the characteristics of1-sample priority sampling principle, a type of modified treap structure which only reserved the right ridge to realize sampling is designed to improve efficiency. Meanwhile, reverse reservoir sampling method is designed to relize k-sample sampling flexibly. The relationship of precision of probability distribution and sample size is analysed to control the sample size of MMP algorithm. Therefore, MMP algorithm realized probability distribution capture with flexible precision control.2. Supporting various top-k queries on uncertain datasetsThe diversity of query semantics and process methods of top-k queries on uncertain datasets lead to wildly results and wasting of resources. A solution based on sharing position possibility distribution is proposed in this paper. Based on precomputing and storing the position possibility distribution of uncertain datasets, with some basic operations on position possibility distribution, top-k queries of value insensitivity on uncertain datasets like U-kRanks, Expected Rank, PT-k and Global-Topk can be calculated efficiently. However, the precomputing of position possibility distribution is the most time-comsuming procedure. A stochastic approximation method based on grouping strategy is proposed which can obtain position possibility distribution with high accarcy and with no need for unfolding the whole possible world space. Precomputiong position possibility distribution can separate the most time-cosuming compting and the real-time qurey, therefore, the response efficiency is improved. In addition, the basic operation on position possibility distribution is beneficial to extend to other semantics. Based on theoretical analysis and experimental result, the stochastic approximation method of precomputiong position possibility distribution based on grouping strategy can greatly improve the efficiency of value insensitive top-k queries on uncertain datasets. Meanwhile, the bad influence which was brought by "tie" can be avoid with this method.3. Top-p%quries on uncertain datasetsNowadays, when top-k query is extended to implement p percentile query on uncertain datasets, many unreasonable changes are exisit on semantics and process algorithms. Two semantics of p percentile query EU-pRank and R-PTp are defined and formulized based on analyzed the results of p percentile query and top-k query with various semantics. Under new semantics, existing process algorithms can not be used directly. Even these two semantics can be indirectly realized on position possibility distribution, it is a time-consuming approach. Though analyzing the recurrence relation of position possibilities between records, a diagonal recurrence method is proposed to avoid computing unnecessary position possibilities. When there are duplicated values, the mergence of duplicated values brings a shrinking of possible world number. An extending and merging method EU-pRank-EC and R-PTp-EC are proposed to realize p percentile query on uncertain datasets with duplicate values. Experments on real data and synthetic data proved, p percentile query on uncertain datasets with traditional method lead to analysis deviation, this situation can be rectitied by using EU-pRank or R-PTp. The query process method based on diagonal recurrence can improve efficiency greatly. When there are duplicate values, the extending and merging method EU-pRank-EC and R-PTp-EC can futher improve efficiency.4. Patent R&D decision analysis system based on uncertain factorsThere is no patent analysis research support analysis combined with market environment. The patent R&D decision analysis system in this paper not only considerate the result of patent content analysis, it also considerates and quantifies some market factors. Therefore, the usage of ranking on uncertain datasets in information retrival is verified. In R&D decision analysis, patent novelty, research difficulty and market expectation are quantified. If patent novelty and market expectation is regarded as the degree of returns and Research difficulty is regarded as the probability of the realization of returns, system can give decision support based on uncertain ranking criterion.
Keywords/Search Tags:top-k queries on uncertain datasets, timestamp-based data stream reduction, position probabilities distribution, multi-treap parallel priority sampling, value insensitivity, top-p%queries on uncertain datasets
PDF Full Text Request
Related items