Font Size: a A A

Research On Top-k Queries Optimizing Algorithm On Uncertain Dataset

Posted on:2014-11-05Degree:MasterType:Thesis
Country:ChinaCandidate:Y YuFull Text:PDF
GTID:2268330425991828Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Top-k query technology is used widely, which is to find out the highest k result according to the user-defined scoring function. In the traditional deterministic database, Top-k query has its clear semantics, and researchers have proposed various kind of optimized processing algorithm. However, with the development of data acquisition and data processing technology, in more and more application field uncertain data is discovered, such as Wireless Sensor Networks (WSN), RFID system, mobile computing, etc. Uncertain data is getting more and more attention from the academic and become a hot research issue.In traditional database, Top-k query processing only considers the order of scoring function value. However, Top-k query on uncertain data should consider both the scoring function value and the uncertainty. So, Top-k query technology on deterministic database can’t be straightly immigrated to the uncertain database. In previous work, researchers have proposed several Top-k query semantics on uncertain data. However, most of them doesn’t consider query optimizaiton problem in some specific semantics. In addition, the current uncertain data management and Top-k query processing method are always based on centralized database or dataflow. In practice, more uncertain data is derived from distributed systems. If centralized mthod for Top-k query processing is adopted to distributed uncertain dataset, It means each node must report all its data to the sink node, which will bring greate cost of communication, storage and time delay. Actually, Top-k query has a significant character:the result of Top-k query is at a small fraction of the overall dataset. So, in some distributed environment with strict resource constraint, centralized algorithm will result in great waste of communication and computation resources.Based on the analysis above, we can conclude that Top-k queries not only on centralized uncertain dataset but also in distributed environments are great challenges to the academic. In this thesis, we research on Top-k queries including its semantics and optimization problem on uncertain data. The main work is summarized as follows:Firstly, we propose the Determinate Minimum Scan Scope for U-Topk Algorithm (MSS4U-Topk). the scale of possible world was greatly reduced through the MSS4U-Topk algorithm.. In addition, as the preprocessing of the U-Topk, the MSS4U-Topk algorithm can determine the scan scope of the U-Topk, and further determine the size of the possible world. What’s more, it also provides important evidence for the selection of query algorithms.Secondly, we propose the optimized APT4U-Topk algorithm for U-Topk queries on the attribute-level uncertainty data, the APT4U-Topk algorithm and the possible world aggreagtion probability threshold concept are proposed. When the probability of possible world equals to the threshold, the subsequent calculation of possible world will less than the threshold. Therefore, we can terminate the APT4U-Topk algorithm, and find out the final answer quickly. Further, we apply the APT4U-Topk algorithm to the distributed environment, and propose the DAPT4UTop-k algorithm. DAPT4U-Topk algorithm can reduce the communication overhead effectively, because this algorithm avoids sending all local tuples of each node. But for some datasets, the sensornodes still need to send almost its all localdata to sink node, which leading to the large overhead of communication and the high time complexity.For the phenomenon that U-Topk query need to expansion all possible world for some datasets, and fails to return the result quickly, we propose the MPUTop-k algorithm in last part. The semantic of the algorithm is to return the Topk vector whose the probability is largest in the possible world.The MPUTopk algorithm has more value for the real applications because it don’t need to caliculate all probabilities of the possible world. And then, we apply the MPUTop-k algorithm to distributed system, and proposethe DMPUTop-k algorithm. DMPUTop-k algorithm can be applied to multiple hops distributed environment due to the result derived by the global MPUTop-k algorithm is same to that derived by the local MPUTop-k algorithm.Specially, we also prove that when the probability of possible world is not less than0.5, the result of MPUTop-k is equals to the result U-Topk. This conclusion provides an approximate method for U-Topk.This thesis has carried on the detailed processing illustration and algorithm descriptions, and it includes the necessary theoretical proof using to illustrate the correctness of algorithm. The experiment executed on the real dataset and simulated dataset, and verifies the performance of algorithm.
Keywords/Search Tags:uncertainty data, centralized, distributed, Top-k query, U-Topk query
PDF Full Text Request
Related items