Font Size: a A A

Research On Distributed Parallel Skyline Queries Over Uncertain Data

Posted on:2014-09-09Degree:DoctorType:Dissertation
Country:ChinaCandidate:X Y LiFull Text:PDF
GTID:1108330479979529Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Uncertain data as a special type of data widely exist in various applications like sensor networks, RFID networks, financial data analysis, location-based services and mobile object management. The skyline query over uncertain data plays an important role in the fields of information retrieval, data mining, decision-making and environment monitoring, which emerges as a hot research topic in database community. As the widespread popularity of distributed applications with uncertainty, the current applications of uncertain skyline queries gradually expand to the distributed applications. For the skyline query over geographically distributed uncertain data sets, the current challenge lies in returning the query results efficiently and progressively by exploring pruning strategies for optimizing the distributed query processing, in order to improve the efficiency of the distributed uncertain skyline queries. With the recent rise and development of uncertain data streaming applications, efficient processing the skyline queries over uncertain data streams becomes a serious problem for current skyline query research. The uncertain streaming data arrives continuously with high speed and the size of the sliding window user concerned is increasing, resulting in existing centralized query methods are hard to meet the requirements of the query efficiency in streaming applications. The advance and widely used of current distributed computing environments like data centers, provides favorable conditions for realizing distributed parallel skyline queries over uncertain data streams. For the skyline query over high-speed uncertain data streams, the current challenge lies in how to the parallel query processing by making full use of distributed computing environments, in order to improve the efficiency of the skyline queries over uncertain data streams. These research challenges demonstrate that, the study of the distributed parallel skyline queries over uncertain data has extremely important significance in real life, and becomes an inevitable trend of the current research. According to the aforementioned challenges, this dissertation deeply studies the techniques for the distributed parallel skyline queries over uncertain data sets and uncertain data streams, respectively.To address the problem of expensive communication overhead of existing distributed probabilistic skyline query methods that stems from the inefficient pruning, this dissertation proposes a grid-filtering based distributed probabilistic skyline query scheme named GDPS. The query process of GDPS includes the stages, i.e., the preprocessing stage based on grid summary pruning and the processing stage based on iterative pruning. In former stage, the local nodes split the data space into grid cells and collect the global grid summary based on the cells. Then, the local nodes can significantly filter the unqualified items that cannot be the results directly with the global grid summary. In the latter stage, on one hand, the coordinator uses the historical information to prune the candidate items as more as possible, and sends the items with maximum dominance capability to the local nodes. On the other hand, each local node constantly updates the temporary skyline probabilities of the items, and sends the tuples with the largest probabilities to the coordinator to enhance the pruning capability of the candidate items. Extensive experimental results demonstrate that, compared with existing methods, GDPS method not only can progressively return the complete correct results, but also can significantly reduce the communication overhead.To overcome the shortages that existing skyline query techniques are difficult to model and cannot efficiently process the distributed interval skyline queries, this dissertation proposes an iterative-feedback based distributed interval skyline query scheme named DISQ. The scheme first effectively models the problem of interval skyline query, and adopts a four-phase iterative-feedback mechanism to optimize the query processing. Specifically, each local node constantly updates the temporary interval skyline probabilities of the interval tuples according to the feedback information from the coordinator, and directly prunes the tuples with the probabilities lower than the threshold. Meanwhile, each local node selects and sends the most representative tuples and their probabilities information to the coordinator, in order to optimize the pruning efficiency. Besides, each local node chooses the best optimal number of tuples returned to the coordinator, in order to reduce the communication overhead further. As for the coordinator, on one hand, it constantly collects and selects the outstanding tuples to maximize the pruning efficiency of the feedback tuples. On the other hand, it makes use of historical information to prune the candidate feedback tuples, in order to improve the selection of the feedback tuples and reduce the number of feedback tuples. Extensive experimental results reveal that, compared with existing methods, DISQ method not only can effectively model the problem of distributed interval skyline query and progressively return the correct results, but also can greatly reduce the communication overhead.To solve the problem that existing distributed parallel processing model such as Map Reduce cannot support the parallel skyline queries over uncertain data streams, this dissertation proposes a sliding-window partitioning based distributed parallel model named WPS. To realize the parallel query processing, WPS partitions the global sliding-window into multiple local windows, and maps all the local windows to the corresponding compute nodes. Moreover, WPS adjusts the sliding granularity adaptively according to the relationship among the arrival rate, processing rate and the memory capacity, which is modeled and analyzed with the queuing theory. Besides, the size of each local window in WPS is set according to the processing capability of the compute node, in order to improve the load balance on each compute node. Specifically, in order to adapt to a variety of parallel processing environments and query requirements, WPS implements four kinds of streaming mapping strategies, including centralized mapping strategy, alternate mapping strategy, distributed mapping strategy, and angle partitioning strategy. The centralized strategy maintains the global window in each compute node, and does not need to communicate with others, which is more suitable for the bandwidth-constrained environments. The alternate strategy sequentially and completely updates the local windows in the compute nodes in the alternate manner, which can reduce the dynamic changes of the local windows and is more suitable for high-bandwidth network environments. The distributed strategy maps the arriving items to each compute node alternately and individually, which can maximize the parallel processing efficiency and achieve well load balancing performance. The angle partitioning strategy maps the compute nodes according to their angular coordinates, which can optimize the query efficiency by strengthening the dominance relationship between the items and is more suitable for high-bandwidth environments and the applications that do not require complete load balance. Extensive experimental results demonstrate that, compared with existing methods, the distributed parallel methods implemented with the WPS model can achieve much better query performance, and also can maintain high efficiency and load balancing performance against different update granularities, data dimensionality and sliding window sizes.To address the problem that existing methods for skyline queries over uncertain data streams are difficult to retrieve the skylines fast over large sliding-window in high-throughput streaming environments, this dissertation proposes a two-level optimization based distributed parallel skyline query scheme named PSS. In PSS method, the sliding-window partitioning based WPS model are used to implement the basic parallel query framework, and the optimizations among the compute nodes and within compute nodes are further adopted to improve the parallel query processing. Firstly, the streaming mapping strategy for the new arrivals is used to optimize the organization of the compute nodes, in order to establish the dominance relationship between the streaming items maintained in different compute nodes, which can significantly reduce the dominance tests between the items. Secondly, the grid index is adopted in the each compute node to optimize the computation, including the dominance test between the items, the calculation and updating for the skyline probabilities of the candidate items. Moreover, a Z-order curve based strategy is used to efficiently manage the large number of grid cells. Specifically, the monotony of the Z-order list is adopted to optimize the dominance test between the grid cells.The deal with the problem that the query results may be incorrect and interrupted, due to various faults occurred in the process of parallel executing the skyline queries over uncertain data streams, this dissertation proposes a replication based fault-tolerant distributed parallel skyline query scheme named FTPS. In the FTPS scheme, on one hand a parallel processing framework is developed based on the WPS parallel model and a two-level optimization strategy, in order to realize the parallel skyline query processing over uncertain data streams. On the other hand, various fault-tolerant strategies based on replication are implemented and integrated to the parallel query framework, in order to realize the fault-tolerant efficient parallel query processing. Specifically, the compute nodes in FTPS are also regarded as the replication nodes, and all the replicas in each node are arranged by levels. Thus, the lost data can be recovered effectively and quickly by selecting the replicas with high priority. Moreover, the processes of fault detection, data recovery and query recovery are throughout the whole query updating process, in order to reduce the communication and computation overhead and achieve rapid fault-tolerant parallel query processing. Extensive experimental results demonstrate that, FTPS method can effectively and fast detect the faults and recover the parallel query processing, not only has a high efficiency when no failures occur or a single node fails, but also can maintain a high processing rate and meet the query requirement when the number of failures increases.
Keywords/Search Tags:Uncertain Data, Data Streams, Skyline Queries, Distributed Parallel Queries, Fault-Tolerant Queries
PDF Full Text Request
Related items