Font Size: a A A

Research On Key Technologies Of Distributed Rank-aware Query Processing

Posted on:2007-02-14Degree:DoctorType:Dissertation
Country:ChinaCandidate:B DengFull Text:PDF
GTID:1118360215470555Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
In recent years, with the rapid development of the information technology, information resources have been greatly enriched. How to retrieve the small amount of the most valuable information from the massive data quickly has become a great challenge for the database field. In this background, there have been two new query branches raised in the database field in the past few years, which are the top-k query and the skyline query. The top-k query is to find the k objects with the highest aggregate values, according to a user-specified aggregate function, while the skyline query is to find the objects which are not"dominated"by the other objects. These two query methods provide novel ideas and methods for the query processing over the massive data. Therefore, they have received wide attentions, and become a hot topic in the query processing research field. As both the top-k query and the skyline query concern about the results with rank information, we give a uniform name for the top-k query and the skyline query as the rank-aware query in this dissertation.The traditional database query processing technology can only manage the static data, which are stored in the medium such as disk or memory, while in many applications, the data are often emerged as dynamic streams and not to be preserved. This data model is known as the data stream model, which has some special characteristics, such as the data stream are unlimited, continuous, rapid, real-time and so on. So the data streams cannot be managed by the traditional database management technology. Therefore, data stream management technology has become one of hotspots in the database field in recent years. It is now receiving more and more concerns that how to support the rank-aware query processing over data streams. To distinguish between the snapshot-style query processing of the database system and the continuous query processing of the stream system, in this dissertation, we call the query processing of the database system as the snapshot query processing, and that of the data stream system as the monitoring query processing.With the development of computer network technology, more and more data resources are emerged or stored in the networks. It has become an important problem that how to achieve the rank-aware query processing efficiently over the distribution data resources. Currently, the research of the distributed rank-aware query processing technology has made some valuable efforts which, however, are still in early stages of development overall, and are not mature in many ways.In this background, aiming to overcome some deficiency of the existing works, this dissertation conducts an in-depth study which focuses on distributed top-k snapshot query processing, distributed top-k monitoring query processing and distributed skyline snapshot query processing technologies. The main contents of this dissertation are as follows:1. Distributed top-k snapshot query processing. The network delay is a serious effect on the performance of distributed snapshot query processing. However, the existing works are lack of consideration over it. To address this issue, we propose a new method to achieve non-blocking top-k snapshot query processing. By accessing data sources asynchronously and producing results progressively, this method can produce the first few results quickly. Based on this method, we popose a non-blocking top-k snapshot query processing algorithm named PR algorithm and its improver APR algorithm. The theoretical analysis shows that, in the terms of the average network traffic, APR is better than the existing alternatives, and in the terms of the maximum network traffic, APR is the same to the existing alternatives. The experimental results show that, in the terms of the network traffic, the response time and the execution time, APR is superior to the existing alternatives.2. Distributed top-k monitoring query processing. It is the key issue of distributed top-k monitoring query processing over the multiple data streams that how to reduce the network traffic in the monitoring process. To address this issue, an in-depth analysis has conducted for the characteristics of distributed top-k monitoring, and we prove a theorem that the objects set which broken the constraints is the minimal set which are needed for the process of rebuilding the constraints. Therefore, we popose a method of rebuilding constraints based on the minimal constraints rebuilding set. In this method, the number of objects that need to be transmitted is minimal. Based on the method, a sum-oriented distributed top-k monitoring query processing algorithm MR is presented. The theoretical analysis show that the network traffic of the MR algorithm is independent of k. The experiments results show that the network traffic of MR is about 10 percentage to 50 percentage of that of the previous approach. In addition, MR algorithm and the previous approach only support the sum function as the aggregation function of top-k monitoring, while in the practical applications, the aggregation function can be an arbitrary monotone aggregation function. Therefore, based on the minimal constraints rebuilding set, a generic method for distributed top-k monitoring query processing is presented. Based on this method, we propose a distributed top-k monitoring algorithm named GMR algorithm, which supports arbitrary continuous strictly monotone aggregation function. The theoretical analysis show that the network traffic of the GMR algorithm is independent of k. The experiments results show that GMR reduces overall cost by an order of magnitude compared with the na?ve approach.3. Distributed skyline snapshot query processing. It is an important issue of distributed skyline snapshot query processing that how to reduce the network traffic in the query process when the number of nodes is large. To address this issue, we present a method, which reduces the network traffic by accessing data sources via phases, and propose a four-phase distributed skyline snapshot query processing algorithm named FDSL algorithm. The experiments results show that FDSL is superior to the existing alternatives in the terms of the network traffic when the number of nodes is greater than 4.4. Based on the studies of the key technologies stated above, we go on with the implementation issues of distributed rank-aware query processing engine. In the massive data processing middleware platform StarTPMonitor, we design and implement the distributed analysis query processing engine StarAnalysis, which supports distributed rank-aware query processing. The testing results show that StarAnalysis can support distributed analysis query on massive data effectively.To sum up, we present the well-evaluated solutions in this dissertation for some key issues in distributed rank-aware query processing. We believe that our contributions can make a nice groundwork for the future research and engineering on distributed rank-aware query processing both in theory and practice.
Keywords/Search Tags:Distributed Computing, Database, Rank-aware, Top-k Query, Skyline Query, Data Stream Management, Snapshot Query Processing, Monitoring Query Processing, Analytical Query, Query Processing Engine
PDF Full Text Request
Related items