Font Size: a A A

Research On Algoritms To Process Skyline Query Over Data Stream

Posted on:2009-02-27Degree:DoctorType:Dissertation
Country:ChinaCandidate:S L SunFull Text:PDF
GTID:1118360272458855Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
Data stream is a sequence of digitally encoded signals used to represent information in transmission, which is generated from applications such as network routing, traffic and environmental monitoring, stock trading and electronic business activities etc. The study on streaming data is one of the hot topics among the database circle all over the world recently. As an important data mining technique, skyline query has been receiving considerable attention among database community. Skyline query returns a set of interesting objects which are not dominated by any other objects within the base dataset. Skyline query is of great significance for urban navigation, data mining visualization, multi-criteria decision making and preference query. As skyline query is proposed, techniques to process skyline query mushroom.Item within data stream can only be treated once because streaming data is of the nature of rapid arriving rate, large size and independence. Novel algorithm to process skyline query over data stream efficiently in a limited memory is still on demand. As to the topic of skyline query processing over sliding window, there already exist some methods. But some important issues about this topic are still open questions. Skyline query in full-space and only over centric, certain data stream is addressed in existing work. Besides, there exist critical defects about them. Existing approaches are improved fundamentally in this thesis at first. Furthermore, we set forward to solve some emerging issues about this topic. Attributes of object tend to be regarded differently; data stream in reality is always generated by a distributed manner; more and more attention has been attached to probabilistic data stream analyzing recently, but skyline computation over probabilistic data stream is still at large. These emerging requirements entail great challenges to skyline query processing over data stream. We are committed to subdue these challenges effectively and efficiently in this thesis, and our endeavors and contributions can be concluded as following:(1).We proposed a novel algorithm GridSky to process skyline query over sliding window efficiently. As to the algorithm processing skyline query over sliding window, its performance is decided by efficiency of the critical procedure to compute influence time of new arriving object and to eliminate the dominated objects. This problem is not well addressed in existing work, hence leads to low performance of existing approaches. More adaptive index basic grid, which is more applicable to data stream environment, is adopted by GridSky. Based on this index structure, pruning rule based on timestamp, pruning rule based grid and optimizing rule based on layered visit are developed to speed up GridSky progressively. Massive comparing experiments demonstrate that, GridSky outperforms its competitors by at least one order of magnitude as to the point of time complexity while consuming less memory space. Besides, GridSky is more scalable over data distribution, dimensionality and cardinality than its competitor.(2).We studied the issue of skyline query over distributed data streams, and proposed a communication-optimal algorithm BOCS. Query processing and data mining over distributed data streams have received considerable attention within database community recently. We are the first to address skyline query processing over distributed data streams where streams derive from multiple horizontally-spit data sources. Based on strategy of progressive refinement, skyline is computed incrementally in BOCS through two phases. In the first phase, local skylines on remote sites are maintained efficiently by GridSky and only skyline increments on remote sites are transmitted to coordinator at each time. Some skillful measures are developed to compute delta skyline efficiently in this phase. In the second phase, global skyline is obtained by integrating remote increments with the latest global skyline. Theoretical analysis is provided and shows that BOCS is communication optimal among all algorithms which are out of share-nothing strategy. Extensive experiments demonstrate that our proposal is efficient, scalable and stable.(3). We proposed an efficient algorithm StreamSubsky to handle subspace skyline computation over sliding window. Streaming data mining and skyline computation within subspaces have recently received much attention in data mining community. Previous work only sought to maintain full space skyline, we are the first to handle the problem of subspace skyline computation over sliding window. Some useful rules between full-space and subspace skyline are found by us in StreamSubsky, we propose an efficient top-town method to incrementally output the skyline objects within specified subspaces henceforth. Moreover, we present a set of optimizing heuristics to speed up critical procedures. Theoretical analysis and extensive experiments demonstrate that our algorithm is able to output the first result only taking a very short time compared with previous approaches, and our method are both efficient and scalable.(4).Inherent uncertainty and unreliability of data exist widely in emerging applications like information integration, RFID and sensor network based ubiquitous computation etc. The management of uncertain, probabilistic data stream has recently attracted considerable attention among researchers. Although previous work has well addressed skyline computation over static data or traditional certain data stream, skyline query over probabilistic data stream is still at large. To the best of our knowledge, we are the first to model this issue formally based on "possible worlds" semantics, furthermore an effective and efficient algorithm SOPDS is proposed to handle this issue. Compared with skyline query over traditional data stream, there are two basic different problems within this topic: how to efficiently confirm the status of new arrival and how to knock out unpromising objects as early as possible. Based on more adaptable index structure, a set of heuristic rules like probability bounding, progressive refinement, pre-elimination and selective compensation are developed to improve the comprehensive performance of SOPDS from the point of reducing both CPU overhead and memory cost. Massive back to back experiments demonstrate that our algorithm is of high overall performance, SOPDS is efficient, stable and scalable.In one word, four important issues about skyline query over data stream are well studied in this thesis and efficient algorithms to handle these issues are proposed respectively. Our approaches are great complement and improvement to existing skyline algorithm. Theoretical analysis shows our approaches are effective and efficient, extensive comparing experiments validate the theoretical observation.
Keywords/Search Tags:Data Stream, Data Mining, Skyline Query, Distributed Data Steam, Subspace, Probabilistic Data Stream
PDF Full Text Request
Related items