Font Size: a A A

Research On Index-Based Skyline Algorithms

Posted on:2008-03-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:H F ZhouFull Text:PDF
GTID:1118360215984423Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The advance of technology has made data collection easier and faster, thus causing many applications involved in mass memory. Based on these facts, a new operator, which was always seen as memory-based method before, had been intro-duced into database field since 2001.The operator, called for "skyline" operator, aims to find the points that are not dominated by any other point in the dataset. A point dominates another point if it is as good or better in all dimensions and better in at least one dimension. What is good or bad is not strictly defined, and totally depends on the user preference.Based on the definition above, the skyline operator can draw a picture of the object database, and help the users to find the interesting objects. Unfortunately, the obvious difference, between the definitions of the skyline operator and those in traditional query languages, makes it necessary to develop all kinds of new algo-rithms to implement the operator. Recently, skyline computing has been becoming a hot topic due to its potential applications in many scenarios, including traditional database, distributed database, data stream and even the categorial database and so on. Both the academic and the industrial have paid much attention to it.In this thesis, we focus mainly on exploring index-based skyline algorithms. Taking full use of indexing technology, we study skyline computing in any subspace of high dimensional dataset, the problem of sampling the large-size skyline and skyline computing in categorial database. The contributions of this thesis include:1.In this thesis, we propose a new skyline algorithm, called CSky(Count the skyline), which can efficiently evaluate the skyline in any subset of high di-mensional dataset, after considering the difficulties and surveying the existing skyline algorithms. The difficulties are that, in one hand, the other index-based skyline algorithms can not or can not efficiently evaluate the skyline in any subset of the high dimensional space; in the other hand, the non-indexed skyline algorithms can do it, but it is very inefficient. Based on these facts, CSky makes full advantage of the characteristics of a novel data structure, called Inverts. One of them is to sort the attribute values in each dimension and puts the most possible skyline points to the firstly scanned places, thus assuring that CSky can return the skyline points, progressively and effectively. At the same time, another characteristic of InvertS is that, the skyline of any subset in high dimensional space can be found by scanning the InvertS struc-ture at most one pass. Therefore, CSky can not only efficiently find the skyline like other index-based algorithms, but also do it in any subset of the whole query space. In addition, we extend CSky algorithm to process the variants of skyline computing.2.We develop an algorithm to sample the skyline of high dimensional dataset, called SSky(Sample the skyline), in order to prevent too large-size skyline puzzling the users. Consequently, a new concept, called for mild points set, is introduced in this thesis. It consists of those points, which minimal scores in all dimensions are the most among all the points of a dataset. And then, we prove the coincidence relation between one skyline point of high dimensional space and the most mild points of one subset, thus enabling it to transform skyline computing in high dimensional space to figuring out the most mild points in all subsets. The latter costs the linear time in the cardinality of that high dimensional dataset for fixed dimensionality k, with the exact complexity O(kn+k2~k){k << n). We prove the above conclusion in theory. Furthermore, we demonstrate that the sample skyline, returned from SSky, is fairly similar to the origin skyline.3.In this thesis, we study the evaluation of two kinds of skyline queries in cate-gorial database: the static and the dynamic skyline computing. One of them is the normal-meaning(also called static skyline computing) skyline query. To conquer it, we first prove one to one correspondence between one categorial dataset and one lattice structure. Based on the correspondence, we can map the data in query database to the points of the lattice structure, thus trans-forming skyline computing to traversing the lattice structure, which not only costs much less overhead, but also avoids scanning the database many passes. In my thesis, the processing method is named with LBS(Lattice-based skyline algorithm). Secondly, we consider another kind of skyline computing in catego-rial database in the condition that the order of the attribute values is dynamic with the user preference. Similarly, we excavate the relations between the dy- namic order and the adjusting of the lattice. Thus, we can map the procedure of changing the attribute order once to regulating the lattice structure one time. Furthermore, we can traverse the regulated lattice to gain skyline, in-stead of directly computing the skyline by comparison. Obviously, traversing the lattice costs much less overhead than the computing by comparison, which is proved to have the exact time, space complexity, for O(u2~u) and O(2~u+n) respectively.To sum up, based on InvertS and lattice structures, the thesis presents the in-tegral solutions from the definition of concept, the construction of data structure to the formulation of algorithms. CSky and SSky are great complementarity and improvement to existing skyline algorithms on high dimensional datasets. In the thesis, a new problem is expressed as skyline computing in dynamic-order categorial database. we explore a lattice-based LBS algorithm to attack it. Theoretic analysis and experimental results show that our index-based methods can solve their corre-sponding problems efficiently and outperform previous skyline algorithms in time and space complexity. At the same time, these algorithms have strong online algo-rithm characteristics. Hence, it could be concluded that the proposed algorithms can be applied to online data analysis for very large database.
Keywords/Search Tags:skyline, subspace, sample skyline, categorial database
PDF Full Text Request
Related items