Font Size: a A A

Efficient algorithms for similarity and skyline summary on multidimensional datasets

Posted on:2008-09-01Degree:Ph.DType:Thesis
University:University of MichiganCandidate:Morse, Michael DavidFull Text:PDF
GTID:2448390005968353Subject:Computer Science
Abstract/Summary:
Efficient management of large multidimensional datasets has attracted much attention in the database research community. Such large multidimensional datasets are common and efficient algorithms are needed for analyzing these data sets for a variety of applications. In this thesis, we focus our study on two very common classes of analysis: similarity and skyline summarization. We first focus on similarity when one of the dimensions in the multidimensional dataset is temporal. We then develop algorithms for evaluating skyline summaries effectively for both temporal and low-cardinality attribute domain datasets and propose different methods for improving the effectiveness of the skyline summary operation.; This thesis begins by studying similarity measures for time-series datasets and efficient algorithms for time-series similarity evaluation. The first contribution of this thesis is a new algorithm, called the Fast Time Series Evaluation (FTSE) method, which can be used to evaluate similarity methods whose matching criteria is bounded by a specified epsilon threshold value. We then show that FTSE can be used in a framework that can evaluate a rich range of epsilon threshold-based scoring techniques which we call the Sequence Weighted Alignment (Swale) method.; The second contribution of this thesis is the development of a new time-interval skyline operator, which continuously computes the current skyline over a data stream. We present a new algorithm called Lookout for evaluating such queries efficiently, and empirically demonstrate the scalability of this algorithm. In addition, we also examine the effect of the underlying spatial index structure when evaluating skylines. Whereas previous work on skyline computations have only considered using the R*-tree index structure, we show that for skyline computations using an underlying quadtree has significant performance benefits over an R*-tree index.; Current skyline evaluation techniques follow a common paradigm that eliminates data elements from skyline consideration by finding other elements in the dataset that dominate them. The performance of such techniques is heavily influenced by the underlying data distribution. The third contribution of this thesis is a novel technique called the Lattice Skyline Algorithm (LS) that is built around a new paradigm for skyline evaluation on datasets with attributes that are drawn from low-cardinality domains. LS continues to apply even if one attribute has high cardinality.; The utility of the skyline as a data summarization technique is often diminished by the shear volume of points in the skyline The final contribution of this thesis is a novel scheme called the Skyline Point Ordering (SPO) which remedies the skyline volume problem by ranking the elements of the skyline based on their importance to the skyline summary, allowing for the most important skyline points to appear first in the skyline result set and providing monotonic top-k skyline queries that simplify the skyline results. We describe two new algorithms, the Skyline First (SF) and the Coverage First (CF), for ranking the skyline points in a dataset on their summarization importance.; Collectively, the techniques described in this thesis present efficient methods for two common and computationally intensive analysis operations on large multidimensional datasets.
Keywords/Search Tags:Multidimensional datasets, Skyline, Efficient, Similarity, Thesis, Techniques, Common
Related items