Font Size: a A A

Models and operators for continuous queries on data streams

Posted on:2007-11-29Degree:Ph.DType:Dissertation
University:University of California, Los AngelesCandidate:Law, Yan NeiFull Text:PDF
GTID:1458390005984527Subject:Computer Science
Abstract/Summary:
A new generation of data-intensive applications is emerging for managing and querying information that, rather than residing in databases, flows continuously through the network in the form of massive data streams. Hence, there is much research work on designing Data Stream Management Systems, and the approach favored by many research projects consists in extending database languages and technology for data streams. However, the new computational environment brings significant research challenges in areas such as query languages, query processing, and advanced applications.; In this dissertation, we first focus on the limitations of relational languages in expressing continuous stream queries. A main limitation follows from the fact that only nonblocking operators can be used in continuous queries, which makes relational languages incomplete on these queries. To address this problem, we investigate user-defined aggregates natively defined in SQL itself, and prove that these make SQL (i) Turing-complete on stored data, and (ii) complete on data streams. Furthermore, we illustrate the effectiveness of the proposed extensions on complex applications involving time-series queries, and mining queries.; For advanced applications, we focus on data-stream mining algorithms, which must now be redesigned to make lighter demands on resources and display greater adaptability than those on stored data. In this dissertation, we introduce ANNCAD, which uses multi-resolution data representation to classify new test points using the nearest-neighbors principle. The incremental property and very fast update speed make ANNCAD very suitable for mining data streams. Our experiments show that ANNCAD is adaptive and works well in many applications, including image recognition and censor surveying.; We then study the problem of stream query processing. We propose a load-shedding technique for multi-join, called MSketch, which makes decisions based on the productivity of tuples, rather than only the content of the joined pair stream. A thorough study shows that MSketch outperforms other existing algorithms. Finally, we propose general techniques for optimizing the accuracy of window aggregates, statistical aggregates and mining queries in the presence of sampling. Our method incorporates prior knowledge into an error model that is used to reduce the uncertainty introduced by sampling. We also extend the method to adjust to concept shifts.
Keywords/Search Tags:Data, Queries, Applications, Continuous
Related items