Font Size: a A A

Approximation algorithms for frequency related query processing on streaming data

Posted on:2008-02-12Degree:Ph.DType:Thesis
University:University of Alberta (Canada)Candidate:Deng, FanFull Text:PDF
GTID:2448390005968915Subject:Computer Science
Abstract/Summary:
With the emergence of more and more data stream applications, such as Internet traffic measurement, sensor network data monitoring, Web click and crawling stream analysis, financial data alert and so on, researchers realize that many data processing models and algorithms well-suited for traditional database applications are not applicable in those new streaming scenarios. To address the problem, novel data stream processing systems and algorithms have been proposed within database, theory and computer networking communities. Some of the work has led to commercial systems or algorithms which have been applied in industry. Streaming algorithms also have found their way into non-streaming environments where massive data processing is needed.; This thesis focuses on the algorithmic aspect of data stream processing, more specifically, approximation algorithms for answering frequency related queries on streaming data. Example-queries are "find the number of similar record pairs in a very large relational table", "identify URLs that appear for the first time in the crawling stream of a search engine" and "give a list of IDs which appear more frequently in a web click stream". Efficiently answering these queries with bounds on time and space costs is both important and challenging, because fast response is either required or desirable in many scenarios, and the available computing and fast storage resources are often very limited compared with the massive streaming data volume. Thus, approximation algorithms trading accuracy for computing or storage resources is a valuable option in these cases.; In this thesis, two types of data reduction techniques, namely sketching and sampling, are studied. Both techniques are suitable not only for answering frequency related queries over streaming data, but also have a wide range of other application areas. This thesis focuses on exploring these powerful techniques to answer frequency related queries under data stream scenarios. More precisely, based on well-known sketching and sampling techniques, this thesis proposes new data structures and one-pass approximation algorithms to answer membership queries, frequency queries, join/self-join size estimations, similarity join/self-join size estimations and result set size estimations for similarity searches. Both theoretical and experimental analysis show that our new techniques improve the state-of-the-art.
Keywords/Search Tags:Data, Stream, Approximation algorithms, Frequency related, Processing, Size estimations, Techniques
Related items