Font Size: a A A

Algorithms for data stream systems

Posted on:2005-03-26Degree:Ph.DType:Thesis
University:Stanford UniversityCandidate:Datar, MayurFull Text:PDF
GTID:2458390008999319Subject:Computer Science
Abstract/Summary:
In a growing number of information-processing applications, data takes the form of “continuous data streams” rather than traditional stored databases. These applications share several distinguishing features like the need for real time analysis, huge volumes of data, and unpredictable and bursty data arrivals. Application areas include network-traffic monitoring, sensor networks, and many more. For such applications, it is not feasible to simply load the arriving data into a traditional database management system (DBMS) or load it into main memory, and operate on it there. The failure to use these solutions precludes random access to data elements that is assumed in the traditional main memory model of computation. This leads us to the data stream model of computation, in which some or all of the input data that is to be operated on, is not available for random access from disk or memory, but rather arrives as one or more continuous data streams . Using a small amount of main memory and small update time, we are required to give continuous answers to queries posed over data streams.; This thesis considers two sets of issues in the area of data-stream computation: algorithms and theory, and techniques for runtime resource allocation in systems designed for stream computations.; In the area of algorithms we present space- and time-efficient algorithms for a number of problems in the sliding-window model. In this model only the last N elements, where N is the window size, are relevant to gathering statistics or answering queries. We also propose the computation of Hamming norms over data streams, which is a more general problem than the well-studied distinct value estimation problem, and present efficient algorithms for Hamming-norm computation in the presence of inserts and deletes.; In the area of stream-systems we study two runtime resource allocation problems: operator scheduling and load shedding. We develop an operator-scheduling strategy called Chain that minimizes the total memory requirement of the system. We then study load shedding for aggregations queries, and present optimal algorithms that minimize the inaccuracy in query answers given a load constraint. We present a thorough experimental evaluation for both techniques.
Keywords/Search Tags:Data, Algorithms, Stream, Load, Present
Related items