
Study Of Data Stream Statistical Algorithms

Posted on: 2007-09-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G L Nie
Full Text: PDF
GTID: 1118360242461910
Subject: Computer application technology
Abstract/Summary:
In a growing number of information-processing applications, data arrives as continuous streams rather than as traditionally stored datasets. Application areas include network-traffic monitoring, computer-network security, data mining for e-commerce, sensor networks, financial monitoring, and many more. A data stream is modeled as a potentially infinite sequence of elements and differs from traditional data in four primary respects: (a) it is continuous; (b) its length is unknown or unbounded, so it is infeasible to simply load the arriving data into a traditional database management system (DBMS) or into main memory; (c) its arrival rate varies; and (d) each element can be accessed only once. These characteristics lead to three conclusions: operations on a data stream must process it in a single pass; statistical results are necessarily approximate; and data stream management systems must support continuous queries. Because traditional DBMS techniques cannot manage such data effectively, many researchers and institutes have been investigating data streams.

Statistics over a data stream reflect its current state, play an important role in decision support systems (DSS), and are also a foundation for data mining research. Traditional statistical methods are not suitable for data streams, so statistics over data streams must be studied directly. This thesis studies several problems in data stream statistics, including aggregation functions, hot-element finding, density estimation, and change detection.

First, this thesis studies aggregation functions over a data stream in a sliding window. The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually, and only the most recent N elements are used when answering queries.
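The exponential-histogram family of structures the thesis builds on can be illustrated with the classic bucket-merging scheme for approximately counting 1s among the last N elements. This is a minimal sketch of the standard technique, not the thesis's loose exponential histogram itself; the parameter names and the choice k = 4 are illustrative.

```python
from collections import deque

class ExpHistogram:
    """Approximate count of 1s among the last `window` stream elements.

    Buckets have power-of-two sizes; at most k+1 buckets of each size
    are kept, and the two oldest of an overfull size are merged.
    Larger k means more buckets and smaller relative error.
    """
    def __init__(self, window, k=4):
        self.window = window
        self.k = k
        self.t = 0
        self.buckets = deque()  # (timestamp of most recent 1, size), oldest first

    def add(self, bit):
        self.t += 1
        # Expire buckets whose newest element has left the window.
        while self.buckets and self.buckets[0][0] <= self.t - self.window:
            self.buckets.popleft()
        if bit != 1:
            return
        self.buckets.append((self.t, 1))
        # Restore the invariant: at most k+1 buckets of any one size.
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.k + 1:
                break
            i, j = same[0], same[1]              # two oldest buckets of this size
            ts = max(self.buckets[i][0], self.buckets[j][0])
            self.buckets[j] = (ts, 2 * size)     # merged bucket, doubled size
            del self.buckets[i]
            size *= 2

    def count(self):
        if not self.buckets:
            return 0
        sizes = [s for _, s in self.buckets]
        # The oldest bucket may straddle the window edge: count half of it.
        return sum(sizes) - sizes[0] // 2
```

Only the oldest bucket's contribution is uncertain, so the estimate is off by at most half its size, which yields the logarithmic-memory, bounded-relative-error behavior the thesis exploits.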
Based on the sliding window model, this thesis proposes a new data structure, the loose exponential histogram (LEH), together with an algorithm to maintain it. Using as little space as possible and taking the characteristics of data streams into account, LEH answers approximate sum queries over the stream; theory and experiments show that it is effective. Counting is a special case of summing, but it behaves differently. After studying the distribution characteristics of data streams, this thesis defines an evaluation function F and designs a framework based on LEH to estimate counts over a sliding window. Using logarithmic memory, the framework estimates the count over a sliding window of size N with relative error between 0 and ε. Theory and experiments show that the greater the value of F, the better the framework performs. For the max (min) function, this thesis maintains the maximum over the current sliding window with a chain structure and proposes a compression policy that effectively shortens the chain to reduce memory cost.

Second, this thesis presents two one-pass algorithms for another important problem in data stream statistics, hot-element finding, in which no hot element is missed. The first algorithm employs an exponential histogram to maintain element frequencies and periodically runs a compression step to delete unneeded elements. The space actually used on experimental data is significantly smaller than both the algorithm's worst-case bound and the observed space requirements of earlier algorithms, and the algorithm is well suited to skewed data streams. The second algorithm partitions the sliding window into sub-windows, maintains a summary data structure for each, and periodically deletes expired sub-windows; it outputs at most 1/ε + 1 elements.
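The abstract does not detail the sub-window summaries behind the "at most 1/ε + 1 elements" bound. For illustration, the standard one-pass counter-based method (Misra-Gries) achieves the same flavor of guarantee: every element occurring more than ε·n times is retained, and at most ⌈1/ε⌉ candidates are ever kept. This is a stand-in sketch of that classic technique, not the thesis's algorithm.

```python
import math

def misra_gries(stream, epsilon):
    """One-pass frequent-elements sketch with at most ceil(1/epsilon) counters.

    Every element occurring more than epsilon * n times survives in the
    output, and each reported count underestimates the true count by at
    most epsilon * n, where n is the stream length.
    """
    k = math.ceil(1 / epsilon)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Full: decrement every counter and drop those reaching zero.
            # This charges one occurrence of x against k other occurrences.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```

A second verification pass over the stream (or the window) can then prune the candidates down to the truly hot elements, which is why a "no hot element missed" guarantee suffices.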
This thesis further extends the second algorithm from constant-size to variable-size sliding windows and evaluates both algorithms experimentally.

Next, this thesis investigates density estimation over data streams. Taking the importance of recent data into account, it adopts the kernel method and proposes a density estimation algorithm that divides the stream into sub-windows and stores a small amount of information per sub-window, using as little memory as possible. By combining the sub-window information, the algorithm estimates the density of the stream in a timely manner. Theory and experiments show that the algorithm is accurate and efficient.

The subjects above reflect the current state of a data stream but do not describe how its statistics change. Finally, this thesis addresses change detection over data streams, which plays an important role in many stream-based decision support systems. Change is characterized by detecting the elements whose frequency difference between two adjoining windows exceeds a threshold. The stream within a single window is divided into several levels, each of which partitions all elements into groups; supersets are defined over the groups, and the sum of each group is maintained. After combining the sketches of the two windows, the algorithm finds the elements exceeding the threshold by binary search over the group structure. Theory and experiments show that the algorithm detects change accurately and efficiently.
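The sub-window density scheme can be sketched as follows, under one loud assumption: since the abstract does not say what "a little information" per sub-window is, this sketch summarizes each sub-window by a small uniform sample (reservoir sampling) and forms a Gaussian-kernel mixture over the surviving samples. All class and parameter names are illustrative.

```python
import math
import random

class StreamKDE:
    """Kernel density estimation over the most recent sub-windows.

    ASSUMPTION: each sub-window is summarised by a small uniform sample
    (the thesis's per-sub-window summary is not specified here).
    Expired sub-windows are dropped, so only recent data contributes.
    """
    def __init__(self, subwin_size=100, sample_size=10, max_subwins=5, bandwidth=0.5):
        self.subwin_size = subwin_size
        self.sample_size = sample_size
        self.max_subwins = max_subwins
        self.h = bandwidth
        self.subwins = []   # completed sub-window samples, oldest first
        self.current = []   # reservoir for the in-progress sub-window
        self.seen = 0       # elements seen in the current sub-window

    def add(self, x):
        self.seen += 1
        if len(self.current) < self.sample_size:
            self.current.append(x)
        else:
            # Reservoir sampling keeps a uniform sample of the sub-window.
            j = random.randrange(self.seen)
            if j < self.sample_size:
                self.current[j] = x
        if self.seen == self.subwin_size:
            self.subwins.append(self.current)
            self.current, self.seen = [], 0
            if len(self.subwins) > self.max_subwins:
                self.subwins.pop(0)  # expire the oldest sub-window

    def density(self, x):
        """Gaussian-kernel density estimate from all retained samples."""
        pts = [p for s in self.subwins for p in s] + self.current
        if not pts:
            return 0.0
        z = 1.0 / (len(pts) * self.h * math.sqrt(2 * math.pi))
        return z * sum(math.exp(-0.5 * ((x - p) / self.h) ** 2) for p in pts)
```

Memory is bounded by max_subwins × sample_size points regardless of stream length, and dropping whole sub-windows gives the recency weighting the thesis emphasizes.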
Keywords/Search Tags: data stream, statistical algorithms, approximation algorithms, real-time algorithms, synopsis structure, aggregation functions, hot elements, density estimation