Font Size: a A A

Research On Key Technologies In Data Stream Analysis

Posted on:2009-03-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:L SuFull Text:PDF
GTID:1118360278956579Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the fast development of hardware, network and communication technology, and the ac-celeration of comprehensive real world applications, there occurs a new kind of data type, namelydata stream, which exits in many fields, such as finance analysis, network data management, trac-ing moving objects, communication network monitoring, sensor network management, and etc.The traditional database query processing technology generally manage the static data, which arestored in the medium such as disk or memory, and is hardly suit for the unlimited, continuous,rapid and"one scan"data stream applications. So, the data stream applications provide higherrequirements for the data management and analysis. It is a major challenge in database researchfield that how to quickly find the valuable information over massive data streams.The research of data stream has been widely concerned by the academy and industrial field,and can be divided into two parts, data stream management and data stream analysis. Basedon the fruit and deficiency of existing works, this dissertation conducts an in-depth study whichfocuses on the four important issues of data stream analysis, outlier detection, skyline computing,subsequence matching and index structure. The main contents of this dissertation are as follows:1. Outlier detection on distributed data streams. Based on the comparison and analysis ofexisting outlier metrics, this dissertation extends the distance-based and density-based outlier def-initions which combine the kernel density estimation technique. It will face with two problems inoutlier detection over distributed data streams, how to improve the detection ratio of global outliersand reduce the network communication. On the base of the common star network structure, Dis-OutlierStreams algorithm is provided which has high efficiency and lower communication. Thenon-parameter kernel estimation technique can obtain the probability density function in sliding-window quickly. An exponential fading strategy is used to catch the evolution of data stream.Divergence Technology can reduce the overhead of network communication with the guaranteeof detection precision. In the implementation this algorithm takes full advantage of the powerfulsymbol and numerical computation in Matlab. theoretical analysis and experiments show that thisalgorithm can effectively reduce the communication overhead which has no relation with the sizeof sliding window, and has high performance and scalability.2. Sparse skyline computing over data stream. The average number of skyline set increasesexponentially with the number of objects and the dimensions. On the other hand, skyline set isseriously affected by the data distribution. The quick increase of skyline set decreases the qualityfor the online service and decision making applications. Based on summarization of the relatedwork for skyline computing, we propose a novel concept, called sparse skyline, which uses a skyline object to represent its nearby skyline neighbors within the acceptable difference (δ). Then,two algorithms are provided, the basic sparse skyline BSS algorithm and the extend ESS algorithm.BSS algorithm builds on the original skyline algorithm, and adopts an effective pruning techniqueto reduce the computation. ESS algorithm constructs a sparse skyline feature tree to organize thepoints in sliding window directly, which can adjust adaptively and progressively the quality ofthe sparse skyline set according to the correlations of data distribution. Comparing with the BSSalgorithm, The ESS algorithm has stronger manageability and higher efficiency.3. Subsequence matching on data stream. Subsequence matching on time series database hasbeen researched for many years, however, subsequence matching on data stream is still in earlystages of development. On the basis of analysis the existing matching metrics, we select the DTW(dynamic time warping) distance which can tolerate the noise and skew. It can been consideredfrom four sub-problems: best matching, range matching, optimal range matching, top-k optimalrange matching. Aiming to the high computation in time warping matrix, this dissertation providesa new FSM algorithm, which has lower space complexity and can process the stream data in realtime. It makes the best of similarity threshold to prune the redundant computation as early aspossible. Theoretical analysis and experiments with synthetic and real data show that this methodis faster than the existing algorithms, especially to the high dimensions and long query sequences,it only increases several bytes on the memory cost without the loss of precision.4. Index structure for data stream. Index technique is one of key techniques to improvethe performance of data stream query and analysis. This dissertation analyzes the prior R-Treevariants to support frequent updates on data stream, and designs an improved QDM-Tree whichconsiders how to improve the update performance and decrease the storage overhead all together.It also provides the corresponding algorithms for the insert, delete and query operations. Usingthe hash method replaces the high time-consuming operation– tree traversing. It also supports thelazy group deletion. It quantizes the tree nodes and updates the data in a"bottom-to-top"approach.The experimental results show that, comparing with the existing index tree variants, QDM-Treehas almost the same storage overhead and higher performance for updating and querying.In summary, this dissertation addresses the four key issues in data stream analysis, andpresents several high efficient solutions which analysis the overhead of computation, storage andcommunication in a comprehensive way. It is a promotion of stream data processing on boththeoretical study and practical applications.
Keywords/Search Tags:Data Stream, Data Stream Analysis, Outlier Detection, Skyline Computing, Subsequence Matching, Index Structure
PDF Full Text Request
Related items