
Study On Correlation Analysis And Mining In Real-Time Data Streams

Posted on: 2009-10-12; Degree: Doctor; Type: Dissertation
Country: China; Candidate: T C Zhang; Full Text: PDF
GTID: 1118360308479885; Subject: Computer software and theory
Abstract/Summary:
In real-time monitoring applications such as network monitoring, financial data analysis, sensor networks, and RFID, data streams arrive as rapid, time-varying, unbounded, and unpredictable sequences of data items that require online analysis and dynamic processing. Traditional database techniques cannot feasibly manage such continuous, rapid data, so processing data streams efficiently raises new research problems, and data streams have recently received considerable attention in various communities. The main goal of data stream analysis and mining is to find features and hidden relations in continuous data streams. In this area, similarity search is a popular research topic because further mining tasks such as clustering, classification, frequent item discovery, and novelty detection depend on it. In this dissertation, we choose the correlation coefficient as the measure for similarity search and propose a series of algorithms for fast correlation analysis over multiple data streams. We also build a new similarity model for event streams and introduce an event-oriented similarity search method. Our main contributions are as follows:

(1) A novel data reduction technique based on Boolean representation is proposed. Original sequences are transformed into Boolean sequences, each of which encodes complex trends as a long binary number, so analysis results can be obtained efficiently with simple Boolean operations.

(2) A hierarchical Boolean representation (HBR) algorithm is introduced for correlation analysis. Original sequences are first transformed into Macro-Boolean sequences, which reflect the main trends, and the Macro candidate set is obtained by computing the Macro-Boolean correlations. The series in the Macro candidate set are then transformed into Micro-Boolean sequences, which reflect detailed information, and the final correlation set is obtained by computing the Micro-Boolean correlations. Theoretical analysis proves that the Boolean correlation coefficient is very similar to the Pearson correlation coefficient for any pair of sequences.

(3) A period detection technique based on Boolean representation is given to find the periodic trends of each stream sequence. Theoretical analysis shows that the Boolean auto-correlation coefficient curve and the auto-correlation coefficient curve have almost the same local maxima, which correspond to the periodic points. The periodic information of the original stream sequences can therefore be obtained efficiently from the Boolean auto-correlation coefficient.

(4) An efficient correlation analysis algorithm, WACA, is proposed to adjust the sliding window size adaptively. Multiple streams are divided into several overlapping groups, and the mean of the periods of the streams in a group is chosen as that group's optimal sliding window size. The HBR technique is then applied for synchronized correlation analysis. When the stream environment changes, the streams are regrouped dynamically and new window sizes are chosen.

(5) A lag correlation analysis technique based on Boolean representation is introduced for rapid lag detection between stream pairs. As before, original stream sequences are transformed into Boolean sequences for efficient lag search. Theoretical analysis shows that lag correlation and Boolean lag correlation are functionally related and share the same monotonic intervals. The lag correlation coefficient curve and the Boolean lag correlation coefficient curve therefore have the same trends, and the lag time can be located using the latter alone.

(6) A data reduction and reconstruction method based on lag correlation in multiple data streams is presented for stream mining tasks. After lag correlation detection, all lagged streams are aligned, and principal component analysis (PCA) is then used to reduce the dimension of the multiple streams. In this way, the original data streams can also be reconstructed when some of the data are very important.

(7) An event stream similarity analysis model is built to address the processing of event-oriented stream data. We first analyze the features and application requirements of event streams. Based on the observation that similar event sequences share many event segments, we propose EOS, a novel similarity search algorithm based on the extent of sharing of event segments in event streams. By taking into account the frequency, weight, and location of event segments, this method can detect similar event pairs with a much smaller candidate set.

In summary, this dissertation studies several fundamental problems related to correlation detection in data streams and similarity search in event streams. Theoretical analysis and experimental results show that, compared with existing stream analysis techniques, the proposed methods achieve sufficient precision with low computational and space complexity for stream processing.
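
As a rough illustration of the ideas in contributions (1) and (5), the Python sketch below encodes each series as one long binary number and compares series with XOR and bit shifts. The abstract does not state the exact encoding or the exact Boolean correlation formula, so both are assumptions here: a bit is set when a value exceeds the series mean, and the score is the bit agreement rate rescaled to [-1, 1]. The function names (`to_boolean`, `pack`, `boolean_correlation`, `best_lag`) are hypothetical and are not taken from the dissertation.

```python
import math
from typing import List, Sequence, Tuple


def to_boolean(series: Sequence[float]) -> List[int]:
    """Assumed encoding: 1 where the value exceeds the series mean, else 0."""
    mean = sum(series) / len(series)
    return [1 if x > mean else 0 for x in series]


def pack(bits: Sequence[int]) -> int:
    """Pack a bit list into one integer, first element as the highest bit."""
    word = 0
    for bit in bits:
        word = (word << 1) | bit
    return word


def boolean_correlation(wa: int, wb: int, n: int) -> float:
    """Assumed measure: agreement rate over n bits, rescaled to [-1, 1]."""
    disagreements = bin(wa ^ wb).count("1")  # XOR marks the differing bits
    return 1.0 - 2.0 * disagreements / n


def best_lag(wa: int, wb: int, n: int, max_lag: int) -> Tuple[int, float]:
    """Return (lag, score) for the delay of b behind a, 0 <= lag <= max_lag < n."""
    best = (0, -2.0)
    for lag in range(max_lag + 1):
        m = n - lag                    # number of overlapping bits at this lag
        head_a = wa >> lag             # bits a[0 : n - lag]
        tail_b = wb & ((1 << m) - 1)   # bits b[lag : n]
        score = boolean_correlation(head_a, tail_b, m)
        if score > best[1]:            # strict comparison keeps the smallest lag on ties
            best = (lag, score)
    return best


if __name__ == "__main__":
    n = 64
    x = [math.sin(2 * math.pi * (i + 0.5) / 16) for i in range(n)]
    # y repeats x with a delay of 3 samples
    y = [math.sin(2 * math.pi * (i - 3 + 0.5) / 16) for i in range(n)]
    wa, wb = pack(to_boolean(x)), pack(to_boolean(y))
    print(boolean_correlation(wa, wb, n), best_lag(wa, wb, n, 8))
```

Packing the bits into a single integer means each comparison costs one XOR and one population count, whatever the window length, which is the kind of saving the Boolean representation aims at. A mean threshold is only one possible encoding; a rise/fall encoding of consecutive values would fit the same functions.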
Keywords/Search Tags: data stream, time series, boolean representation, auto-correlation, lag correlation, self-adaptivity, event stream, extent of sharing