Font Size: a A A

The Study On Frequent Patterns Mining And Data Predicting Over Data Streams

Posted on:2009-05-25Degree:DoctorType:Dissertation
Country:ChinaCandidate:H ChenFull Text:PDF
GTID:1118360275470846Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
In the past few years, data streams widely appeared in many applications, such assensor network, financial data management, network monitoring, web log analysis and thecommunication data online analysis. Because the size of a data stream is usually rather hugeand increases quickly, it is impossible to keep all the stream data due to the storage spacelimitation, which brings much trouble for data stream management. Additionally, becauseof the continuity and ?uidity of stream data, the knowledge embedded in a data stream ismost likely to change as time goes by. In many data stream applications, it is much moreinteresting to mine the tendency of the knowledge embedded in a stream than to mine theknowledge itself. Researchers hope to mine the knowledge in a recent sliding window of anonline data stream.Mining the frequent patterns over an online data stream is an important work in manydata stream applications. For example, in network monitoring, the frequent patterns corre-sponding to excessive traffic might be an indication for network attacking. In sails transac-tions, the frequent patterns always relate to the top selling products in a market and possiblytheir relationships. In sensor network data management, mining the frequent patterns ofstream data can help to estimate the missing values. However, because of the characteristicsof stream data, the traditional mining algorithms for statistic databases can not be directlyapplied to mine the frequent patterns over an online data stream, so it is necessary to designnew algorithms for mining a data stream. For a data stream mining algorithm, it is essentialto incrementally handle the continuously arriving stream data by scanning the stream onlyonce, and it is important to maintain the latest data synopsis of a data stream with least cost.Additionally, as a new stream datum enters into the sliding window, the oldest datum be-comes obsolete and should be removed from the window. In order to eliminate the impactof obsolete stream data on the mining result and to improve the correctness, a data streammining algorithm should also periodically delete the information of obsolete stream datafrom the data synopsis.The method for mining the recent frequent patterns of a data stream uses a pattern tree,RFP-tree, to incrementally keep the patterns of newly arriving stream data, and it also pe-riodically prunes the tree to delete the branches which register the obsolete or infrequent patterns from the tree. The RFP-tree conservatively keeps the complete frequent patternsin the sliding window of size N over an online data stream at the cost of maintaining thecontents of the recent stream data whose size is no more than 2N. Additionally, the approx-imate frequencies of patterns could be calculated by a conservative strategy. Because theapproximate frequencies of patterns are no less than their actual frequencies, the coverageof the mining result is affirmatively to be 100 percent.In order to adaptively mine the frequent patterns in a sliding window in a variable size,the method for mining the frequent patterns in an arbitrary sliding window uses a patterntree named SW-tree to incrementally maintain the patterns of the stream data in the slidingwindow. Additionally, as time goes by, it differentiates the patterns in the newly generatedstream data from those in the historic stream data by gradually decaying the weight of thestream data with the time decay model. In order to ensure the coverage and precision of amining result, the method also analyzes the in?uence of time decay model on the frequen-cies of patterns, and gives the bound values for the decay factor f that could guarantee thecorrectness of a mining result. As the size of a sliding window changes, the correctness ofmining the new sliding window could be guaranteed by choosing another appropriate valuefor the decay factor again.In the data stream applications, the patterns contained in stream data are variable as thestream data varying, so it is impossible to give an appropriate value for the minimal supportthreshold that is used to mine a data stream. The method for mining the Top-K frequentpatterns in a sliding window provides a much simpler method to mine the frequent patternsin a data stream. In this method, it is unnecessary for the user to give the minimal supportthreshold but the size of expected frequent patterns set. This method estimates the support ofthe Kth frequent pattern in the sliding window using Chernoff Bound, and uses it to maintainthe potential frequent patterns in the sliding window. Based on the theoretical analysis, theChernoff Bound can also guarantee the correctness of mining result with probability.It is also an important work to study the varying tendency of the historic stream data ina data stream and predict the possible values of the stream data in the future time window.By studying the value changing of the stream data in a real data stream, the predictivemethod based on Markov first maps the stream data space whose size might be infinite intoa finite stream data state space, and replaces the stream data sequence with a stream datastate transmission sequence. By keeping the statistical knowledge of the state transmissionsequence using a diagraph named SSTD, the state transmission probability matrix could be obtained from the diagraph, and then the possible values of the stream data in the futuretime windows can be predicted using Markov Chain Model.
Keywords/Search Tags:Data Stream, Stream Data Mining, Sliding Window, Frequent Pattern, Top-k Frequent Pattern, Predictive Inquiry
PDF Full Text Request
Related items