Font Size: a A A

Pattern Mining Algorithms Over Data Streams

Posted on:2014-08-11Degree:DoctorType:Dissertation
Country:ChinaCandidate:L WangFull Text:PDF
GTID:1268330425477276Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Along with the rapid development of information science and the explosive accumulation of industrial data, data mining and analyzing is attracting more and more concern from all areas of industry. The demand for timely processing and analyzing of ongoing data makes it an important research topic that focuses on the mining techniques on data streams. Because data streams are massive, real-time and volatile, mining algorithms on data streams should be more efficient on both space and time. A number of valuable academic works have been performed on data mining over data streams, yet the improvement of mining efficiency is still a scientific hotspot that needs further research.This dissertation focuses on pattern mining algorithms over data streams, including frequent pattern mining over traditional datasets, frequent pattern mining over big data datasets, frequent pattern mining over uncertain data streams, and high utility pattern mining. Firstly we reviewed existing algorithms on frequent pattern mining and high utility pattern mining, with detailed introduction for Apriori and FP-Growth algorithm; then based on the study of classical mining algorithms and state of the art research works, performed in-depth researches on frequent pattern mining over traditional datasets, frequent pattern mining over uncertain datasets, and high utility pattern mining over datasets. Innovative research achievements of this thesis are presented as the following.(1) In the study on algorithms of mining frequent pattern over traditional datasets, this thesis proposes a tail-node data structure and a parallel algorithm with no more than two rounds of MapReduce. For the problem of frequent patter mining over data streams, tail-node&tail-node table are utilized in a sliding window approach to improve the time efficiency of data updating within the window, as well as the space efficiency of data structure maintenance; together with strategies to improve the time efficiency of within-window frequent pattern mining process, the overall performance of the mining algorithm is improved.For the problem of mining frequent patterns over streams of big data:firstly, one round of MapReduce is performed to find local frequent patterns as candidate itemset; secondly, a pruning strategy is utilized to prune the candidates; finally a second MapReduce is performed to count the support numbers of the remaining itemset; in most cases, only one round of MapReduce is needed for our algorithm to discover all frequent patterns.(2) In the study on algorithm of mining frequent pattern over uncertain transaction datasets, a tree structure with higher compression ratio is proposed to improve the efficiency of the mining algorithms. The probability values of transaction itemsets are stored in arrays and then mapped to a tree to reduce the number of nodes needed for maintaining transaction dataset; together with the sliding window approach, two new tree data structures are proposed to maintain the within-window data and sub datasets created in the process of mining, ensures no loss of information during the mining process, and significant improvement of time&space efficiency of the mining algorithm. Furthermore, this thesis also proposes a new weight based frequent pattern mining model; this model takes into consideration the item’s weight factor of a transaction itemset in the mining process, and includes high weight patterns into the mining result.(3) In the study on algorithms of mining high utility patterns, new mining algorithms are proposed without over-estimated utility values and without generating candidate itemsets.An innovative tree structure is proposed for maintaining transaction itemsets and their utility information; accurate utility values instead of over-estimated values can be retrieved from this tree, so high utility patterns can be discovered without generating candidates, and the algorithm efficiency is improved.Based on this algorithm and the sliding window method, together with a new tree structure for maintaining the within-window data, the proposed algorithm mines high utility patterns from the data stream without generating candidate itemsets by one scan of dataset. Compared with the state of art algorithm UP-Growth published in KDD and TKDE, the proposed algorithms can be one or two orders of magnitude better in terms of running time.
Keywords/Search Tags:Data Mining, Association Rules, Frequent Pattern, High Utility Pattern, Uncertain Dataset, Data Stream, Big Data
PDF Full Text Request
Related items