Font Size: a A A

Algorithms Of Frequent Closed Itemsets Mining And Their Applications

Posted on:2010-08-13Degree:DoctorType:Dissertation
Country:ChinaCandidate:J B ChenFull Text:PDF
GTID:1118360302958559Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Frequent itemset mining is a popular and important first step in the analysis of data arising in a broad range of applications. From the points of view of both theories and applications, the research for the efficient algorithms to find the frequent itemset is very important.It is well known that frequent itemset mining often generates a very large number of frequent itemsets and rules, which reduces not only efficiency but also effectiveness of mining since users have to sift a large list of mined rules to find useful ones. There're two interesting alternatives, one of them is termed "Maximal Frequent Itemset Mining" , the other is "Frequent Closed Itemset Mining" . "Frequent Closed Itemset Mining" could remove all the redundancy meanwhile ensure that there's no information loss. In contrast, "Maximal Frequent Itemset Mining" is inherent in information loss."Frequent Closed Itemset Mining "is studies thoroughly from different points of view in this paper. It includes batch algorithms, incremental algorithms, approximate algorithms and applications of it in recommendation systems.1. Batch Algorithms:FCII is a very simple and efficient algorithm for mining frequent closed itemsets. This algorithm is based on an alternative, but equivalent representation of the target problem. Theoretically, the new algorithm could complete in linear time according to the number of frequent closed item sets. Empirically, the new algorithm outperforms the state of arts algorithms like CLOSET+, FPCLOSE, etc. In addition, the algorithm could deal with new added data in an incremental way. Finally, it contains more information than the traditional frequent itemsets algorithms which could be very important to mine the approximate frequent itemsets.2. Incremental Algorithms: A data stream is a continuous, huge, fast changing, rapid, infinite sequence of data elements. The nature of streaming data makes it essential to use online algorithms which require only one scan over the data for knowledge discovery. GC-Tree and GC-Tree 2.0 are online algorithms which work under data stream sliding window environments. GC-Tree maintains a tree structure in the memory. All closed itemset is represented as a node in the tree, and every path from the root to a node in the tree is an " order preserving" sequence of closed itemsets. Based on this data structure, GC-Tree is equipped with an efficient search space pruning technique. Theoretically, the computational complexity of GC-Tree is a quadratic function of the average size of the transactions; and the computational complexity of GC-Tree 2.0 is a linear function of the average size of the frequent closed itemsets. These theoretical analysis could be supported by the experiments on both synthetic and real-world data sets.3. Approximate Algorithms:For real applications, the data is usually subject to noise and measurement error. According to the recent theoretical reports, in the presence of even low levels of noise, large frequent itemsets are broken into fragments of logarithmic size. If the noise data could be detected effectively, the number of the output could be reduced dramatically. Based on this idea, the novel algorithm, AFCIM, is proposed to mine approximate frequent closed itemsets. AFCIM is derived from FCII which could be easily adopted as an incremental algorithm. AFCIM algorithm tries to detect noise data when explores the search space, once the noise data is found, AFCIM will fix it immediately and update the existing closed itemsets. The experiments demonstrate that AFCIM is more efficient than both the approximate frequent mining algorithms and the frequent closed itemsets mining algorithms.4. Application in Recommendation Systems:There are two inherent disadvantages of the traditional recommendation systems, dimensional curse and data sparsity. "Localization Approach" is proposed to deal with these problems: Firstly, discover the dense sub-matrix in the big sparse matrix by AFCIM algorithm. Secondly, build local model in the dense sub-matrix by the traditional recommendation system methods. Lastly, calculate the weighted average of the output of the local models to make a synthesized recommendation. 1-pLSI model is a "localized" variant of pLSI model, which is a traditional recommendation system model. The experiments demonstrate that 1-pLSI model could improve the precision of the predication effectively.
Keywords/Search Tags:Data Mining, Frequent Closed Itemsets, Association Rules, Data Stream, Approximate Algorithm, Recommendation System
PDF Full Text Request
Related items