Research Of Association Mining

Posted on: 2006-04-14
Degree: Master
Type: Thesis
Country: China
Candidate: M Ma
Full Text: PDF
GTID: 2168360155961259
Subject: Computer applications
Abstract/Summary:
With the rapid development of information technology and WWW applications, massive amounts of data are continuously collected in the databases of many application areas. These data contain many useful patterns, and finding the hidden, previously unknown information in them is very important for these areas. This is the task of data mining. Association rule mining is a form of data mining that discovers previously unknown, interesting relationships among attributes in large databases. Because its rules are simple in form and easy to understand, association rule mining has attracted great attention in the database, artificial intelligence, and statistics communities, and a great deal of progress has been made in its study.

As database sizes increase dramatically, many challenges have emerged in mining association rules from dense databases. Most traditional association rule mining algorithms are based on the Apriori algorithm, which is not an appropriate choice for mining dense databases. Apriori employs a bottom-up, breadth-first search that enumerates every single frequent itemset. Apriori-inspired algorithms perform well on sparse databases such as market-basket data, where the frequent itemsets are very short. However, on dense datasets such as telecommunications and census data, which have many long frequent itemsets, the performance of these algorithms degrades severely. There are two reasons for the degradation. First, these algorithms perform as many passes over the database as the length of the longest frequent itemset, which incurs high I/O overhead from scanning large disk-resident databases many times. Second, it is computationally expensive to check a large set of candidates by itemset matching, which is especially true when mining long itemsets.
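The level-wise search described above can be illustrated with a minimal sketch. It is not the thesis's implementation; the function name `apriori`, the absolute `min_support` count, and the in-memory list of transactions are assumptions made for the example. Each iteration of the `while` loop corresponds to one pass over the database, which is why long frequent itemsets imply many passes.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets.

    Illustrative sketch only: `min_support` is an absolute count and
    the whole database is assumed to fit in memory.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(level)

    k = 2
    while level:
        # Join step: merge frequent (k-1)-itemsets into k-itemsets and
        # prune any candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Counting step: one full scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        level = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(level)
        k += 1

    return frequent
```

On a dense database the candidate sets in the counting step grow quickly, so both the number of scans and the cost of the subset tests increase, which matches the two sources of degradation named above.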
To overcome the difficulty of mining association rules from dense databases, this thesis presents a novel vertical data representation called Diffset, which keeps track only of the differences between the tids (transaction identifiers) of a candidate itemset and those of its generating frequent itemsets. Diffsets drastically cut down the memory required to store intermediate results, which greatly improves the performance of these mining algorithms.

The relevance and usefulness of the association rules that traditional algorithms extract from dense databases is of primary importance, because in most cases real-life databases yield several thousand association rules with high confidence. Many of these rules are redundant, and the redundant rules make it difficult for users to find the useful information among all the rules produced.

The set of frequent closed itemsets is a subset of the frequent itemsets, and it is orders of magnitude smaller than the set of all frequent itemsets. Moreover, the frequent closed itemsets uniquely determine the set of all frequent itemsets and their exact frequencies, so all valid association rules can be derived from them. Frequent closed itemsets are therefore a generating set for all frequent itemsets and valid association rules, without any information loss. Furthermore, it is feasible to mine all frequent closed itemsets efficiently, even from dense databases. In this thesis, we systematically study algorithms for mining minimal non-redundant association rules based on frequent closed itemsets.

Finally, the development and widespread application of data warehouse technology, with its various advantages, provides new ideas for overcoming the difficulties of mining association rules from dense databases. In this thesis, we study in detail association rule mining based on OLAP. Chapter 6 gives a summary of this thesis.
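The diffset and closed-itemset ideas from the abstract can be sketched in a few lines. This is a hedged illustration, not the thesis's algorithm: the helper names `vertical_layout`, `diffset`, `support_from_diffset`, and `closed_itemsets` are hypothetical, and the sketch covers only the simplest case, where the diffset of an itemset XY is taken relative to the tidset of X.

```python
def vertical_layout(transactions):
    """Map each item to its tidset: the ids of transactions containing it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def diffset(tidset_x, tidset_y):
    """Tids that contain X but not Y, i.e. d(XY) = t(X) - t(Y)."""
    return tidset_x - tidset_y


def support_from_diffset(support_x, diff_xy):
    """support(XY) = support(X) - |d(XY)|; no tidset of XY is needed."""
    return support_x - len(diff_xy)


def closed_itemsets(frequent):
    """Keep the itemsets that have no proper superset with equal support.

    `frequent` maps frozenset itemsets to support counts, for example
    the output of any frequent-itemset miner (an assumption here).
    """
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < other and c == oc for other, oc in frequent.items())
    }
```

In a dense database most transactions that contain X also contain Y, so the diffset is usually much smaller than the tidset of XY, which is where the memory saving comes from. The closedness filter is quadratic in the number of frequent itemsets and is meant only to make the definition concrete.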
Keywords/Search Tags: Data Mining, Association Rule Mining, Frequent Itemsets, Frequent Closed Itemsets, Diffsets, Non-Redundant Association Rules, Data Warehouse, OLAP