Itemset Distribution Mining And Its Applications In Pattern Analysis

Posted on: 2005-01-21
Degree: Master
Type: Thesis
Country: China
Candidate: J L Lu
Full Text: PDF
GTID: 2168360125465147
Subject: Computer software and theory
Abstract/Summary:
In recent decades, advances in technology have dramatically improved our ability to produce and collect data, and abundant data have accumulated as a result. Existing methods alone struggle to analyse these very large databases, a predicament often described as "drowning in data but starving for knowledge". People want new techniques and tools that can analyse these data automatically and intelligently. Data mining emerged in response to this challenge. Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from large amounts of incomplete, noisy, fuzzy, and random data. It is a hot topic in databases, artificial intelligence, statistics, and related fields, and it attracts a great deal of attention from experts, researchers, and information companies.
Association rule mining is an important topic in data mining and plays a key role in advancing the research, development, and application of data mining techniques. It has led to many significant technologies and methodologies for identifying association rules, such as Apriori-like algorithms, which focus mainly on algorithm scale-up and data reduction. Generally, an itemset is frequent in a database if its support is greater than or equal to a user-specified minimum support. This minimum-support constraint leads to two problems: (1) setting the minimum support is quite subtle, and (2) frequent pattern (itemset) mining often generates a large number of patterns (and an even larger number of mined rules).
Recognizing the limitations of the support-confidence framework, researchers have developed many techniques to address these issues, mainly including mining top-k frequent closed patterns[18], postponing constraints from mining to evaluation[21], a confidence-driven mining strategy without minimum support[22], and identifying frequent itemsets without a support threshold[23].
These approaches attempt, to some extent, to avoid specifying the minimum support, and they provide good insight into frequent pattern discovery from databases. To solve this problem, we propose a fuzzy strategy (FARDIMS) with database-independent minimum support, which provides a good human-machine interface. This strategy allows users to specify the minimum-support threshold without any knowledge of the database to be mined. Compared with the traditional algorithm, FARDIMS is more automatic and intelligent.
However, a fundamental problem remains in the application of frequent itemsets: how reliable the frequent itemsets are in a database, which must be known when marketing or making decisions. Frequent itemsets discovered from a database are all treated as equally important in applications, even though their supports may differ. This can lead to poor-quality decisions. Generally, an itemset with higher support should be more reliable for decision making than an itemset with relatively low support. However, we cannot measure how much more reliable one frequent itemset is than another if the distribution of itemsets in the database is unknown, because the support of an itemset alone does not reflect its reliability. This creates a significant need for reliability analysis. In this thesis, we propose the concept of a frequent itemset's reliability and design two methods, aimed at algorithm scale-up, for estimating it.
The main contributions of this thesis are as follows:
(1) We propose a fuzzy strategy (FARDIMS) to identify interesting itemsets without specifying the true minimum support.
(2) The distribution of itemsets in a database is identified. Mining the itemset distribution of a database helps measure how reliable a frequent itemset is in that database, which users need to take into account when making decisions in uncertain environments.
(3) The distribution of itemsets is approximated in this thesis, which gains the benefit of scalability.
(4) To evaluate the...
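As a small illustration of the definition used in the abstract (an itemset is frequent if its support is greater than or equal to the minimum support), the following Python sketch computes supports over a toy transaction list. The data, the function names `support` and `frequent_itemsets`, and the brute-force enumeration are all assumptions made for illustration; this is not FARDIMS or any algorithm from the thesis.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force search: keep itemsets whose support >= min_support.

    Hypothetical helper for illustration only; real miners such as
    Apriori-like algorithms prune the candidate space instead.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_size + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


# Toy database (assumed data, not from the thesis).
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

Changing `min_support` in this toy example changes which itemsets are reported, which is the threshold sensitivity the abstract describes as "quite subtle".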
Keywords/Search Tags:Data Mining, Association Rule, Fuzzy Control, Itemset Distribution, Sampling