Font Size: a A A

Fast algorithms for mining association rules and sequential patterns

Posted on:1997-01-10Degree:Ph.DType:Dissertation
University:The University of Wisconsin - MadisonCandidate:Srikant, RamakrishnanFull Text:PDF
GTID:1468390014483256Subject:Computer Science
Abstract/Summary:
This dissertation presents fast algorithms for mining associations in large datasets. An example of an association rule may be "(30% of customers who buy jackets and gloves also buy hiking boots." The problem is to find all such rules whose frequency is greater than some user-specified minimum. We first present a new algorithm, Apriori, for mining associations between boolean attributes (called items). Empirical evaluation shows that Apriori is typically 3 to 10 times faster than previous algorithms, with the performance gap increasing with the problem size. In many domains, taxonomies (isa hierarchies) on the items are common. For example, a taxonomy may say that jackets isa outerwear isa clothes. We extend the Apriori algorithm to find associations between items at any level of the taxonomy.; Next, we consider associations between quantitative and categorical attributes, not just boolean attributes. We deal with quantitative attributes by fine-partitioning the values of the attribute and then combining adjacent partitions as necessary. We also introduce measures of partial completeness which quantify the information loss due to partitioning. These measures can be used to determine the number of partitions for each attribute. We enhance the Apriori algorithm to efficiently find quantitative associations.; Finally, we consider associations over time, called sequential patterns. An example of such a sequential pattern may be "2% of customers bought 'Ringworld' in one transaction, followed by 'Foundation' and 'Ringworld Engineers' in a later transaction". We allow time constraints between elements of the sequential patterns, and also allow all the items in an element to be present in a sliding time window rather than at a single point in time. We present GSP, an algorithm for finding such patterns, based on intuitions similar to those for the Apriori algorithm.; All the above algorithms scale linearly with the size of the data (for constant data characteristics). These algorithms have been used in a variety of domains, including market basket analysis, attached mailing, fraud detection and medical research. We conclude the dissertation with directions for future work.
Keywords/Search Tags:Algorithm, Mining, Sequential, Associations, Patterns
Related items