Data mining techniques for frequent itemsets: Construction and analysis

Posted on: 2004-08-02
Degree: Ph.D
Type: Thesis
University: State University of New York at Albany
Candidate: Ramesh, Ganesh
Full Text: PDF
GTID: 2468390011476111
Subject: Computer Science
Abstract/Summary:
Data mining, or Knowledge Discovery in Databases (KDD), has emerged as one of the most promising areas of database research over the past decade. This thesis combines constructive and analytic approaches to address three major issues that directly impact data mining research: access, feasibility, and scalability. It applies these solutions to the classic problem of frequent itemset mining.

Databases are stored in various formats, which permit different data access methods that affect the efficiency of the mining task. One important issue in mining is to bridge the gap between mining techniques and database management systems. The first part of this thesis evaluates indexing and data access methods for frequent itemset mining. We systematically compare representative mining approaches using various database formats and analyze the impact of these formats on the mining method's performance and storage overhead.

The performance of itemset mining methods is data dependent and is sensitive to the length distribution of the mined patterns. Because itemset distributions differ between real and synthetic datasets, many methods that report good performance on synthetic datasets perform poorly on real-world datasets. In addition, current synthetic datasets are limited in their ability to represent real-world itemset distributions. In the second part of this thesis, we characterize feasible distributions of frequent and maximal frequent itemset collections by providing tight bounds. We also present a constructive technique for synthetic database generation.

One common approach to mining massive databases is to trade off accuracy for efficiency through random sampling. The third part of this thesis presents a novel general-purpose sampling framework for empirical evaluation of popular sampling techniques and defines a new general-purpose weighted accuracy measure that can be tuned to application-specific requirements. A systematic experimental study is presented to evaluate the impact of various control parameters on accuracy. In summary, constructive and analytic methods help guide algorithm developers and mining practitioners in decision making.
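As a rough illustration of the terms used in the abstract (frequent itemsets, maximal frequent itemsets, and the accuracy lost when mining a random sample), the following minimal Python sketch may help. It is not the thesis's method: the function names, the brute-force enumeration capped at `max_size`, and the unweighted Jaccard overlap used as an accuracy score are all assumptions made here for illustration, and the thesis's weighted accuracy measure is not reproduced.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset tuple: relative support} for itemsets meeting min_support.

    Brute-force enumeration, capped at max_size items (illustrative assumption).
    """
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def maximal(itemsets):
    """Keep only itemsets that have no proper superset in the given collection."""
    sets = [frozenset(s) for s in itemsets]
    return {s for s in sets if not any(s < other for other in sets)}


def sample_accuracy(transactions, min_support, fraction, seed=0):
    """Jaccard overlap between itemsets mined from a random sample and from all data.

    This unweighted score is a stand-in chosen here, not the thesis's measure.
    """
    rng = random.Random(seed)
    size = max(1, int(len(transactions) * fraction))
    sample = rng.sample(transactions, size)
    full = set(frequent_itemsets(transactions, min_support))
    approx = set(frequent_itemsets(sample, min_support))
    if not full and not approx:
        return 1.0
    return len(full & approx) / len(full | approx)


if __name__ == "__main__":
    # Hypothetical toy database of five transactions.
    db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
    freq = frequent_itemsets(db, min_support=0.5)
    print(sorted(freq))
    print(maximal(freq))
    print(sample_accuracy(db, min_support=0.5, fraction=0.6))
```

Because the enumeration is exponential in transaction length, a sketch like this is only suitable for toy inputs; it is meant to make the definitions concrete, not to suggest how mining would be done at scale.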
Keywords/Search Tags: Mining, Data, Frequent itemset, Methods, Techniques