Font Size: a A A

Analysis On Sampling Complexity Of Association Rule Mining

Posted on:2005-04-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:C Y JiaFull Text:PDF
GTID:1118360185495664Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The explosive progress in networking, storage, and processor technologies is resulting in an unprecedented amount of digitization of information. With the rapid expansion of databases, mining all the data available in database or data warehouse in reasonable time is already beyond the reach of current system. People are immersed in more and more profuse data, but run short of knowledge. They must consider how to manage ever-growing data that can be too large to be processed at one time.As for association rule mining, a prototypical problem in data mining, it is costly or impossible to repeatedly pass over the large-scale databases. The speed and scalability of the algorithm are major concerns. One possible approach for dealing with huge amounts of data is to take random sample and do association rule mining on it, instead of the whole database, to obtain the approximate answers. But people need to make a trade-off between the accuracy of answers and the efficiency of algorithm when using sampling method. Unfortunately, it is an open, difficult and crucial problem on how to determine the sample size needed for a given accuracy, i.e., how to select sampling complexity.In this dissertation, focusing on the problem of selecting sampling complexity for association rule mining, we investigate some significantly relevant problems standing in need of solutions, such as quantifying and estimating sample error, improving accuracy of sampling methods, designing a novel approach to determine sampling complexity, analyzing the essence of PAC (Probably Approximate Correct) theory for association rule mining, generalizing the PAC framework in the field of association rule mining, using the research results of sampling complexity in new areas and so on.The main contributions of this dissertation are summarized as follows:1) A novel, flexible three tuples model is given to measure sample error. And a high efficiently computational method, interval estimation algorithm of cardinal error (IEACA for short), is proposed for estimating sample error. The empirical studies show that the error between sample set and original dataset can be obtained effectively and accurately by the model, also we can efficiently and exactly estimate the representative capability of sample set to original dataset by IEACA. The research of this part is the basic of this dissertation.2) Inspired by MRA (Multi-resolution Analysis) and Shannon sampling theorem, we give a new, adaptive, on-line, incremental, fast sampling method, multi-scaling sampling (MS for short), for quickly obtaining acceptably approximate answers satisfying the requirement of...
Keywords/Search Tags:Data mining, association rule, frequent itemsets, sampling complexity, sampling bound, sample error, cardinal error, multi-scaling sampling, sampling ensemble, Monte Carlo theory, bias-variance decomposition, majority voting, PAC theory
PDF Full Text Request
Related items