Analysis On Sampling Complexity Of Association Rule Mining

Posted on:2005-04-01

Degree:Doctor

Type:Dissertation

Country:China

Candidate:C Y Jia

Full Text:PDF

GTID:1118360185495664

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

The explosive progress in networking, storage, and processor technologies is resulting in an unprecedented amount of digitization of information. With the rapid expansion of databases, mining all the data available in database or data warehouse in reasonable time is already beyond the reach of current system. People are immersed in more and more profuse data, but run short of knowledge. They must consider how to manage ever-growing data that can be too large to be processed at one time.As for association rule mining, a prototypical problem in data mining, it is costly or impossible to repeatedly pass over the large-scale databases. The speed and scalability of the algorithm are major concerns. One possible approach for dealing with huge amounts of data is to take random sample and do association rule mining on it, instead of the whole database, to obtain the approximate answers. But people need to make a trade-off between the accuracy of answers and the efficiency of algorithm when using sampling method. Unfortunately, it is an open, difficult and crucial problem on how to determine the sample size needed for a given accuracy, i.e., how to select sampling complexity.In this dissertation, focusing on the problem of selecting sampling complexity for association rule mining, we investigate some significantly relevant problems standing in need of solutions, such as quantifying and estimating sample error, improving accuracy of sampling methods, designing a novel approach to determine sampling complexity, analyzing the essence of PAC (Probably Approximate Correct) theory for association rule mining, generalizing the PAC framework in the field of association rule mining, using the research results of sampling complexity in new areas and so on.The main contributions of this dissertation are summarized as follows:1) A novel, flexible three tuples model is given to measure sample error. And a high efficiently computational method, interval estimation algorithm of cardinal error (IEACA for short), is proposed for estimating sample error. The empirical studies show that the error between sample set and original dataset can be obtained effectively and accurately by the model, also we can efficiently and exactly estimate the representative capability of sample set to original dataset by IEACA. The research of this part is the basic of this dissertation.2) Inspired by MRA (Multi-resolution Analysis) and Shannon sampling theorem, we give a new, adaptive, on-line, incremental, fast sampling method, multi-scaling sampling (MS for short), for quickly obtaining acceptably approximate answers satisfying the requirement of...

Keywords/Search Tags:

Data mining, association rule, frequent itemsets, sampling complexity, sampling bound, sample error, cardinal error, multi-scaling sampling, sampling ensemble, Monte Carlo theory, bias-variance decomposition, majority voting, PAC theory

PDF Full Text Request

Related items

1	Multicast-based Distributed Association Rule Mining Algorithm
2	The Research On Sampling For Data Mining
3	Based Sampling Of Distributed Association Rule Mining Algorithm
4	Research On Importance Sampling Technology Of Communication System In High-Speed Railway
5	Low Rank Approximation Of High Dimensional Matrix Based On Sampling And Its Application
6	The Development And Research Of Audit Sampling Systems On The Basis Of Data Mining Technology
7	Enhanced Sampling Theory And Methods In Monte Carlo Simulation
8	Monte Carlo Sampling Methods Based On Dynamics
9	A Technical Study On The Communication Signal Demodulation Based On PF
10	A Technical Study On The Communication Signal Demodulation Based On Pf