
Data mining for large datasets: Intelligent sampling and filtering

Posted on: 2007-07-17
Degree: Ph.D
Type: Thesis
University: State University of New York at Albany
Candidate: Satyanarayana, Ashwin
Full Text: PDF
GTID: 2458390005488772
Subject: Artificial Intelligence
Abstract/Summary:
Data mining and knowledge discovery has emerged as one of the most promising areas of research over the past decade. However, in many real-world problems, mining algorithms have access to massive amounts of data. There are two fundamental challenges in dealing with these datasets. The first is to reduce the time the mining algorithm takes to execute, given the size of the input dataset. The second is to improve the performance of the learner in assigning unseen instances to a known class or set based on feature values.

The first part of our research shows that mining all the available data is prohibitive due to computational (time and memory) constraints. Much of the current research is concerned with scaling up data mining algorithms (i.e., improving existing mining algorithms for larger datasets) to reduce the running time. An alternative approach is to scale down the data. This part of the thesis addresses one approach to scaling down data, namely intelligent sampling, and applies this solution to the classic problem of handling large datasets. We study and characterize the properties of learning curves and integrate them with bounds such as Chernoff and Chebyshev in an effort to determine the smallest sufficient dataset size that obtains approximately the same accuracy as the entire available dataset. This part of the research focuses on selecting how many (sampled) instances to present to the mining algorithm.

The second part of our work aims at improving the performance of the underlying learner for large datasets by eliminating noisy instances from the training set, in order to build simpler models and to improve the predictive accuracy of the learned model. Many techniques have been proposed in the literature to address this problem. However, a clear understanding of why they work and what computation is being performed is missing. In this area of research, we first analyze how eliminating noisy instances leads to simpler and more accurate models. A Bayesian analysis is then performed over three prominent existing filtering approaches. Insights gained from understanding these techniques are used as a basis for presenting our novel general-purpose noise handling framework.
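
To make the sampling idea concrete, the following is a minimal sketch of how a concentration bound can turn a target accuracy gap into a sample size. It is not the thesis's procedure: the abstract names Chernoff and Chebyshev bounds but does not give formulas, so this sketch assumes the two-sided Hoeffding inequality (an additive Chernoff-type bound) and the Chebyshev inequality applied to a 0/1 accuracy estimate. The function names and the parameters epsilon (tolerated accuracy gap) and delta (tolerated failure probability) are hypothetical.

```python
import math


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Samples needed so that an empirical mean of [0, 1]-bounded values
    is within epsilon of its expectation with probability >= 1 - delta.

    Assumption: two-sided Hoeffding inequality,
    P(|mean - mu| >= epsilon) <= 2 * exp(-2 * n * epsilon**2).
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def chebyshev_sample_size(epsilon: float, delta: float,
                          variance: float = 0.25) -> int:
    """Same guarantee via Chebyshev's inequality,
    P(|mean - mu| >= epsilon) <= variance / (n * epsilon**2).

    Assumption: variance defaults to 0.25, the largest possible
    variance of a 0/1 (correct/incorrect) outcome.
    """
    return math.ceil(variance / (delta * epsilon ** 2))


if __name__ == "__main__":
    # Hypothetical targets: accuracy within 0.05, failure probability 0.05.
    eps, dlt = 0.05, 0.05
    print("Hoeffding:", hoeffding_sample_size(eps, dlt))
    print("Chebyshev:", chebyshev_sample_size(eps, dlt))
```

Under these assumptions the Hoeffding bound grows with log(1/delta) while the Chebyshev bound grows with 1/delta, so for small delta the Chernoff-type bound asks for far fewer instances. Neither bound depends on the size of the full dataset.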
Keywords/Search Tags: Mining, Data, Sampling