An Improved Method Of Monte Carlo Bayesian Classification

Posted on: 2005-06-25
Degree: Master
Type: Thesis
Country: China
Candidate: X Qin
Full Text: PDF
GTID: 2168360122491534
Subject: Circuits and Systems
Abstract/Summary:
With the development of information technology and the wide use of databases, more and more information is being accumulated, and how to extract interesting knowledge from it has become a serious problem. Knowledge discovery technology has emerged in response and has become an active research topic. Knowledge discovery in databases (KDD) finds valid, novel, latent, and understandable information. Data mining is the key step of KDD and draws on databases, artificial intelligence, statistics, and other fields.
Classification is an important part of data mining: it assigns data items in a database to a particular class by constructing a classification function or model (also called a classifier). The classification model can then be used to predict the classes of unlabelled objects. Unlike other classification methods, Bayesian classification is grounded in mathematics and statistics; its foundation is Bayes' theorem, which gives the posterior probability. Theoretically, it is the optimal solution when its assumptions are satisfied.
Monte Carlo is a method that approximately solves mathematical or physical problems by statistical sampling. Applied to Bayesian classification, it first obtains the conditional probability distribution of the unlabelled classes from the known prior probability. It then uses a sampler to generate, one by one, random draws that follow that distribution. Finally, it obtains the posterior probability distribution of each unlabelled class by analysing these draws. Because it is easy to obtain a random sample from a given distribution by running a suitably constructed Markov chain, MCMC (Markov chain Monte Carlo) is the most common Monte Carlo Bayesian method.
The MCMC method can reduce the time and space costs of data mining, but it is impractical for computation on massive datasets.
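The sampling idea described above can be illustrated with a minimal sketch. It is not the thesis's classifier: it assumes a toy one-parameter model (a Bernoulli rate with a uniform prior) and a random-walk Metropolis sampler, and the names `log_posterior` and `metropolis` and all numeric settings are hypothetical choices made for illustration only.

```python
import math
import random


def log_posterior(theta, successes, failures):
    """Unnormalised log posterior of a Bernoulli rate under a uniform prior."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return successes * math.log(theta) + failures * math.log(1.0 - theta)


def metropolis(successes, failures, n_draws=20000, step=0.1, seed=0):
    """Random-walk Metropolis chain whose stationary law is the posterior."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_posterior(theta, successes, failures)
    draws = []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, successes, failures)
        # Accept with probability min(1, posterior ratio); a proposal
        # outside (0, 1) has log posterior -inf and is always rejected.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(theta)
    return draws


# Assumed toy data: 30 successes and 10 failures.
draws = metropolis(successes=30, failures=10)
burned = draws[2000:]  # discard an assumed burn-in period
posterior_mean = sum(burned) / len(burned)
```

For this toy model the exact posterior is Beta(31, 11), so the average of the retained draws can be checked against 31/42. Note that every draw requires evaluating the likelihood over all the data, which is the cost that grows with dataset size.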
This thesis improves the MCMC method so that it can be applied to data mining on massive datasets. The proposed approach splits the dataset sample into two parts and restructures the scan into two loops, an inner loop and an outer loop: the scan of the dataset becomes the outer loop, and the scan of the draws from the posterior distribution becomes the inner loop. Furthermore, the thesis evaluates the sampling efficiency and the effective sample size, and uses particle filtering to make data mining on massive datasets more practical.
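One way to picture the two-loop structure is the sketch below. The abstract does not give the algorithm, so this is an assumption: posterior draws are obtained from the first part of the data, then a single pass over the second part (outer loop) reweights every draw (inner loop) by the likelihood of each observation. The Bernoulli model, the use of `random.betavariate` as a stand-in for MCMC output, and the helper names `reweight_by_scanning` and `effective_sample_size` are all hypothetical.

```python
import math
import random


def effective_sample_size(log_weights):
    """Kish effective sample size, (sum w)^2 / sum w^2, computed stably."""
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    return sum(weights) ** 2 / sum(w * w for w in weights)


def reweight_by_scanning(draws, remaining_data):
    """Outer loop over observations, inner loop over posterior draws."""
    log_weights = [0.0] * len(draws)
    for x in remaining_data:               # outer loop: one pass over the data
        for i, theta in enumerate(draws):  # inner loop: every stored draw
            p = theta if x == 1 else 1.0 - theta
            log_weights[i] += math.log(p)
    return log_weights


rng = random.Random(1)
# Assumed synthetic data: 400 Bernoulli observations with rate 0.7.
data = [1 if rng.random() < 0.7 else 0 for _ in range(400)]
first_part, second_part = data[:100], data[100:]

# Stand-in for MCMC output: exact draws from the Beta posterior given
# the first part only, under a uniform prior.
s1 = sum(first_part)
draws = [rng.betavariate(1 + s1, 1 + len(first_part) - s1) for _ in range(5000)]

log_w = reweight_by_scanning(draws, second_part)
top = max(log_w)
w = [math.exp(lw - top) for lw in log_w]
weighted_mean = sum(wi * t for wi, t in zip(w, draws)) / sum(w)
ess = effective_sample_size(log_w)
```

With this ordering each observation is read once, and the weighted draws approximate the posterior given the full dataset. The effective sample size shrinks as more data are absorbed; a particle filter would respond by resampling and refreshing the draws when it falls too low, a step this sketch leaves out.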
Keywords/Search Tags: Data mining, KDD, Bayesian classification, Monte Carlo