
Research On The Incremental Learning Of Bayesian Classifier And The Processing Of Missing Data

Posted on: 2006-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Lu
Full Text: PDF
GTID: 2168360155471500
Subject: Computer software and theory
Abstract/Summary:
With the development of database, data warehouse, and Internet technologies, data mining and knowledge discovery have attracted the attention of many researchers and experts and have developed rapidly. Classification is one of the important research topics in data mining; its goal is to find a classification function or classification model. The Bayesian network, as an effective model for knowledge representation and probabilistic reasoning, is a powerful graphical tool for decision analysis under uncertainty. In this thesis we first introduce the main classification methods in data mining and analyze their definitions and operation, with particular attention to Bayesian techniques.

A Bayesian network G = (Bs, Bp) is a directed acyclic graph annotated with probability tables, consisting of two parts: the network topology Bs and the local probability distributions Bp. It is grounded in Bayes' theorem, the maximum a posteriori (MAP) hypothesis, and Bayesian network theory. A Bayesian network used for classification is called a Bayesian classifier; it is a special form of Bayesian network in which the variables and their numbers of states are fixed in advance, the attribute nodes are observed, and the class node is unknown. Learning a Bayesian classifier includes structure learning, parameter learning, and MAP inference of the class node.

Current classifiers work effectively only under the precondition that the training and test datasets are complete, or that very few feature values are missing. In reality, most databases contain missing data for many reasons, and much of the information we can obtain is incomplete. Because the missing data may be correlated with the values of other attributes in the network, it carries useful information. Bayesian networks can perform numerical inference from prior knowledge and observed data, so the Bayesian method is a powerful tool for dealing with missing data.

The main work of this thesis is as follows:

(1) Generalizing and summarizing the theory of Bayesian networks and analyzing the development of current Bayesian classifiers, mainly the Naïve Bayesian classifier, the Tree Augmented Naïve Bayesian (TAN) classifier, and incremental Bayesian classification models; and analyzing the definition, causes, and processing methods of missing data.

(2) This thesis presents an incremental method named I-TAN, based on TAN and incremental learning. Incremental learning is an effective way to learn classification knowledge from massive data, and the Bayesian inference model is a natural choice for it because of its mathematical foundation and probabilistic representation, and in particular because it can make full use of prior information. In the TAN structure, the class variable is a parent of every attribute and each attribute may have at most one other attribute as a parent, which captures correlations among the attributes while keeping structure learning easier than for a general Bayesian network. We apply TAN to incremental classification through incremental Bayesian inference, updating the structure and parameters of TAN incrementally. The experimental results show that the algorithm is feasible and effective, and that I-TAN's classification accuracy is higher than that of NB and TAN on some datasets.
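As a rough illustration of the incremental update idea in (2), the following minimal Python sketch keeps sufficient-statistic counts for a fixed TAN structure and classifies by MAP. It is not the thesis's I-TAN algorithm (in particular, it does not re-learn the tree structure incrementally), and the class SimpleTAN and all names in it are hypothetical.

    # Minimal sketch, assuming a fixed TAN structure and discrete attributes;
    # not the thesis's I-TAN code.
    import math
    from collections import defaultdict

    class SimpleTAN:
        def __init__(self, parents):
            # parents[i] = index of attribute i's attribute parent, or None for the tree root
            self.parents = parents
            self.class_counts = defaultdict(int)    # N(c)
            self.attr_counts = defaultdict(int)     # N(i, a_i, pa_i, c)
            self.parent_counts = defaultdict(int)   # N(i, pa_i, c)

        def update(self, x, c):
            # Incremental parameter learning: fold one labelled instance into the counts.
            self.class_counts[c] += 1
            for i, a in enumerate(x):
                pa = x[self.parents[i]] if self.parents[i] is not None else None
                self.attr_counts[(i, a, pa, c)] += 1
                self.parent_counts[(i, pa, c)] += 1

        def predict(self, x):
            # MAP classification: argmax_c P(c) * prod_i P(a_i | pa_i, c), with Laplace smoothing.
            total = sum(self.class_counts.values())
            best, best_score = None, float("-inf")
            for c, nc in self.class_counts.items():
                score = math.log((nc + 1.0) / (total + len(self.class_counts)))
                for i, a in enumerate(x):
                    pa = x[self.parents[i]] if self.parents[i] is not None else None
                    num = self.attr_counts[(i, a, pa, c)] + 1.0   # Laplace-smoothed count
                    den = self.parent_counts[(i, pa, c)] + 2.0    # crude smoothing constant
                    score += math.log(num / den)
                if score > best_score:
                    best, best_score = c, score
            return best

    # Hypothetical usage: attribute 0 is the tree root, attributes 1 and 2 depend on it.
    tan = SimpleTAN(parents=[None, 0, 0])
    tan.update(("sunny", "hot", "high"), "no")
    tan.update(("rain", "mild", "normal"), "yes")
    print(tan.predict(("rain", "mild", "high")))

Updating counts in this way is what makes the learning incremental: each new instance only increments a handful of counters, so no pass over previously seen data is needed.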
(3) This thesis also presents a new method, named TAN-GS, for learning a Bayesian classifier from data with missing values. TAN-GS updates the missing data and the TAN model based mainly on the maximum weighted spanning tree and Gibbs sampling: it combines data sampling with the structure learning of TAN and then learns the Bayesian classifier incrementally. The convergence of Gibbs sampling ensures that the sequence of TAN structures becomes stable. Because of the special form of the TAN structure, the decomposability of the joint probability avoids the exponential complexity of standard Gibbs sampling and improves sampling efficiency. The experimental results show that TAN-GS can correct the dataset effectively and learn a satisfactory classifier structure and parameters.
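As a rough illustration of why the TAN factorization makes Gibbs sampling cheap, the following minimal Python sketch resamples one missing attribute value from its full conditional, which in a TAN involves only the attribute's parent, its children, and the class. The names gibbs_resample, p_attr, and domains are hypothetical, not the thesis's TAN-GS implementation, and the local conditional tables are assumed to be given.

    # Minimal sketch of one Gibbs step over a missing attribute in a TAN;
    # hypothetical names, not the thesis's TAN-GS code.
    import random

    def gibbs_resample(i, x, c, parents, domains, p_attr):
        # Full conditional in a TAN: P(a_i | rest, c) is proportional to
        # P(a_i | pa_i, c) * product over children j of P(a_j | a_i, c),
        # so each step touches only a few local tables, not the full joint.
        pa = x[parents[i]] if parents[i] is not None else None
        children = [j for j, p in enumerate(parents) if p == i]
        weights = []
        for a in domains[i]:
            w = p_attr(i, a, pa, c)              # P(a_i = a | pa_i, c)
            for j in children:
                w *= p_attr(j, x[j], a, c)       # P(a_j | a_i = a, c)
            weights.append(w)
        return random.choices(domains[i], weights=weights, k=1)[0]

    # A full Gibbs sweep would call gibbs_resample for every missing cell of every
    # incomplete instance, then re-estimate the TAN from the completed data and
    # repeat until the structure sequence stabilizes.

Because the cost of each step is linear in the attribute's domain size and its number of children, this avoids enumerating the exponential joint state space that a sampler over an unstructured distribution would require.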
Keywords/Search Tags:Bayesian Networks, incremental learning, Maximum weighted spanning tree, Gibbs sampling, missing data