
Research Of Multiply Sectioned Integration Bayesian Classifier Model

Posted on: 2008-11-22    Degree: Master    Type: Thesis
Country: China    Candidate: M H Sun    Full Text: PDF
GTID: 2178360212997008    Subject: Computer software and theory
Abstract/Summary:
Data mining, a multidisciplinary subject involving databases, statistics, artificial intelligence, machine learning, and other fields, has developed rapidly in recent years. Its major task is to extract valuable knowledge and obtain more usable information from data. Within data mining, classification is one of the most important techniques: it analyzes large amounts of related data and builds classification models for problems in many application areas. The Bayesian network classifier is an important model in knowledge discovery and a very active research topic in many fields. However, because constructing its network structure is difficult and its time complexity is very high, it was not widely adopted as a classification algorithm until the emergence of the Naive Bayesian Classifier (NBC).

A Naive Bayesian classifier can be viewed as a strongly restricted Bayesian network classifier, but its attribute independence assumption prevents it from expressing the dependences among attributes found in the real world, which limits its classification performance. Much research has therefore focused on relaxing the independence assumption. The Tree Augmented Naive Bayes (TAN) classifier relaxes the conditional independence assumption by allowing the attribute variables to form a tree that represents their pairwise correlations, and it has proved highly effective and accurate. However, when many attribute variables have complex correlations with one another, the tree structure cannot reflect the real relations between the attributes, and accuracy drops.

Divide and conquer is one of the most effective strategies for handling large-scale problems. Its main idea is to divide a large problem into several sub-problems, each less complex than the original. We apply divide and conquer to Bayesian classification, dividing the classification task into several sub-modules.
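As a minimal sketch of the Naive Bayes idea summarized above (the function names and toy data are illustrative, not from the thesis): with discrete attributes, NBC estimates P(c) and each P(x_i | c) from counts, then classifies by argmax_c P(c) ∏_i P(x_i | c), which is exact only under the conditional independence assumption.

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class counts and per-class attribute-value counts."""
    class_counts = Counter(labels)
    cond = defaultdict(int)  # cond[(i, value, c)]: times attribute i == value in class c
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, v, c)] += 1
    return class_counts, cond

def predict_nb(row, class_counts, cond, n_values=2):
    """Return argmax_c P(c) * prod_i P(x_i | c), with Laplace smoothing."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = math.log(nc / total)
        for i, v in enumerate(row):
            score += math.log((cond[(i, v, c)] + 1) / (nc + n_values))
        if score > best_score:
            best, best_score = c, score
    return best
```

The log-domain sum avoids underflow when many attributes are multiplied, which matters for the large attribute sets discussed later.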
Finally, the model combines the Conditional Probability Tables (CPTs) of the sub-modules to obtain the classification result.

Chapter one introduces data mining technology, covering its background and current state of development, and then surveys the common classification models: decision trees, rough sets, genetic algorithms, neural networks, Bayesian learning, and so on.

Chapter two presents the basic Bayesian theory and analyzes the Naive Bayesian classification model, the Tree Augmented Naive Bayesian Classifier, and the Bayesian Network Augmented Naive Bayesian Classifier.

Chapter three first introduces data preprocessing, including data cleaning, data sampling, data transformation, and data reduction, and discusses the concepts and formulas of information entropy and mutual information. It then proposes a new algorithm named Feature Divide Based on Entropy (FDBE). Based on the physical meanings of information entropy and mutual information, we define the concepts of strong dependency, general dependency, and weak dependency, and analyze the correlations among the attributes qualitatively and quantitatively. The original attribute set is divided into several subsets that are conditionally independent of one another. The aim of the method is to prepare the data for the new Bayesian classification model proposed in chapter four.

Chapter four addresses the conditional independence assumption of the Naive Bayesian Classifier by proposing a new Bayesian classifier model, the Multiply Sectioned Integration Bayesian Classifier (MSIB). First, we review the modifications of the Naive Bayesian Classifier made by other researchers and introduce the idea of divide and conquer and the conditions for its use. We then present MSIB. The model is built as follows: sub-modules are formed from the decision attribute and the attribute subsets produced by FDBE; each sub-module learns its own CPTs; and the CPTs are integrated by a combination formula to obtain the final result.
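The abstract does not spell out FDBE's exact procedure, so the following is only a rough sketch of the underlying idea, under assumptions of my own: estimate pairwise mutual information between attribute columns, treat pairs above a dependency threshold (standing in for the strong/general/weak dependency cut) as linked, and take connected components of that graph as the approximately independent attribute subsets. The `threshold` parameter and the union-find grouping are illustrative choices, not the thesis's method.

```python
import math
from collections import Counter, defaultdict

def mutual_information(xs, ys):
    """I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) ), from samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

def partition_attributes(columns, threshold):
    """Group attributes whose pairwise MI exceeds `threshold` into
    connected components; components are then treated as independent."""
    k = len(columns)
    parent = list(range(k))
    def find(a):                       # union-find with path compression
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i in range(k):
        for j in range(i + 1, k):
            if mutual_information(columns[i], columns[j]) > threshold:
                parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(k):
        groups[find(i)].append(i)
    return list(groups.values())
```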
We also propose a new Bayesian classifier named the Mixed Naive Bayesian (MNB) classifier. The reference attribute, which selects its subset during data preprocessing, is treated as a parent node. Within each sub-module we use the TAN structure and add the reference attribute as a parent of the other attributes and a child of the decision attribute; this process yields the MNB model.

Finally, we compare the new models with the Naive Bayesian Classifier and the Tree Augmented Naive Bayesian Classifier. The experimental results show that the MSIB algorithm achieves higher accuracy than TAN when the data set has many attributes with strong relations between them, while the two classifiers are comparable in other situations; for most data sets, MSIB is also more accurate than NBC. Compared with NBC, MSIB relaxes the independence assumption: it assumes that the attribute subsets are independent of one another, while the attributes within each subset may be related to each other, with those relations expressed by MNB; NBC assumes that all attributes are mutually independent. Compared with TAN, MSIB is integrated from several sub-classification modules and can describe the relationships between attributes in more detail, whereas TAN allows most attributes only two parent nodes. Summing up the experiments and theoretical analysis, MSIB achieves the better classification effect.

As an important method for data mining, the learning of Bayesian classifiers still faces many open problems and technical difficulties spanning several domains (for example probability theory, information theory, and machine learning), and many questions remain to be researched. In studying the Multiply Sectioned Integration Bayesian Classifier model, we found several areas that deserve further investigation. At the end of this thesis, we present several future directions for our research.
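The abstract does not show MSIB's integration formula. Under the assumption FDBE is designed to provide, that the attribute subsets are conditionally independent given the class, the sub-module likelihoods simply multiply, so the combination step could be sketched like this (the interface is hypothetical):

```python
import math

def integrate_submodules(sub_scores, priors):
    """Combine per-class log-likelihoods from independent sub-modules:
    score(c) = log P(c) + sum_m log P(subset_m | c),
    where sub_scores is a list of {class: log-likelihood} dicts,
    one per sub-module, and priors maps each class to P(c)."""
    best, best_score = None, float("-inf")
    for c, p in priors.items():
        s = math.log(p) + sum(m[c] for m in sub_scores)
        if s > best_score:
            best, best_score = c, s
    return best
```

Each dict in `sub_scores` would come from one MNB sub-module's CPTs; a weighted variant of this sum is exactly the future direction raised at the end of the abstract.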
The algorithm currently handles only data sets of discrete attributes. Data sets with continuous attributes must first be preprocessed with discretization tools, which leads to a loss of information; in the future we will study how MSIB can handle continuous attributes directly. The algorithm also treats all sub-modules as equally important, although they have different influences on the decision attribute, so we are interested in how to add weights to the CPTs of the sub-modules. More research will be done on these questions in the future.
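The discretization step mentioned above could be as simple as equal-width binning, shown here as a generic stand-in rather than the thesis's specific tool; its information loss is exactly the limitation noted.

```python
def equal_width_bins(values, n_bins):
    """Discretize continuous values into n_bins equal-width intervals,
    returning a bin index per value (the last bin is closed on the right)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant column
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```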
Keywords/Search Tags: Integration