
Research Of Improved Mutual Information-Based Naive Bayesian Classification Model

Posted on: 2011-05-11    Degree: Master    Type: Thesis
Country: China    Candidate: L F Zhang    Full Text: PDF
GTID: 2178360305455229    Subject: Computer software and theory

Abstract/Summary:
As a cross-disciplinary field, data mining (DM) has developed rapidly in recent decades. It integrates many disciplines, including computer science, mathematics, engineering, and neuroscience. The main purpose of DM technology is to extract potentially useful patterns or knowledge from large, irregular, and complicated data; such patterns and knowledge can then guide human production and daily activities.

Data classification is one of the important subjects in data mining research. It has great practical significance because many common real-life decision-making problems can be converted into classification problems. The Naive Bayesian classification model (NBC) is a classic classification model. Although NBC is simple and efficient, its assumption of conditional independence between attributes is too restrictive: it cannot accurately convey the dependence between attributes, and the assumption has a negative impact on classification accuracy.

To solve this problem, researchers drew on the idea of feature selection and proposed the mutual information-based Naive Bayesian classification model (MI-NBC). It uses the mutual information value as an indicator to distinguish strong attributes from weak ones, deletes the weak attributes before classification begins, and categorizes using only the strong attributes. This approach not only greatly reduces the amount of data to be processed and accelerates the classification process, but also reduces the negative influence of weak attributes, which helps to improve classification accuracy.

This paper takes the feature-selection idea of the MI-NBC model and applies it at a finer granularity, to each data record, and on this basis proposes a general mutual information-based Naive Bayesian classification model (GMI-NBC).
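The mutual-information screening that MI-NBC performs before classification can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the toy data, the threshold value, and the helper name are assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(values, labels):
    """Estimate I(X; C) from co-occurrence counts of attribute values and class labels."""
    n = len(values)
    joint = Counter(zip(values, labels))
    px = Counter(values)
    pc = Counter(labels)
    mi = 0.0
    for (x, c), nxc in joint.items():
        pxc = nxc / n
        # p(x,c) * log( p(x,c) / (p(x) * p(c)) ), with probabilities as count ratios
        mi += pxc * math.log(pxc * n * n / (px[x] * pc[c]))
    return mi

# Toy training set: each row is categorical attributes followed by a class label.
rows = [
    ("sunny", "hot",  "yes"),
    ("sunny", "mild", "yes"),
    ("rain",  "hot",  "no"),
    ("rain",  "mild", "no"),
]
attrs = list(zip(*[r[:-1] for r in rows]))   # one tuple per attribute column
labels = [r[-1] for r in rows]

threshold = 0.1  # illustrative strength threshold, not a value from the thesis
strong = [i for i, col in enumerate(attrs)
          if mutual_information(col, labels) > threshold]
# Attribute 0 perfectly predicts the class, attribute 1 carries no information,
# so only attribute 0 survives the screening.
```

In MI-NBC the surviving `strong` attribute indices would then be the only columns kept in both the training and testing sets before the Naive Bayesian model is applied.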
After that, the paper discusses parameter estimation and data pre-processing and provides the corresponding methods. Finally, classification experiments on several data sets show that the GMI-NBC model outperforms the MI-NBC and NBC models in accuracy, demonstrating that GMI-NBC is feasible.

The first part of this paper introduces the relevant background of DM technology and the theoretical basis of Bayesian classification. The introduction describes the major causes of DM's emergence and its current state of development, briefly describes the concepts and main tasks of DM technology, and predicts the development trends of related technologies. It then describes the definition of the classification problem, commonly used methods, model evaluation criteria, and assessment tools.

The second chapter first introduces some basic concepts, including the total probability formula, Bayes' theorem, the maximum a posteriori hypothesis, and maximum likelihood estimation. It then discusses several common Bayesian classification models, including the Naive Bayesian model, the Bayesian network model, the tree-augmented Bayesian network model, and the general Bayesian classification model. We analyze the strengths and weaknesses of these models and propose an information theory-based feature selection method to improve the Naive Bayesian model.

The second part describes the improved MI-NBC model and, based on it, proposes the new improved GMI-NBC model. The MI-NBC model starts from the original Naive Bayesian classification: it adds a feature selection operation to the NBC model, reduces computation by deleting some of the noise attributes, and thereby improves classification accuracy.
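The Bayesian foundations mentioned above (Bayes' theorem, the MAP hypothesis, and the independence assumption behind NBC) reduce to the following standard formulas, for a class $c$ and an attribute vector $x = (x_1, \dots, x_n)$:

```latex
% Bayes' theorem:
P(c \mid x_1, \dots, x_n) \;=\; \frac{P(c)\,P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}

% Under the conditional-independence assumption, the NBC (MAP) decision rule:
c^{*} \;=\; \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

It is exactly the product over all $n$ attributes in the second formula that feature selection trims: weak attributes contribute near-uninformative factors $P(x_i \mid c)$ that can distort the maximization.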
The feature selection operation calculates, for each attribute in the training set, its mutual information with the classification, sorts the attributes by this value in ascending order, and then divides them into two attribute sets using an appropriately chosen threshold. To achieve the screening, the weak attributes are deleted from both the training and testing sets before classification with the Naive Bayesian model.

The general mutual information-based Naive Bayesian classification model (GMI-NBC) improves on MI-NBC. GMI-NBC also applies the feature-selection idea, but its starting point is not the entire training set: it operates on each individual record in the training set. The model calculates the mutual information of each attribute with the final classification on each record, and then divides the attributes into two sets as MI-NBC does; in the subsequent classification process, only the strong attributes are used. Finally, experiments on a number of UCI data sets, compared against the other models' results, show that GMI-NBC achieves better accuracy than the NBC and MI-NBC models on most of the data sets, confirming the model's feasibility.

By analyzing the experimental results in the third chapter, we found that good classification accuracy requires a suitable strong-attribute threshold, so estimating this threshold becomes the key issue in perfecting the GMI-NBC model. In chapter four, a self-adaptive algorithm that estimates the threshold by 10-fold cross-validation on the training set is introduced to solve the threshold estimation problem. Finally, we obtain the improved process design of the GMI-NBC model.

Beyond this paper, many issues in the research of the GMI-NBC model remain to be studied further.
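The self-adaptive threshold estimation by 10-fold cross-validation can be sketched as below. This is only an outline under stated assumptions: the candidate-threshold grid and the `train_and_score` callback (which would wrap the actual GMI-NBC training and accuracy measurement) are hypothetical placeholders, not the thesis's algorithm verbatim.

```python
def cv_accuracy(rows, threshold, train_and_score, k=10):
    """Mean accuracy of k-fold cross-validation on the training set.

    train_and_score(train, test, threshold) is a hypothetical callback that
    trains a classifier with the given strength threshold and returns its
    accuracy on the held-out fold.
    """
    folds = [rows[i::k] for i in range(k)]          # k interleaved folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(train_and_score(train, test, threshold))
    return sum(scores) / k

def estimate_threshold(rows, candidates, train_and_score):
    """Pick the candidate threshold with the best cross-validated accuracy."""
    return max(candidates, key=lambda t: cv_accuracy(rows, t, train_and_score))
```

With a real scoring callback, `estimate_threshold(train_rows, [0.05, 0.1, 0.2, ...], score_fn)` would return the threshold to use in the final GMI-NBC run; only the training set is touched, so the testing set stays unseen during estimation.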
Although we have found a suitable measure of dependence between attributes, we believe there may be other measures that require less computation and yield better accuracy than mutual information. In further research on methods for estimating the strong-attribute threshold, we also hope to find simpler and more efficient estimation methods.
Keywords/Search Tags: Data Mining, Classification, Naive Bayesian Classification, Mutual Information