
Research On Selective Bayesian Classifiers

Posted on: 2009-03-01 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: J N Chen | Full Text: PDF
GTID: 1118360242989826 | Subject: Computer application technology
Abstract/Summary:
Classification is an elementary and important task in pattern recognition, machine learning, and data mining. As one of the best-performing families of classification methods, Bayesian classifiers, founded on Bayesian statistics and Bayesian networks, offer a strong ability to process incomplete data, interpretable models, and high accuracy. In particular, the Naive Bayesian classifier (NB), the earliest and simplest of these, achieves very high classification accuracy, matching or even exceeding that of other mature classifiers such as C4.5 in many situations. Furthermore, NB is robust to noisy data.

NB has been applied in many areas since it was proposed, and its effectiveness has been verified in practice. As its applications have grown, however, its disadvantage has become increasingly clear. NB rests on a strong conditional independence assumption: within each class, the probability distribution of an attribute (i.e., feature) is independent of those of the other attributes. Real-world datasets usually do not satisfy this assumption, which often noticeably reduces the classification performance of NB. One way to address this problem is to delete redundant attributes by attribute selection and build NB on the remaining attributes, i.e., to construct a selective Naive Bayesian classifier. Several effective algorithms for selective Bayesian classifiers have been proposed, but most of them apply only to complete, low-dimensional datasets. In practice many datasets are incomplete, and most of them contain redundant or irrelevant attributes that can seriously harm classification accuracy and efficiency. Owing to the complexity of processing incomplete data, however, very few selective classifiers exist for incomplete datasets.
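The conditional independence assumption described above can be made concrete with a minimal NB sketch over categorical attributes. This is an illustrative reconstruction, not code from the dissertation; the toy data, Laplace smoothing, and function names are assumptions:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class counts and per-attribute value counts from categorical data."""
    class_counts = Counter(labels)
    cond = Counter()           # cond[(i, value, c)]: count of attribute i = value in class c
    values = defaultdict(set)  # distinct values seen for each attribute
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(i, v, c)] += 1
            values[i].add(v)
    return class_counts, cond, values, len(labels)

def predict_nb(model, row):
    """Pick argmax_c P(c) * prod_i P(x_i | c), with Laplace smoothing."""
    class_counts, cond, values, n = model
    best, best_score = None, float("-inf")
    for c, nc in class_counts.items():
        score = math.log(nc / n)
        for i, v in enumerate(row):
            # Conditional independence assumption: one multiplicative factor per attribute
            score += math.log((cond[(i, v, c)] + 1) / (nc + len(values[i])))
        if score > best_score:
            best, best_score = c, score
    return best
```

For example, trained on four toy weather records, `predict_nb(model, ("sunny", "hot"))` picks the class whose per-attribute likelihoods multiply to the largest posterior.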
Therefore, constructing selective Bayesian classifiers for incomplete data, which can exploit the advantage of Bayesian classifiers in processing incomplete data, is an important task and one of the main research topics of this dissertation. In addition, with the development of information technology, more and more high-dimensional data has emerged. Because NB is simple and efficient, it is well suited to processing high-dimensional data, but it is also sensitive to attribute selection. Hence the study of selective Bayesian classifiers for high-dimensional datasets is important, and it is another research topic of this dissertation.

The main contributions of this dissertation are as follows.

(1) Based on an analysis of the main methods of processing incomplete data for classification, the Distribution-based Bayesian Classifier for Incomplete data (DBCI) is proposed. During training, DBCI distributes the frequencies of missing values appropriately over those of the observed values, so the information contained in incomplete datasets is fully utilized. The classification accuracy of DBCI is comparable to that of RBC (Robust Bayes Classifier), a very effective classifier for incomplete data, while its efficiency is higher than that of the latter.

(2) Although incomplete datasets often contain many redundant or irrelevant attributes that seriously harm classification efficiency and accuracy, very few selective classifiers exist for incomplete data. Two wrapper-based selective Bayesian classifiers for incomplete data are therefore presented. First, through an analysis of classifiers for incomplete data, we construct the Selective Robust Bayesian Classifier (SRBC). Compared with RBC and DBCI, SRBC attains much higher accuracy and sharply reduces the number of redundant or irrelevant attributes. Then, building on the more effective DBCI, we present the Selective Distribution-based Bayesian Classifier for incomplete data (SDBC).
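The abstract does not spell out how DBCI apportions the frequencies of missing values. One plausible reading, sketched below, is to spread each missing entry's unit of count over the observed values in proportion to their observed frequencies, so every record still contributes a full (possibly fractional) count. The sentinel, function name, and proportional rule are assumptions, not the dissertation's definition:

```python
from collections import Counter

MISSING = None  # sentinel for a missing entry

def distributed_counts(column):
    """Spread the count of missing entries over the observed values,
    proportionally to their observed frequencies (assumed DBCI-style scheme)."""
    observed = Counter(v for v in column if v is not MISSING)
    n_missing = sum(1 for v in column if v is MISSING)
    total = sum(observed.values())
    if total == 0:
        return {}
    return {v: c + n_missing * c / total for v, c in observed.items()}
```

Under this scheme the fractional counts still sum to the number of records, so probability estimates stay normalized while no incomplete record is discarded.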
Compared with SRBC, SDBC is more efficient and more accurate.

(3) To further improve the efficiency of SRBC and SDBC, we present three selective classifiers based on hybrid methods. First, combining SRBC with a simplified gain-ratio formula, we present the Selective Robust Bayes Classifier Based on Gain ratio (SRBCBG). Likewise, combining SRBC with a chi-square statistic for incomplete data, we construct the Chi-square-Based Selective Robust Bayes Classifier (CBSRBC). Compared with SRBC and SDBC, both CBSRBC and SRBCBG have much higher efficiency and accuracy. To construct selective classifiers that scale to large incomplete datasets, we combine an extended Relief-F algorithm with SDBC to obtain the Relief-F-algorithm-Based Selective DBCI (RBSD), which is more efficient than SRBCBG and CBSRBC.

(4) For the most abundant kind of high-dimensional data, text data, we present two attribute evaluation functions for multi-class text data, used to construct selective Bayesian classifiers with the filter method. Classification results on text datasets show that selective Bayesian classifiers using these two evaluation functions perform much better than those using other functions.
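The dissertation's simplified gain-ratio formula is not given in the abstract. As a stand-in, the standard gain ratio (information gain divided by split information, as in C4.5) illustrates the filter step such hybrid methods use to rank attributes before the Bayesian selection stage. A sketch with hypothetical names, for complete categorical data:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(column, labels):
    """Standard gain ratio: information gain normalized by split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(column)  # entropy of the attribute's own value distribution
    return gain / split_info if split_info > 0 else 0.0

def rank_attributes(rows, labels):
    """Rank attribute indices by gain ratio, best first (the filter step)."""
    cols = list(zip(*rows))
    return sorted(range(len(cols)),
                  key=lambda i: gain_ratio(cols[i], labels),
                  reverse=True)
```

Ranking attributes once with such a cheap filter, then running the wrapper search only over the top-ranked ones, is what gives hybrid methods their efficiency edge over pure wrappers.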
Keywords/Search Tags:Bayesian classification, Attribute selection, Incomplete data, High-dimensional data, Text classification