
Improvement Of Naive Bayes Text Classification Algorithm Based On Unbalanced Dataset

Posted on: 2019-01-03    Degree: Master    Type: Thesis
Country: China    Candidate: K Chen    Full Text: PDF
GTID: 2428330548976871    Subject: Computer application technology

Abstract/Summary:
Text classification is a hot topic in data mining, but in practice the class distribution of a data set is often heavily skewed: some classes contain many texts while others contain relatively few, and it is frequently the minority class, which carries the important information, that the classifier is meant to predict. Because of this unbalanced distribution the features of the minority class cannot be adequately expressed, so the classifier tends toward the majority class. Traditional text classification algorithms therefore have a low recognition rate on the minority class of unbalanced data, and effectively improving its classification accuracy has become an urgent problem in machine learning and data mining. There are three ways to improve the accuracy of an imbalanced text classifier: improving the sample space, improving the text classification algorithm, and improving the combined (ensemble) classification algorithm. The main work of this paper is as follows.

(1) Sample space improvement. Most existing studies consider only the number of samples and ignore sample weights. This paper therefore proposes the KWCNB text classification algorithm, which uses the KNN algorithm to select neighbor samples from the majority class and to compute their weights; the weights of the selected training samples are then used to modify the complement Naive Bayes (CNB) formula. This addresses the unbalanced data distribution and also weakens the attribute independence assumption in CNB.

(2) Text classification algorithm improvement. Naive Bayes already performs well in text classification, so it is adapted here to imbalanced text data sets. This paper proposes the TFWCNB text classification algorithm, which improves complement Naive Bayes with attribute weighting and uses the TF-IDF algorithm to compute the weight of each feature word in the current document, alleviating the classifier's tendency to favor large classes and ignore small ones.

(3) Ensemble classification algorithm improvement. Existing combined classifiers do not consider the relationship between the base classifier and the training sample weights. This paper therefore proposes the ADAWCNB text classification algorithm, which uses AdaBoost to adjust the training sample weights over successive iterations so that the classifier emphasizes misclassified training samples, and feeds those sample weights into the complement Naive Bayes base classifier. The resulting ensemble is more accurate than its base classifier and further reduces the misclassification of small classes into large ones.

The improved algorithms are evaluated with the classification accuracy rate, recall rate, and F-measure. Simulation results show that ADAWCNB performs best on both balanced and unbalanced data sets, KWCNB is second, and TFWCNB performs worst of the three; all three, however, outperform the traditional NB and CNB algorithms, which means the improved algorithms offer an advantage to a certain degree.
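A minimal sketch of the sample-weighting idea in (1), under an assumed concrete scheme: majority-class documents that appear among the k nearest neighbors of minority-class documents receive larger weights, and the weights are passed to scikit-learn's ComplementNB as sample_weight. The function names, the baseline weight of 0.1, and the 1/(1+distance) formula are illustrative assumptions, not the thesis's exact KWCNB definition.

```python
# Hypothetical sketch of the KWCNB idea: weight majority-class documents by their
# kNN relationship to the minority class, then fit complement Naive Bayes with
# those per-sample weights. X_major / X_minor are assumed to be sparse term-count
# matrices (e.g. produced by CountVectorizer).
import numpy as np
from scipy.sparse import vstack
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import ComplementNB

def knn_sample_weights(X_major, X_minor, k=5):
    """Assign each majority-class document a weight; documents that are among the
    k nearest neighbors of some minority-class document get a larger weight
    (closer -> heavier). The 0.1 baseline and 1/(1+d) formula are assumptions."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_major)
    dist, idx = nn.kneighbors(X_minor)            # k majority neighbors per minority doc
    weights = np.full(X_major.shape[0], 0.1)      # small baseline weight for non-neighbors
    for d_row, i_row in zip(dist, idx):
        weights[i_row] = np.maximum(weights[i_row], 1.0 / (1.0 + d_row))
    return weights

def fit_kwcnb(X_major, y_major, X_minor, y_minor, k=5):
    w = np.concatenate([knn_sample_weights(X_major, X_minor, k),
                        np.ones(X_minor.shape[0])])
    X = vstack([X_major, X_minor])
    y = np.concatenate([y_major, y_minor])
    clf = ComplementNB()
    clf.fit(X, y, sample_weight=w)                # weights enter the CNB count statistics
    return clf
```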
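The attribute-weighting idea in (2) can be approximated with standard components: TF-IDF feature weights feeding a complement Naive Bayes classifier. Pairing TfidfVectorizer with ComplementNB is a sketch of the general TFWCNB approach, not the thesis's exact formula, and the toy corpus is a placeholder.

```python
# Sketch of the TFWCNB idea: TF-IDF attribute weighting in front of complement
# Naive Bayes, so frequent but uninformative terms carry less weight.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

docs = ["cheap watches buy now",           # placeholder documents, class 1
        "limited offer cheap pills",
        "project meeting moved to noon",   # placeholder documents, class 0
        "please review the meeting notes"]
labels = [1, 1, 0, 0]

# sublinear_tf=True uses 1 + log(tf), one common TF variant; a plain choice, not the thesis's.
tfwcnb = make_pipeline(TfidfVectorizer(sublinear_tf=True), ComplementNB())
tfwcnb.fit(docs, labels)
print(tfwcnb.predict(["cheap offer on watches"]))
```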
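A sketch of the boosting idea in (3): a standard multi-class AdaBoost (SAMME) loop whose per-round sample weights are fed directly into a complement Naive Bayes base learner. The update rule below is the textbook SAMME rule and is only assumed to reflect the spirit of ADAWCNB, not its exact formulation.

```python
# Sketch of the ADAWCNB idea: AdaBoost-style reweighting with a weighted
# complement Naive Bayes base classifier.
import numpy as np
from sklearn.naive_bayes import ComplementNB

def fit_adawcnb(X, y, rounds=10):
    classes = np.unique(y)
    n, K = X.shape[0], len(classes)
    w = np.full(n, 1.0 / n)                        # uniform initial sample weights
    models, alphas = [], []
    for _ in range(rounds):
        clf = ComplementNB()
        clf.fit(X, y, sample_weight=w)             # weighted complement NB base classifier
        miss = (clf.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()
        if err >= 1.0 - 1.0 / K:                   # no better than random guessing: stop
            break
        alpha = np.log((1.0 - err + 1e-10) / (err + 1e-10)) + np.log(K - 1.0)
        models.append(clf)
        alphas.append(alpha)
        if err == 0.0:                             # perfect weighted fit: stop early
            break
        w *= np.exp(alpha * miss)                  # emphasize misclassified documents
        w /= w.sum()
    return models, alphas, classes

def predict_adawcnb(models, alphas, classes, X):
    votes = np.zeros((X.shape[0], len(classes)))
    for clf, alpha in zip(models, alphas):
        pred = clf.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += alpha * (pred == c)     # weighted vote of the ensemble
    return classes[votes.argmax(axis=1)]
```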
Keywords/Search Tags:Unbalanced data set, Text weighting, Attribute weighting, Ensemble classifier, Naive Bayes