
Improvement Of Naive Bayes Text Classification Algorithm Based On Unbalanced Dataset

Posted on: 2019-01-03    Degree: Master    Type: Thesis
Country: China    Candidate: K Chen    Full Text: PDF
GTID: 2428330548976871    Subject: Computer application technology

Abstract/Summary:
Text classification is a hot topic in data mining, but in practice the class distribution of a data set is often heavily skewed: some classes contain many texts while others contain relatively few, and it is frequently the minority class, which carries the important information, that the classifier is meant to predict. Because of this unbalanced distribution the features of the minority class cannot be adequately expressed, so the classifier tends toward the majority class. Traditional text classification algorithms therefore have a low recognition rate on the minority class of unbalanced data, and effectively improving its classification accuracy has become an urgent problem in machine learning and data mining. There are three ways to improve the accuracy of an imbalanced text classifier: improving the sample space, improving the text classification algorithm, and improving the combined (ensemble) classification algorithm. The main work of this paper is as follows.

(1) Sample space improvement. Most existing studies consider only the number of samples and ignore sample weights. This paper therefore proposes the KWCNB text classification algorithm, which uses the KNN algorithm to select neighbor samples from the majority class and to compute their weights; the weights of the selected training samples are then used to modify the complement Naive Bayes (CNB) formula. This addresses the unbalanced data distribution and also weakens the attribute independence assumption in CNB.

(2) Text classification algorithm improvement. Naive Bayes already performs well in text classification, so it is adapted here to imbalanced text data sets. This paper proposes the TFWCNB text classification algorithm, which improves complement Naive Bayes with attribute weighting and uses the TF-IDF algorithm to compute the weight of each feature word in the current document, alleviating the classifier's tendency to favor large classes and ignore small ones.

(3) Ensemble classification algorithm improvement. Existing combined classifiers do not consider the relationship between the base classifier and the training sample weights. This paper therefore proposes the ADAWCNB text classification algorithm, which uses AdaBoost to adjust the training sample weights over successive iterations so that the classifier emphasizes misclassified training samples, and feeds those sample weights into the complement Naive Bayes base classifier. The resulting ensemble is more accurate than its base classifier and further reduces the misclassification of small classes into large ones.

The improved algorithms are evaluated with the classification accuracy rate, recall rate, and F-measure. Simulation results show that ADAWCNB performs best on both balanced and unbalanced data sets, KWCNB is second, and TFWCNB performs worst of the three; all three, however, outperform the traditional NB and CNB algorithms, which means the improved algorithms offer an advantage to a certain degree.
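A minimal sketch of the sample-weighting idea in (1), under an assumed concrete scheme: majority-class documents that appear among the k nearest neighbors of minority-class documents receive larger weights, and the weights are passed to scikit-learn's ComplementNB as sample_weight. The function names, the baseline weight of 0.1, and the 1/(1+distance) formula are illustrative assumptions, not the thesis's exact KWCNB definition.

```python
# Hypothetical sketch of the KWCNB idea: weight majority-class documents by their
# kNN relationship to the minority class, then fit complement Naive Bayes with
# those per-sample weights. X_major / X_minor are assumed to be sparse term-count
# matrices (e.g. produced by CountVectorizer).
import numpy as np
from scipy.sparse import vstack
from sklearn.neighbors import NearestNeighbors
from sklearn.naive_bayes import ComplementNB

def knn_sample_weights(X_major, X_minor, k=5):
    """Assign each majority-class document a weight; documents that are among the
    k nearest neighbors of some minority-class document get a larger weight
    (closer -> heavier). The 0.1 baseline and 1/(1+d) formula are assumptions."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_major)
    dist, idx = nn.kneighbors(X_minor)            # k majority neighbors per minority doc
    weights = np.full(X_major.shape[0], 0.1)      # small baseline weight for non-neighbors
    for d_row, i_row in zip(dist, idx):
        weights[i_row] = np.maximum(weights[i_row], 1.0 / (1.0 + d_row))
    return weights

def fit_kwcnb(X_major, y_major, X_minor, y_minor, k=5):
    w = np.concatenate([knn_sample_weights(X_major, X_minor, k),
                        np.ones(X_minor.shape[0])])
    X = vstack([X_major, X_minor])
    y = np.concatenate([y_major, y_minor])
    clf = ComplementNB()
    clf.fit(X, y, sample_weight=w)                # weights enter the CNB count statistics
    return clf
```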
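The attribute-weighting idea in (2) can be approximated with standard components: TF-IDF feature weights feeding a complement Naive Bayes classifier. Pairing TfidfVectorizer with ComplementNB is a sketch of the general TFWCNB approach, not the thesis's exact formula, and the toy corpus is a placeholder.

```python
# Sketch of the TFWCNB idea: TF-IDF attribute weighting in front of complement
# Naive Bayes, so frequent but uninformative terms carry less weight.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

docs = ["cheap watches buy now",           # placeholder documents, class 1
        "limited offer cheap pills",
        "project meeting moved to noon",   # placeholder documents, class 0
        "please review the meeting notes"]
labels = [1, 1, 0, 0]

# sublinear_tf=True uses 1 + log(tf), one common TF variant; a plain choice, not the thesis's.
tfwcnb = make_pipeline(TfidfVectorizer(sublinear_tf=True), ComplementNB())
tfwcnb.fit(docs, labels)
print(tfwcnb.predict(["cheap offer on watches"]))
```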
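A sketch of the boosting idea in (3): a standard multi-class AdaBoost (SAMME) loop whose per-round sample weights are fed directly into a complement Naive Bayes base learner. The update rule below is the textbook SAMME rule and is only assumed to reflect the spirit of ADAWCNB, not its exact formulation.

```python
# Sketch of the ADAWCNB idea: AdaBoost-style reweighting with a weighted
# complement Naive Bayes base classifier.
import numpy as np
from sklearn.naive_bayes import ComplementNB

def fit_adawcnb(X, y, rounds=10):
    classes = np.unique(y)
    n, K = X.shape[0], len(classes)
    w = np.full(n, 1.0 / n)                        # uniform initial sample weights
    models, alphas = [], []
    for _ in range(rounds):
        clf = ComplementNB()
        clf.fit(X, y, sample_weight=w)             # weighted complement NB base classifier
        miss = (clf.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()
        if err >= 1.0 - 1.0 / K:                   # no better than random guessing: stop
            break
        alpha = np.log((1.0 - err + 1e-10) / (err + 1e-10)) + np.log(K - 1.0)
        models.append(clf)
        alphas.append(alpha)
        if err == 0.0:                             # perfect weighted fit: stop early
            break
        w *= np.exp(alpha * miss)                  # emphasize misclassified documents
        w /= w.sum()
    return models, alphas, classes

def predict_adawcnb(models, alphas, classes, X):
    votes = np.zeros((X.shape[0], len(classes)))
    for clf, alpha in zip(models, alphas):
        pred = clf.predict(X)
        for j, c in enumerate(classes):
            votes[:, j] += alpha * (pred == c)     # weighted vote of the ensemble
    return classes[votes.argmax(axis=1)]
```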
Keywords/Search Tags:Unbalanced data set, Text weighting, Attribute weighting, Ensemble classifier, Naive Bayes