
Research On Unbalanced Text Data Set Classification Algorithm

Posted on: 2018-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yao
Full Text: PDF
GTID: 2348330566950397
Subject: Computer application technology
Abstract/Summary:
Text classification is a hot issue in the field of data mining. In practice, however, some classes contain a large number of texts while others contain relatively few, so the class distribution of the data set is significantly skewed, and it is often the smaller classes that the classifier must predict. Because of this imbalanced distribution, the features of the smaller classes cannot be adequately expressed, and the classifier tends toward the larger classes. Traditional text classification algorithms therefore have a low recognition rate on the smaller classes of imbalanced data, and effectively improving the classification accuracy of the smaller classes has become an urgent problem in machine learning and data mining.

The classification of an imbalanced text data set involves the following steps: word segmentation, stop-word removal, dimensionality reduction, text representation, classification, and classifier evaluation. The accuracy of a text classifier can be improved by reconstructing the sample space, improving the classification algorithm, or improving the feature selection algorithm. The main work and innovations of this paper are:

(1) In feature selection, Information Gain (IG) is widely used and generally effective, but on imbalanced data sets it cannot prevent the features of the smaller classes from being submerged. This paper therefore proposes a feature selection method called TF-IG, which combines the information gain feature selection algorithm with the term frequency-inverse document frequency (TF-IDF) algorithm. The TF-IG algorithm gives priority to selecting the characteristic features of the smaller classes.

(2) In classification, the research uses the Naive Bayes method to address the multi-class classification of imbalanced text data sets, and proposes a multinomial Naive Bayes text classification algorithm based on a weighted complement set and Good-Turing smoothing. A Naive Bayes text classifier must compute the probability of each feature word in a given class; because the text feature space is sparse, some feature words receive zero probability. This paper therefore applies the Good-Turing algorithm to smooth the frequencies of the multinomial Naive Bayes model and avoid zero probabilities. Furthermore, since the training samples of each class are distributed unevenly, the algorithm represents each class using the features of its complement set, which mitigates the tendency to recognize the larger classes while ignoring the smaller ones.
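The abstract does not give the exact formula for TF-IG, so the sketch below only illustrates one plausible combination of the two ingredients it names: each term's information gain is multiplied by its corpus TF-IDF weight. The function names (`information_gain`, `tf_idf`, `tf_ig_score`) and the product-style combination are assumptions for illustration, not the thesis's definition.

```python
import math
from collections import Counter


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) * H(C | t present) + P(~t) * H(C | t absent)]."""
    def entropy(lbls):
        n = len(lbls)
        return -sum((c / n) * math.log2(c / n) for c in Counter(lbls).values()) if n else 0.0

    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - (len(with_t) / n) * entropy(with_t)
            - (len(without_t) / n) * entropy(without_t))


def tf_idf(docs, term):
    """Corpus-level TF-IDF: overall term frequency times smoothed IDF."""
    df = sum(1 for d in docs if term in d)
    idf = math.log((1 + len(docs)) / (1 + df)) + 1
    tf = sum(d.count(term) for d in docs) / sum(len(d) for d in docs)
    return tf * idf


def tf_ig_score(docs, labels, term):
    """Hypothetical TF-IG score: information gain weighted by TF-IDF."""
    return information_gain(docs, labels, term) * tf_idf(docs, term)


# Toy corpus: three majority-class documents, one minority-class document.
docs = [["cheap", "buy"], ["cheap", "sale"], ["cheap", "news"], ["rare", "term"]]
labels = ["spam", "spam", "spam", "ham"]
print(tf_ig_score(docs, labels, "rare"), tf_ig_score(docs, labels, "buy"))
```

On this toy data, "rare" (the minority-class marker) scores well above "buy", because its information gain is high while its TF-IDF weight stays nonzero.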
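A minimal sketch of the second contribution, under stated assumptions: Good-Turing adjusted counts (r* = (r+1) N_{r+1} / N_r, falling back to the raw count when N_{r+1} = 0) are used to smooth per-class complement frequencies, and a document is assigned to the class whose complement explains it worst. The class name `GoodTuringComplementNB`, the way unseen mass N_1/N is spread over unseen vocabulary, and the omission of the per-class weight normalization used in published complement-NB variants are all simplifications, not the thesis's exact algorithm.

```python
import math
from collections import Counter


def good_turing_counts(counts):
    """Adjust raw counts r to Good-Turing estimates r* = (r+1) * N_{r+1} / N_r,
    keeping the raw count when no term occurs exactly r+1 times."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for term, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        adjusted[term] = (r + 1) * n_r1 / n_r if n_r1 > 0 else float(r)
    return adjusted


class GoodTuringComplementNB:
    """Illustrative complement Naive Bayes classifier with Good-Turing smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        vocab = {t for d in docs for t in d}
        self.probs, self.unseen = {}, {}
        for c in self.classes:
            # Complement counts: term frequencies over every class EXCEPT c.
            comp = Counter(t for d, l in zip(docs, labels) if l != c for t in d)
            total = sum(comp.values())
            adjusted = good_turing_counts(comp)
            self.probs[c] = {t: r_star / total for t, r_star in adjusted.items()}
            # Spread the Good-Turing unseen mass (N_1 / N) over unseen vocabulary.
            n1 = sum(1 for r in comp.values() if r == 1)
            n_unseen = max(len(vocab) - len(comp), 1)
            self.unseen[c] = max(n1, 1) / total / n_unseen
        return self

    def predict(self, doc):
        # Assign the class whose COMPLEMENT gives the document the LOWEST likelihood.
        def comp_loglik(c):
            return sum(math.log(self.probs[c].get(t, self.unseen[c])) for t in doc)
        return min(self.classes, key=comp_loglik)


# Imbalanced toy corpus: three "spam" documents, one "ham" document.
docs = [["cheap", "deal"], ["cheap", "offer"], ["cheap", "deal"], ["meeting", "notes"]]
labels = ["spam", "spam", "spam", "ham"]
clf = GoodTuringComplementNB().fit(docs, labels)
print(clf.predict(["cheap", "deal"]), clf.predict(["meeting", "notes"]))
```

Modeling the complement rather than the class itself is what helps the minority class here: the complement of "ham" is built from the plentiful "spam" texts, so the minority class is described by well-estimated statistics instead of its own few samples.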
Keywords/Search Tags: unbalanced data set, feature selection, data smoothing, weighted complement set, Naive Bayes