
Research On Unbalanced Text Data Set Classification Algorithm

Posted on: 2018-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yao
Full Text: PDF
GTID: 2348330566950397
Subject: Computer application technology
Abstract/Summary:
Text classification is a hot issue in the field of data mining. In practice, however, some classes contain a large number of texts while others contain relatively few, so the class distribution of the data set is significantly skewed, and it is often the smaller classes that the classifier must predict. Because of this imbalanced distribution, the features of the smaller classes cannot be adequately expressed, and the classifier tends toward the larger classes. Traditional text classification algorithms therefore have a low recognition rate on the smaller classes of imbalanced data, and effectively improving the classification accuracy of the smaller classes has become an urgent problem in machine learning and data mining.

The classification of an imbalanced text data set involves the following steps: word segmentation, stop-word removal, dimensionality reduction, text representation, classification, and classifier evaluation. The accuracy of a text classifier can be improved by reconstructing the sample space, improving the classification algorithm, or improving the feature selection algorithm. The main work and innovations of this paper are:

(1) In feature selection, Information Gain (IG) is widely used and generally effective, but on imbalanced data sets it cannot prevent the features of the smaller classes from being submerged. This paper therefore proposes a feature selection method called TF-IG, which combines the information gain feature selection algorithm with the term frequency-inverse document frequency (TF-IDF) algorithm. The TF-IG algorithm gives priority to selecting the characteristic features of the smaller classes.

(2) In classification, the research uses the Naive Bayes method to address the multi-class classification of imbalanced text data sets, and proposes a multinomial Naive Bayes text classification algorithm based on a weighted complement set and Good-Turing smoothing. A Naive Bayes text classifier must compute the probability of each feature word in a given class; because the text feature space is sparse, some feature words receive zero probability. This paper therefore applies the Good-Turing algorithm to smooth the frequencies of the multinomial Naive Bayes model and avoid zero probabilities. Furthermore, since the training samples of each class are distributed unevenly, the algorithm represents each class using the features of its complement set, which mitigates the tendency to recognize the larger classes while ignoring the smaller ones.
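The abstract does not give the exact formula for TF-IG, so the sketch below only illustrates one plausible combination of the two ingredients it names: each term's information gain is multiplied by its corpus TF-IDF weight. The function names (`information_gain`, `tf_idf`, `tf_ig_score`) and the product-style combination are assumptions for illustration, not the thesis's definition.

```python
import math
from collections import Counter


def information_gain(docs, labels, term):
    """IG(t) = H(C) - [P(t) * H(C | t present) + P(~t) * H(C | t absent)]."""
    def entropy(lbls):
        n = len(lbls)
        return -sum((c / n) * math.log2(c / n) for c in Counter(lbls).values()) if n else 0.0

    with_t = [l for d, l in zip(docs, labels) if term in d]
    without_t = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    return (entropy(labels)
            - (len(with_t) / n) * entropy(with_t)
            - (len(without_t) / n) * entropy(without_t))


def tf_idf(docs, term):
    """Corpus-level TF-IDF: overall term frequency times smoothed IDF."""
    df = sum(1 for d in docs if term in d)
    idf = math.log((1 + len(docs)) / (1 + df)) + 1
    tf = sum(d.count(term) for d in docs) / sum(len(d) for d in docs)
    return tf * idf


def tf_ig_score(docs, labels, term):
    """Hypothetical TF-IG score: information gain weighted by TF-IDF."""
    return information_gain(docs, labels, term) * tf_idf(docs, term)


# Toy corpus: three majority-class documents, one minority-class document.
docs = [["cheap", "buy"], ["cheap", "sale"], ["cheap", "news"], ["rare", "term"]]
labels = ["spam", "spam", "spam", "ham"]
print(tf_ig_score(docs, labels, "rare"), tf_ig_score(docs, labels, "buy"))
```

On this toy data, "rare" (the minority-class marker) scores well above "buy", because its information gain is high while its TF-IDF weight stays nonzero.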
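A minimal sketch of the second contribution, under stated assumptions: Good-Turing adjusted counts (r* = (r+1) N_{r+1} / N_r, falling back to the raw count when N_{r+1} = 0) are used to smooth per-class complement frequencies, and a document is assigned to the class whose complement explains it worst. The class name `GoodTuringComplementNB`, the way unseen mass N_1/N is spread over unseen vocabulary, and the omission of the per-class weight normalization used in published complement-NB variants are all simplifications, not the thesis's exact algorithm.

```python
import math
from collections import Counter


def good_turing_counts(counts):
    """Adjust raw counts r to Good-Turing estimates r* = (r+1) * N_{r+1} / N_r,
    keeping the raw count when no term occurs exactly r+1 times."""
    freq_of_freq = Counter(counts.values())
    adjusted = {}
    for term, r in counts.items():
        n_r, n_r1 = freq_of_freq[r], freq_of_freq.get(r + 1, 0)
        adjusted[term] = (r + 1) * n_r1 / n_r if n_r1 > 0 else float(r)
    return adjusted


class GoodTuringComplementNB:
    """Illustrative complement Naive Bayes classifier with Good-Turing smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        vocab = {t for d in docs for t in d}
        self.probs, self.unseen = {}, {}
        for c in self.classes:
            # Complement counts: term frequencies over every class EXCEPT c.
            comp = Counter(t for d, l in zip(docs, labels) if l != c for t in d)
            total = sum(comp.values())
            adjusted = good_turing_counts(comp)
            self.probs[c] = {t: r_star / total for t, r_star in adjusted.items()}
            # Spread the Good-Turing unseen mass (N_1 / N) over unseen vocabulary.
            n1 = sum(1 for r in comp.values() if r == 1)
            n_unseen = max(len(vocab) - len(comp), 1)
            self.unseen[c] = max(n1, 1) / total / n_unseen
        return self

    def predict(self, doc):
        # Assign the class whose COMPLEMENT gives the document the LOWEST likelihood.
        def comp_loglik(c):
            return sum(math.log(self.probs[c].get(t, self.unseen[c])) for t in doc)
        return min(self.classes, key=comp_loglik)


# Imbalanced toy corpus: three "spam" documents, one "ham" document.
docs = [["cheap", "deal"], ["cheap", "offer"], ["cheap", "deal"], ["meeting", "notes"]]
labels = ["spam", "spam", "spam", "ham"]
clf = GoodTuringComplementNB().fit(docs, labels)
print(clf.predict(["cheap", "deal"]), clf.predict(["meeting", "notes"]))
```

Modeling the complement rather than the class itself is what helps the minority class here: the complement of "ham" is built from the plentiful "spam" texts, so the minority class is described by well-estimated statistics instead of its own few samples.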
Keywords/Search Tags: unbalanced data set, feature selection, data smoothing, weighted complement set, Naive Bayes