A Method for Dealing With Sample Imbalances In Text Classification

Posted on: 2019-06-04
Degree: Master
Type: Thesis
Country: China
Candidate: X Jing
Full Text: PDF
GTID: 2428330566468204
Subject: Computer Science and Technology
Abstract/Summary:
Classification is a core research topic in data mining. Most classification research targets balanced data sets, but in practice unbalanced data sets are common, so classifying them has practical value. Conventional classification research aims to improve overall accuracy on a data set. For unbalanced data sets, however, this goal skews the results toward the class with the larger number of samples. In many practical cases, misclassifying a sample of the minority class (the class with fewer training samples) as the majority class (the class with more training samples) costs more than misclassifying a majority-class sample as the minority class. Research on unbalanced data sets should therefore focus on improving recognition of minority-class samples.

At present there are two kinds of approaches to classifying unbalanced data sets: data-level approaches and algorithm-level approaches. This thesis works at the feature selection stage and proposes a partial feature selection method. For an unbalanced data set, the minority class is treated as the positive class, and the aim is to improve the minority-class F value. First, news texts are collected from the web. Second, the texts are divided by topic into economic and non-economic categories, segmented into words, and filtered, with the economic class as the minority class and the non-economic class as the majority class. Then N feature words are extracted from the training set with each of the four feature selection methods, and each text is represented as a vector over those feature words. Each feature vector is then labeled with the category of its news text. Finally, a support vector machine is trained as the classification model and its performance is tested. The feature selection methods used in this thesis are the chi-square test, information gain, mutual information, and the partial selection method. The experimental results show that the minority-class F value can reach 0.65 with the chi-square test, mutual information, and information gain, and 0.79 with the partial feature selection method.
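
The feature selection and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it covers only the chi-square baseline (the partial feature selection method is not specified in the abstract), it assumes binary word-presence vectors, and it assumes that "F value" means the balanced F1 measure. All function names are hypothetical, and the support vector machine training step is left out because it would need an external library.

```python
from collections import Counter


def chi_square_scores(docs, labels, positive):
    """Chi-square score of every word against the positive (minority) class.

    docs is a list of word lists, labels the matching list of class labels.
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()
    for words, y in zip(docs, labels):
        # Count each word once per document (document frequency).
        (df_pos if y == positive else df_neg).update(set(words))
    scores = {}
    for w in set(df_pos) | set(df_neg):
        a = df_pos[w]   # positive documents containing w
        b = df_neg[w]   # negative documents containing w
        c = n_pos - a   # positive documents without w
        d = n_neg - b   # negative documents without w
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[w] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, positive, n_features):
    """Return the n_features highest-scoring words (ties broken alphabetically)."""
    scores = chi_square_scores(docs, labels, positive)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n_features]


def to_vector(words, features):
    """Represent one text as a binary vector over the selected feature words."""
    present = set(words)
    return [1 if f in present else 0 for f in features]


def f_value(y_true, y_pred, positive):
    """F1 of the positive class (assumed meaning of the 'F value')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the difference term is squared, the chi-square score is symmetric: a word concentrated in the majority class scores as highly as one equally concentrated in the minority class.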
Keywords/Search Tags:Unbalanced data sets, Classification, Feature, Minor class F value