A Method Dealing With Sample Imbalances In Text Classification

Posted on:2019-06-04

Degree:Master

Type:Thesis

Country:China

Candidate:X Jing

Full Text:PDF

GTID:2428330566468204

Subject:Computer Science and Technology

Abstract/Summary:

Classification is a core research topic in data mining. Classification research has generally targeted balanced data sets, but in practice unbalanced data sets are common, so classifying them has practical value. Conventional classification aims to improve overall accuracy on the data set. For unbalanced data sets, however, this goal skews the results toward the class with the larger number of samples. Moreover, in many practical cases the cost of misclassifying a minority-class sample (a sample from the class with few training samples) as a majority-class sample is greater than the cost of misclassifying a majority-class sample as a minority-class sample. Research on classifying unbalanced data sets should therefore focus on improving the recognition of minority-class samples.

At present, classification methods for unbalanced data sets fall into two groups: data-level methods and algorithm-level methods. This thesis proposes a feature selection method, referred to as partial feature selection. For an unbalanced data set, the minority class is treated as the positive class, and the aim is to improve the F value of the minority class.

The work proceeds as follows. First, news texts are collected from the web. Second, the texts are divided by topic into economic and non-economic categories, with the economic category as the minority class and the non-economic category as the majority class, and each text is segmented into words and lexically filtered. Third, N feature words are extracted from the training data set with each of four feature selection methods, and each text is represented as a feature word vector. Fourth, each feature vector is labeled with the category of its news text. Finally, a support vector machine is used to train a classification model, and the model's performance is tested.

The four feature selection methods used are the chi-square test, information gain, mutual information, and the partial feature selection method. The experimental results show that the minority-class F value can reach 0.65 with the chi-square test, mutual information, and information gain, and can reach 0.79 with the partial feature selection method.
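As an illustration of two ingredients named in the abstract, the sketch below scores words with the chi-square statistic and computes the F value of the minority (positive) class. It is a minimal sketch on invented toy data, not the thesis's implementation: the function names, the toy documents, and the hand-written predictions are all assumptions, the SVM training step is omitted, and the partial feature selection method is not reproduced because the abstract does not define it.

```python
from collections import Counter


def chi_square_scores(docs, labels, positive):
    """Chi-square statistic of each term's 2x2 term/class contingency table."""
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos
    df_pos, df_neg = Counter(), Counter()  # document frequency per class
    for tokens, y in zip(docs, labels):
        (df_pos if y == positive else df_neg).update(set(tokens))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # positive docs containing the term
        b = df_neg[term]   # negative docs containing the term
        c = n_pos - a      # positive docs without the term
        d = n_neg - b      # negative docs without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_n_features(scores, n):
    """The n highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


def minority_f_value(y_true, y_pred, positive):
    """F value (harmonic mean of precision and recall) of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy corpus (invented): "econ" is the minority class, "other" the majority.
docs = [
    ["stock", "market", "rises"],
    ["bank", "market", "rates"],
    ["football", "match", "tonight"],
    ["weather", "rain", "tonight"],
    ["film", "review", "tonight"],
    ["match", "review", "weather"],
]
labels = ["econ", "econ", "other", "other", "other", "other"]

scores = chi_square_scores(docs, labels, positive="econ")
selected = top_n_features(scores, 2)

# Hand-written predictions standing in for a trained classifier's output.
y_pred = ["econ", "other", "other", "econ", "other", "other"]
f_minor = minority_f_value(labels, y_pred, positive="econ")
print(selected, f_minor)
```

Note that the chi-square statistic is symmetric: it ranks a word highly whether it is associated with the minority class or with the majority class, so on unbalanced data many of the top N words may characterize the majority class only.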

Keywords/Search Tags:

Unbalanced data sets, Classification, Feature, Minor class F value

Related items

1	Pattern Recognition Method And Application Of Research Based On Default Data Sets
2	Research And Application Of Integrated Algorithms For Unbalanced Data Sets
3	Research On Classification Algorithms Of Data Mining Based On Imbalanced Data Sets
4	Research On Classification Algorithm Of Unbalanced Data Sets Based On AdaBoost-SVM
5	Research On Unbalanced Text Data Set Classification Algorithm
6	Research On Approach For Classification Of Intra-class Imbalanced Data Sets
7	Research On SVM Classification Of Unbalanced Data And Its Application In Identify Poor Students In Colleges And Universities
8	Research On Unbalanced Data Classification Based On Ensemble Learning
9	Research On Classification Method Of High-dimensional Class-imbalanced Data Sets Base On SVM
10	Feature Selection For Unbalanced Data And Emotional Dictionary Building