Study On Feature Selection And Feature Weighting Of Chinese Text Classification

Posted on: 2013-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2248330362474562
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology and the Internet, the volume of electronic documents has grown dramatically, and text classification has become a key technique for organizing and processing large-scale text collections. Text data, however, has an obvious natural attribute: it is often imbalanced, meaning there is a great disparity in the number of documents between categories, and the negative class (major class) may be hundreds of times larger than the positive class (minor class). This imbalance tends to bias the classifier toward the negative class while ignoring the positive class, so documents that belong to the positive class are easily misassigned to the negative class. As a result, classification accuracy on the positive class drops, and overall classifier performance suffers. At present, the classification of imbalanced data has become a hot topic in the field of data mining.

There are three reasons why a classifier favors the negative class and ignores the positive class on an imbalanced data set. First, imbalanced class distributions are common in many real applications of automatic text classification. Second, defects in classification algorithms make classifiers ill-suited to imbalanced data. Third, existing feature selection and feature weighting methods are biased toward features of the negative class. For the first two aspects, many algorithms have already been discussed, but research on feature selection and feature weighting methods remains insufficient. Therefore, finding effective feature selection and feature weighting methods that adapt to both relatively balanced and imbalanced data sets is a key problem in text classification.

First, to address the shortcoming that Information Gain ignores the term frequency distribution of a feature inside a class and the document distribution of a feature among classes, a factor measuring term frequency and document distribution is introduced to distinguish features strongly correlated with a class; and to address the tendency of Information Gain to favor negative-class features on imbalanced data, a further factor is introduced to reduce the contribution of such features. Then, considering the distribution characteristics of features in the positive and negative classes and synthesizing four measures of the category-discriminating ability of features, a new feature selection method based on the distribution ratio of features is proposed. Finally, to address the problem that the TF-IDF feature weighting method does not consider the distribution of features between the positive and negative classes, and therefore assigns larger weights to rare features and smaller weights to features with better class-discriminating ability, an improved TF-IDF formula is proposed.

To verify the effectiveness of the proposed feature selection methods and the improved TF-IDF formula, experiments were carried out on a relatively balanced data set and an imbalanced data set using a Chinese text classification experiment platform. The results on the different data sets show that the proposed feature selection methods achieve better dimensionality reduction, and that the improved TF-IDF formula outperforms standard TF-IDF, improving the performance of the classifier.
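For reference, the standard Information Gain criterion that the proposed factors extend is computed from document-level co-occurrence of a term and the classes only, which is exactly why it ignores the term frequency distribution inside a class and can favor features concentrated in the large negative class. A minimal sketch of this baseline (the function name and data layout are illustrative, not taken from the thesis):

    import math

    def information_gain(n_docs, class_doc_counts, term_doc_counts_per_class):
        """Standard document-level Information Gain of one term.

        n_docs                    -- total number of documents
        class_doc_counts          -- {class: number of documents in that class}
        term_doc_counts_per_class -- {class: number of documents in that class containing the term}
        Only document presence/absence is counted, so term frequency inside a
        class and the skew between a large negative class and a small positive
        class are not reflected -- the weakness the thesis targets.
        """
        def entropy(probs):
            return -sum(p * math.log2(p) for p in probs if p > 0)

        # prior class entropy H(C)
        h_c = entropy([c / n_docs for c in class_doc_counts.values()])

        n_t = sum(term_doc_counts_per_class.values())   # documents containing the term
        n_not_t = n_docs - n_t                          # documents not containing it

        # conditional entropies H(C | t) and H(C | not t)
        h_c_t = entropy([term_doc_counts_per_class[c] / n_t
                         for c in class_doc_counts]) if n_t else 0.0
        h_c_not_t = entropy([(class_doc_counts[c] - term_doc_counts_per_class[c]) / n_not_t
                             for c in class_doc_counts]) if n_not_t else 0.0

        # IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)
        return h_c - (n_t / n_docs) * h_c_t - (n_not_t / n_docs) * h_c_not_t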
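Likewise, the TF-IDF weighting that the thesis modifies is the classical form: the IDF part rewards corpus-wide rarity regardless of how a feature splits between the positive and negative classes, which is the behavior criticized above. A minimal sketch under the usual definition, using a common smoothed IDF variant (names are illustrative; the improved formula itself is not given in this abstract):

    import math

    def tf_idf(term_freq, doc_len, n_docs, doc_freq):
        """Classical TF-IDF weight of one term in one document.

        term_freq -- occurrences of the term in the document
        doc_len   -- total number of terms in the document (for normalization)
        n_docs    -- number of documents in the corpus
        doc_freq  -- number of documents containing the term
        The IDF factor depends only on corpus-wide rarity, so a rare term
        concentrated in the negative class can still receive a large weight;
        the thesis's improvement adds class-distribution information to
        correct this, but its exact form is not stated in the abstract.
        """
        tf = term_freq / doc_len
        idf = math.log(n_docs / (doc_freq + 1)) + 1   # smoothed IDF
        return tf * idf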
Keywords/Search Tags: Text Classification, Imbalanced Data Set, Feature Selection, Feature Weighting