Study On Feature Selection And Feature Weighting Of Chinese Text Classification

Posted on: 2013-06-08
Degree: Master
Type: Thesis
Country: China
Candidate: Y Wang
Full Text: PDF
GTID: 2248330362474562
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of information technology and the Internet, the volume of electronic documents has grown dramatically, and text classification has become a key technique for organizing and processing large-scale text collections. Text data, however, has an obvious natural attribute: it is often imbalanced, meaning there is a great disparity in the number of documents between categories, and the negative class (major class) may be hundreds of times larger than the positive class (minor class). This imbalance tends to bias the classifier toward the negative class while ignoring the positive class, so documents that belong to the positive class are easily misassigned to the negative class. As a result, classification accuracy on the positive class drops, and overall classifier performance suffers. At present, the classification of imbalanced data has become a hot topic in the field of data mining.

There are three reasons why a classifier favors the negative class and ignores the positive class on an imbalanced data set. First, imbalanced class distributions are common in many real applications of automatic text classification. Second, defects in classification algorithms make classifiers ill-suited to imbalanced data. Third, existing feature selection and feature weighting methods are biased toward features of the negative class. For the first two aspects, many algorithms have already been discussed, but research on feature selection and feature weighting methods remains insufficient. Therefore, finding effective feature selection and feature weighting methods that adapt to both relatively balanced and imbalanced data sets is a key problem in text classification.

First, to address the shortcoming that Information Gain ignores the term frequency distribution of a feature inside a class and the document distribution of a feature among classes, a factor measuring term frequency and document distribution is introduced to distinguish features strongly correlated with a class; and to address the tendency of Information Gain to favor negative-class features on imbalanced data, a further factor is introduced to reduce the contribution of such features. Then, considering the distribution characteristics of features in the positive and negative classes and synthesizing four measures of the category-discriminating ability of features, a new feature selection method based on the distribution ratio of features is proposed. Finally, to address the problem that the TF-IDF feature weighting method does not consider the distribution of features between the positive and negative classes, and therefore assigns larger weights to rare features and smaller weights to features with better class-discriminating ability, an improved TF-IDF formula is proposed.

To verify the effectiveness of the proposed feature selection methods and the improved TF-IDF formula, experiments were carried out on a relatively balanced data set and an imbalanced data set using a Chinese text classification experiment platform. The results on the different data sets show that the proposed feature selection methods achieve better dimensionality reduction, and that the improved TF-IDF formula outperforms standard TF-IDF, improving the performance of the classifier.
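For reference, the standard Information Gain criterion that the proposed factors extend is computed from document-level co-occurrence of a term and the classes only, which is exactly why it ignores the term frequency distribution inside a class and can favor features concentrated in the large negative class. A minimal sketch of this baseline (the function name and data layout are illustrative, not taken from the thesis):

    import math

    def information_gain(n_docs, class_doc_counts, term_doc_counts_per_class):
        """Standard document-level Information Gain of one term.

        n_docs                    -- total number of documents
        class_doc_counts          -- {class: number of documents in that class}
        term_doc_counts_per_class -- {class: number of documents in that class containing the term}
        Only document presence/absence is counted, so term frequency inside a
        class and the skew between a large negative class and a small positive
        class are not reflected -- the weakness the thesis targets.
        """
        def entropy(probs):
            return -sum(p * math.log2(p) for p in probs if p > 0)

        # prior class entropy H(C)
        h_c = entropy([c / n_docs for c in class_doc_counts.values()])

        n_t = sum(term_doc_counts_per_class.values())   # documents containing the term
        n_not_t = n_docs - n_t                          # documents not containing it

        # conditional entropies H(C | t) and H(C | not t)
        h_c_t = entropy([term_doc_counts_per_class[c] / n_t
                         for c in class_doc_counts]) if n_t else 0.0
        h_c_not_t = entropy([(class_doc_counts[c] - term_doc_counts_per_class[c]) / n_not_t
                             for c in class_doc_counts]) if n_not_t else 0.0

        # IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t)
        return h_c - (n_t / n_docs) * h_c_t - (n_not_t / n_docs) * h_c_not_t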
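Likewise, the TF-IDF weighting that the thesis modifies is the classical form: the IDF part rewards corpus-wide rarity regardless of how a feature splits between the positive and negative classes, which is the behavior criticized above. A minimal sketch under the usual definition, using a common smoothed IDF variant (names are illustrative; the improved formula itself is not given in this abstract):

    import math

    def tf_idf(term_freq, doc_len, n_docs, doc_freq):
        """Classical TF-IDF weight of one term in one document.

        term_freq -- occurrences of the term in the document
        doc_len   -- total number of terms in the document (for normalization)
        n_docs    -- number of documents in the corpus
        doc_freq  -- number of documents containing the term
        The IDF factor depends only on corpus-wide rarity, so a rare term
        concentrated in the negative class can still receive a large weight;
        the thesis's improvement adds class-distribution information to
        correct this, but its exact form is not stated in the abstract.
        """
        tf = term_freq / doc_len
        idf = math.log(n_docs / (doc_freq + 1)) + 1   # smoothed IDF
        return tf * idf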
Keywords/Search Tags: Text Classification, Imbalanced Data Set, Feature Selection, Feature Weighting