
Research On Feature Dimension Reduction In Text Classification

Posted on: 2013-01-10    Degree: Master    Type: Thesis
Country: China    Candidate: B H Wan    Full Text: PDF
GTID: 2248330362973754    Subject: Computer system architecture
Abstract/Summary:
With the popularity and rapid development of the Internet, electronic documents on the network have been increasing rapidly, and it has become difficult for users to find, filter, and manage these vast amounts of information. Text classification techniques have therefore drawn sustained attention.

Text classification involves five procedures: text preprocessing, feature dimension reduction, feature weighting, classifier training, and classifier performance assessment. In text preprocessing, the original set of feature words is obtained by segmenting words and removing stop words from a set of texts. Feature dimension reduction is then applied to select the feature words that distinguish the text categories most strongly, and the weight of each feature word is calculated with a feature-weighting formula. According to the vector space model (VSM), each text is then represented as a vector composed of a certain number of feature words. Finally, a classifier is obtained by training, and its performance is evaluated with the relevant indicators.

Among these procedures, feature dimension reduction plays a significant role in the text classification process. A good feature dimension reduction method that reduces the dimensionality of the vector space can not only improve the speed of the classifier and save storage space, but also filter out irrelevant attributes, thereby reducing the interference of irrelevant information and improving the accuracy of text classification. According to how the new feature words are generated, feature dimension reduction can be divided into feature selection and feature extraction. Common feature selection methods include document frequency (DF), mutual information (MI), information gain (IG), the chi-square statistic (CHI), weight of evidence for text (WET), the odds ratio (OR), and combinations of several methods. The basic idea of these feature selection methods is to score each feature word with some evaluation function, sort the feature words from the highest score to the lowest, and select the top-scoring words to constitute the feature set.

Analysis of the common feature selection methods reveals two drawbacks. On one hand, these methods fail to take word frequency into consideration and are therefore prone to choosing low-frequency words. On the other hand, the relationship between feature words and categories is largely ignored. This paper proposes a novel feature selection method to overcome these drawbacks. The evaluation function value of each feature word is calculated by taking into account its text concentration among classes, its dispersion within the text classes, and its word-frequency concentration among classes. The difference between the largest and the second-largest per-class value is taken as the final assessment value of the feature word.

In the experiments, the proposed method was compared with document frequency, mutual information, information gain, and the chi-square statistic, and recall, precision, and F1 values were employed to evaluate the classification results. The new feature selection method considers not only word frequency but also the relationship between feature words and categories, while having lower computational complexity, so a better feature dimension reduction effect was obtained. Moreover, applying the same difference between the largest and the second-largest value as the global evaluation function value to both MI and CHI in turn verified the effectiveness of this strategy.
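The score-sort-select scheme described above can be made concrete with a short sketch. The Python snippet below is a minimal illustration, using the chi-square statistic (CHI), one of the methods named in the abstract, as the evaluation function and the common max-over-classes policy as the global score; the toy corpus and the function names are illustrative assumptions, not the thesis's code.

```python
from collections import Counter

def chi_square(A, B, C, D):
    """Chi-square statistic for one (term, class) pair.
    A: docs in the class containing the term; B: docs outside it containing the term;
    C: docs in the class without the term;    D: docs outside it without the term."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

def select_top_k(docs, labels, k):
    """Score every term with an evaluation function (here CHI), sort the
    terms from the highest score to the lowest, and keep the k best."""
    classes = set(labels)
    n_c = Counter(labels)                      # documents per class
    df = {c: Counter() for c in classes}       # df[c][t]: docs of class c containing t
    for doc, c in zip(docs, labels):
        df[c].update(set(doc))
    N = len(docs)
    vocab = {t for doc in docs for t in doc}
    scores = {}
    for t in vocab:
        df_t = sum(df[c][t] for c in classes)  # docs containing t overall
        per_class = []
        for c in classes:
            A = df[c][t]
            B = df_t - A
            C = n_c[c] - A
            D = N - df_t - C
            per_class.append(chi_square(A, B, C, D))
        scores[t] = max(per_class)             # common global policy for CHI
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Tiny illustrative corpus: tokenised documents with class labels.
docs = [["stock", "market", "rise"], ["stock", "trade"],
        ["football", "match"], ["football", "goal", "match"]]
labels = ["finance", "finance", "sport", "sport"]
print(select_top_k(docs, labels, 3))           # e.g. ['stock', 'football', 'match']
```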
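After selection, each text is weighted and mapped into the vector space model. The abstract does not name its feature-weighting formula, so the sketch below assumes the standard TF-IDF weight; all names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs, features):
    """Map each document to a VSM vector over the selected feature words,
    weighted by TF-IDF (an assumed choice; the abstract names no formula)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc) if t in features)
    vectors = []
    for doc in docs:
        tf = Counter(doc)                      # term frequencies in this document
        vectors.append([tf[t] * math.log(N / df[t]) if df[t] else 0.0
                        for t in features])
    return vectors

features = ["stock", "football", "match"]      # e.g. output of select_top_k above
docs = [["stock", "market", "stock"], ["football", "match"]]
print(tfidf_vectors(docs, features))
```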
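The proposed evaluation function is described only in outline (three factors plus a largest-minus-second-largest rule), so the following sketch is an assumption about how those pieces might be combined, not the thesis's actual formula; `per_class_value` and its inputs are hypothetical.

```python
def proposed_global_score(per_class_values):
    """Global score described in the abstract: the difference between the
    largest and the second-largest per-class value, so a word scores
    highly only when it clearly favours a single class."""
    top = sorted(per_class_values, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]

def per_class_value(tf_in_class, tf_total, df_in_class, docs_in_class,
                    df_outside, docs_outside):
    """HYPOTHETICAL per-class evaluation combining the three factors the
    abstract lists; the exact formulas are not given there, so this
    multiplicative combination is only an assumption:
      - word-frequency concentration among classes: tf_in_class / tf_total
      - dispersion within the text class:           df_in_class / docs_in_class
      - text concentration among classes: penalise occurrence outside."""
    freq_conc = tf_in_class / tf_total if tf_total else 0.0
    dispersion = df_in_class / docs_in_class if docs_in_class else 0.0
    outside = df_outside / docs_outside if docs_outside else 0.0
    return freq_conc * dispersion * (1.0 - outside)

# A word occurring mostly in class 0: high value there, low elsewhere.
values = [per_class_value(8, 10, 4, 5, 1, 10),   # class 0
          per_class_value(1, 10, 1, 5, 4, 10),   # class 1
          per_class_value(1, 10, 1, 5, 4, 10)]   # class 2
print(proposed_global_score(values))             # large gap -> strong feature
```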
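Finally, the experiments compare the methods by recall, precision, and F1. A minimal per-class implementation of these three indicators:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1, the indicators used in the
    experiments to compare the feature selection methods."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], "a"))
# -> (1.0, 0.5, 0.666...)
```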
Keywords/Search Tags: Text Classification, Vector Space Model, Feature Dimension Reduction, Evaluation Function