
Research On Feature Dimension Reduction In Text Classification

Posted on: 2013-01-10    Degree: Master    Type: Thesis
Country: China    Candidate: B H Wan    Full Text: PDF
GTID: 2248330362973754    Subject: Computer system architecture
Abstract/Summary:
With the popularity and rapid development of the Internet, electronic documents on the network have been increasing rapidly, and it has become difficult for users to find, filter, and manage these vast amounts of information. Text classification techniques have therefore drawn sustained attention.

Text classification involves five procedures: text preprocessing, feature dimension reduction, feature weighting, classifier training, and classifier performance assessment. In text preprocessing, the original set of feature words is obtained by segmenting words and removing stop words from a set of texts. Feature dimension reduction is then applied to select the feature words that distinguish the text categories most strongly, and the weight of each feature word is calculated with a feature-weighting formula. According to the vector space model (VSM), each text is then represented as a vector composed of a certain number of feature words. Finally, a classifier is obtained by training, and its performance is evaluated with the relevant indicators.

Among these procedures, feature dimension reduction plays a significant role in the text classification process. A good feature dimension reduction method that reduces the dimensionality of the vector space can not only improve the speed of the classifier and save storage space, but also filter out irrelevant attributes, thereby reducing the interference of irrelevant information and improving the accuracy of text classification. According to how the new feature words are generated, feature dimension reduction can be divided into feature selection and feature extraction. Common feature selection methods include document frequency (DF), mutual information (MI), information gain (IG), the chi-square statistic (CHI), weight of evidence for text (WET), the odds ratio (OR), and combinations of several methods. The basic idea of these feature selection methods is to score each feature word with some evaluation function, sort the feature words from the highest score to the lowest, and select the top-scoring words to constitute the feature set.

Analysis of the common feature selection methods reveals two drawbacks. On one hand, these methods fail to take word frequency into consideration and are therefore prone to choosing low-frequency words. On the other hand, the relationship between feature words and categories is largely ignored. This paper proposes a novel feature selection method to overcome these drawbacks. The evaluation function value of each feature word is calculated by taking into account its text concentration among classes, its dispersion within the text classes, and its word-frequency concentration among classes. The difference between the largest and the second-largest per-class value is taken as the final assessment value of the feature word.

In the experiments, the proposed method was compared with document frequency, mutual information, information gain, and the chi-square statistic, and recall, precision, and F1 values were employed to evaluate the classification results. The new feature selection method considers not only word frequency but also the relationship between feature words and categories, while having lower computational complexity, so a better feature dimension reduction effect was obtained. Moreover, applying the same difference between the largest and the second-largest value as the global evaluation function value to both MI and CHI in turn verified the effectiveness of this strategy.
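The score-sort-select scheme described above can be made concrete with a short sketch. The Python snippet below is a minimal illustration, using the chi-square statistic (CHI), one of the methods named in the abstract, as the evaluation function and the common max-over-classes policy as the global score; the toy corpus and the function names are illustrative assumptions, not the thesis's code.

```python
from collections import Counter

def chi_square(A, B, C, D):
    """Chi-square statistic for one (term, class) pair.
    A: docs in the class containing the term; B: docs outside it containing the term;
    C: docs in the class without the term;    D: docs outside it without the term."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - B * C) ** 2 / denom

def select_top_k(docs, labels, k):
    """Score every term with an evaluation function (here CHI), sort the
    terms from the highest score to the lowest, and keep the k best."""
    classes = set(labels)
    n_c = Counter(labels)                      # documents per class
    df = {c: Counter() for c in classes}       # df[c][t]: docs of class c containing t
    for doc, c in zip(docs, labels):
        df[c].update(set(doc))
    N = len(docs)
    vocab = {t for doc in docs for t in doc}
    scores = {}
    for t in vocab:
        df_t = sum(df[c][t] for c in classes)  # docs containing t overall
        per_class = []
        for c in classes:
            A = df[c][t]
            B = df_t - A
            C = n_c[c] - A
            D = N - df_t - C
            per_class.append(chi_square(A, B, C, D))
        scores[t] = max(per_class)             # common global policy for CHI
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Tiny illustrative corpus: tokenised documents with class labels.
docs = [["stock", "market", "rise"], ["stock", "trade"],
        ["football", "match"], ["football", "goal", "match"]]
labels = ["finance", "finance", "sport", "sport"]
print(select_top_k(docs, labels, 3))           # e.g. ['stock', 'football', 'match']
```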
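After selection, each text is weighted and mapped into the vector space model. The abstract does not name its feature-weighting formula, so the sketch below assumes the standard TF-IDF weight; all names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs, features):
    """Map each document to a VSM vector over the selected feature words,
    weighted by TF-IDF (an assumed choice; the abstract names no formula)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc) if t in features)
    vectors = []
    for doc in docs:
        tf = Counter(doc)                      # term frequencies in this document
        vectors.append([tf[t] * math.log(N / df[t]) if df[t] else 0.0
                        for t in features])
    return vectors

features = ["stock", "football", "match"]      # e.g. output of select_top_k above
docs = [["stock", "market", "stock"], ["football", "match"]]
print(tfidf_vectors(docs, features))
```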
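The proposed evaluation function is described only in outline (three factors plus a largest-minus-second-largest rule), so the following sketch is an assumption about how those pieces might be combined, not the thesis's actual formula; `per_class_value` and its inputs are hypothetical.

```python
def proposed_global_score(per_class_values):
    """Global score described in the abstract: the difference between the
    largest and the second-largest per-class value, so a word scores
    highly only when it clearly favours a single class."""
    top = sorted(per_class_values, reverse=True)
    return top[0] - top[1] if len(top) > 1 else top[0]

def per_class_value(tf_in_class, tf_total, df_in_class, docs_in_class,
                    df_outside, docs_outside):
    """HYPOTHETICAL per-class evaluation combining the three factors the
    abstract lists; the exact formulas are not given there, so this
    multiplicative combination is only an assumption:
      - word-frequency concentration among classes: tf_in_class / tf_total
      - dispersion within the text class:           df_in_class / docs_in_class
      - text concentration among classes: penalise occurrence outside."""
    freq_conc = tf_in_class / tf_total if tf_total else 0.0
    dispersion = df_in_class / docs_in_class if docs_in_class else 0.0
    outside = df_outside / docs_outside if docs_outside else 0.0
    return freq_conc * dispersion * (1.0 - outside)

# A word occurring mostly in class 0: high value there, low elsewhere.
values = [per_class_value(8, 10, 4, 5, 1, 10),   # class 0
          per_class_value(1, 10, 1, 5, 4, 10),   # class 1
          per_class_value(1, 10, 1, 5, 4, 10)]   # class 2
print(proposed_global_score(values))             # large gap -> strong feature
```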
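Finally, the experiments compare the methods by recall, precision, and F1. A minimal per-class implementation of these three indicators:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall and F1, the indicators used in the
    experiments to compare the feature selection methods."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"], "a"))
# -> (1.0, 0.5, 0.666...)
```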
Keywords/Search Tags: Text Classification, Vector Space Model, Feature Dimension Reduction, Evaluation Function