
The Research On Feature Selection Methods For Text Classification

Posted on: 2008-06-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Y Yu
Full Text: PDF
GTID: 2178360215969890
Subject: Computer software and theory
Abstract/Summary:
As the key technology for organizing and processing large amounts of document data, text classification solves the problem of information disorder to a great extent and makes it convenient for users to find the required information quickly. Text classification technology is becoming more and more important in daily life and has become one of the hot topics of current research. Text classification is the procedure of automatically assigning predefined categories to free-text documents based on the documents' features.

In the text classification procedure there are thousands of features, often even more than the number of documents. With such high dimensionality it is very difficult to estimate the statistical characteristics of the samples, which leads to "over-learning" (over-fitting) and reduces classifier performance. Selecting features that represent the documents well is therefore quite necessary. Effective dimensionality reduction makes the learning task more efficient and more accurate in text classification.

Feature selection and feature extraction are two common methods for dimensionality reduction. The advantage of feature selection is that semantic information is preserved, but its performance in text classification is not always excellent. Feature extraction helps avoid the problems of synonymy and polysemy, but the semantic interpretation of the extracted features is difficult to give. Commonly used feature selection methods for text classification include CHI and information gain (IG). They are greedy methods and sometimes cannot obtain the best result; furthermore, CHI and IG perform badly when the feature dimension is extremely low.

In order to use the training set information sufficiently, we propose a new feature selection method named Class Information Feature Selection (CIFS) in this paper. We improve the OCFS method by taking the contribution within each class into account in addition to the contribution between classes, combine them into a single class-information score, and select features according to it. Experiments showed that the new approach captures the semantic information of the categories and performs better than OCFS and IG on MacroF1.

Hierarchical classification (HC) can help people obtain highly topical information, so more and more researchers have focused on it. The traditional feature selection methods for flat classification are not suitable for HC, because the features that are needed change with the hierarchical structure, so different features must be selected at different levels of the category hierarchy to distinguish the documents. Experiments on the 20NewsGroups corpus showed that hierarchical feature selection reduces the feature dimension effectively and achieves better performance.
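For reference, the two baseline criteria named above (CHI and IG) are usually defined as follows in the text-categorization literature; the thesis itself may use slight variants:

\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}

IG(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})

where, for term t and category c, A is the number of documents in c containing t, B the number of documents outside c containing t, C the number in c without t, D the number outside c without t, and N the total number of documents.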
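The abstract does not give the exact CIFS formula, so the following is only an illustrative sketch of the stated idea: an OCFS-style between-class term (weighted squared distance of each class centroid from the global centroid) is combined with a within-class spread term, here as a ratio. The function names, the ratio-style combination, and the use of per-class variance are assumptions made for this sketch, not the thesis's actual definition.

import numpy as np

def class_information_scores(X, y, eps=1e-12):
    """Score each term by between-class separation relative to within-class spread.

    X : (n_docs, n_terms) term-frequency matrix (dense numpy array).
    y : (n_docs,) integer class labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_docs = X.shape[0]
    global_centroid = X.mean(axis=0)

    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        weight = Xc.shape[0] / n_docs          # class prior
        centroid_c = Xc.mean(axis=0)
        between += weight * (centroid_c - global_centroid) ** 2
        within += weight * Xc.var(axis=0)

    # Larger score = the term separates the classes well while staying
    # stable inside each class.
    return between / (within + eps)

def select_top_k(X, y, k):
    """Return indices of the k highest-scoring terms."""
    scores = class_information_scores(X, y)
    return np.argsort(scores)[::-1][:k]

A typical use would be select_top_k(X_train, y_train, 2000) to keep the 2000 best terms before training the classifier.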
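The per-node selection strategy for hierarchical classification is likewise only described in words; a minimal sketch of one common way to realize it (hypothetical names, reusing any flat scorer such as the one above) could look like this:

import numpy as np

def per_node_feature_selection(X, parent_labels, child_labels, k, score_fn):
    """Choose a separate feature subset for every internal node of the hierarchy.

    parent_labels : label of each document at the upper level of the tree.
    child_labels  : label of each document at the level below.
    score_fn      : any flat feature scorer, e.g. class_information_scores.
    """
    X = np.asarray(X, dtype=float)
    parent_labels = np.asarray(parent_labels)
    child_labels = np.asarray(child_labels)

    features_for_node = {}
    for node in np.unique(parent_labels):
        mask = parent_labels == node
        # Only this node's own documents are used, and terms are scored by
        # how well they separate that node's child categories.
        scores = score_fn(X[mask], child_labels[mask])
        features_for_node[node] = np.argsort(scores)[::-1][:k]
    return features_for_node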
Keywords/Search Tags: Feature selection, distribution between categories, distribution within class, class information, text classification, hierarchical classification, hierarchical feature selection