An inherent property of features in the text classification domain is that they are redundant. In this domain, words are used as features, and because words overlap in meaning, the resulting features exhibit some degree of redundancy. By selecting a feature set with lower redundancy for the classification task, the same classification performance can be obtained with fewer features.

In this thesis, a feature selector called MIFS-C, derived from the mutual information feature selection (MIFS) algorithm, is introduced. The MIFS algorithm requires an expression for the information that is added by the inclusion of a feature. This thesis improves that formulation, which in turn improves the classification results. An optimization is also presented that achieves a significant training-time speedup over the original algorithm. The MIFS algorithms further require an appropriate value for a redundancy parameter, yet none of the previous works suggests how to select a suitable value. An algorithm for estimating an optimal value of this parameter is presented in this thesis.

A number of feature extraction techniques that generate more complex features, such as phrases and collocations, are also investigated. However, these features add further redundancy to the feature set, so a feature selector that reduces this redundancy is required. Moreover, the overall finding is that little is gained by including such features, even with a sophisticated feature selector such as MIFS-C. Therefore, better results can be achieved by combining strong feature selection (for example, the MIFS-C algorithm) with word-only features than by focusing on extracting more complicated features.
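For context, the MIFS family referenced above ranks features greedily by Battiti's criterion, choosing at each step the feature f that maximizes I(C; f) − β Σ_{s∈S} I(f; s), where C is the class, S is the set of already selected features, and β is the redundancy parameter discussed above. The sketch below illustrates only this standard criterion, not the MIFS-C refinement or the training-time optimization introduced in the thesis; the function name and the use of scikit-learn's mutual_info_score on discretized (e.g. binarized term-occurrence) features are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mifs_select(X, y, k, beta=0.5):
    """Greedy feature selection with the standard MIFS criterion.

    X    : (n_samples, n_features) array of discrete feature values.
    y    : (n_samples,) array of class labels.
    k    : number of features to select.
    beta : redundancy parameter weighting the penalty term.
    """
    n_features = X.shape[1]
    # Relevance of each feature to the class: I(C; f_i)
    relevance = np.array(
        [mutual_info_score(y, X[:, i]) for i in range(n_features)]
    )
    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    # Accumulated redundancy of each candidate w.r.t. the selected set.
    redundancy = np.zeros(n_features)
    while len(selected) < k:
        last = selected[-1]
        for i in range(n_features):
            if i not in selected:
                # Add I(f_last; f_i) contributed by the last selected feature.
                redundancy[i] += mutual_info_score(X[:, last], X[:, i])
        score = relevance - beta * redundancy
        score[selected] = -np.inf  # never re-select a feature
        selected.append(int(np.argmax(score)))
    return selected
```

The incremental update of the redundancy term (one pairwise mutual-information computation per candidate per step, rather than recomputing the full sum) is the kind of bookkeeping that speedups over the original algorithm typically exploit.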