An inherent property of features in the text classification domain is that they are redundant. In this domain, words are used as features, and because words overlap in meaning, the resulting features exhibit some degree of redundancy. By selecting a feature set with lower redundancy for the classification task, the same classification performance can be obtained with fewer features.

In this thesis, a feature selector called MIFS-C, derived from the mutual information feature selection (MIFS) algorithm, is introduced. The MIFS algorithm requires an expression for the information that is added by the inclusion of a feature. This thesis improves that formulation, which in turn improves the classification results. An optimization is also presented that achieves a significant training-time speedup over the original algorithm. The MIFS algorithms further require an appropriate value for a redundancy parameter, yet none of the previous works suggests how to select a suitable value. An algorithm for estimating an optimal value of this parameter is presented in this thesis.

A number of feature extraction techniques that generate more complex features, such as phrases and collocations, are also investigated. However, these features add further redundancy to the feature set, so a feature selector that reduces this redundancy is required. Moreover, the overall finding is that little is gained by including such features, even with a sophisticated feature selector such as MIFS-C. Therefore, better results can be achieved by combining strong feature selection (for example, the MIFS-C algorithm) with word-only features than by focusing on extracting more complicated features.
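For context, the MIFS family referenced above ranks features greedily by Battiti's criterion, choosing at each step the feature f that maximizes I(C; f) − β Σ_{s∈S} I(f; s), where C is the class, S is the set of already selected features, and β is the redundancy parameter discussed above. The sketch below illustrates only this standard criterion, not the MIFS-C refinement or the training-time optimization introduced in the thesis; the function name and the use of scikit-learn's mutual_info_score on discretized (e.g. binarized term-occurrence) features are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mifs_select(X, y, k, beta=0.5):
    """Greedy feature selection with the standard MIFS criterion.

    X    : (n_samples, n_features) array of discrete feature values.
    y    : (n_samples,) array of class labels.
    k    : number of features to select.
    beta : redundancy parameter weighting the penalty term.
    """
    n_features = X.shape[1]
    # Relevance of each feature to the class: I(C; f_i)
    relevance = np.array(
        [mutual_info_score(y, X[:, i]) for i in range(n_features)]
    )
    # Start with the single most relevant feature.
    selected = [int(np.argmax(relevance))]
    # Accumulated redundancy of each candidate w.r.t. the selected set.
    redundancy = np.zeros(n_features)
    while len(selected) < k:
        last = selected[-1]
        for i in range(n_features):
            if i not in selected:
                # Add I(f_last; f_i) contributed by the last selected feature.
                redundancy[i] += mutual_info_score(X[:, last], X[:, i])
        score = relevance - beta * redundancy
        score[selected] = -np.inf  # never re-select a feature
        selected.append(int(np.argmax(score)))
    return selected
```

The incremental update of the redundancy term (one pairwise mutual-information computation per candidate per step, rather than recomputing the full sum) is the kind of bookkeeping that speedups over the original algorithm typically exploit.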