
Normal Weight Based Feature Selection Method In SVM Text Categorization

Posted on: 2011-08-06  Degree: Master  Type: Thesis
Country: China  Candidate: H Jiang  Full Text: PDF
GTID: 2178360308952518  Subject: Communication and Information System
Abstract/Summary:
With the rapid growth of the Internet, text classification has become one of the key tasks in organizing online information and a core component of many applications. Compared with other learning algorithms, the SVM learning algorithm performs better in text classification. However, SVM-based text classification usually involves an abundance of training data, and the training process consumes substantial computing resources, so classifiers often cannot be trained over the full data set when resources are limited. Under these circumstances, introducing feature selection methods becomes significant.

This paper introduces a feature selection method based on the weights of the normal vector of the SVM model and applies it to SVM-based text classification. The method provides an effective way to maintain classification performance while reducing the dimension of the feature space, thereby significantly improving the efficiency of computing resource usage. The paper carries out research on the following points:

Firstly, in order to describe the cost of computing resources in the SVM training process, we introduce the concept of "sparsity". Sparsity is here defined as the average number of non-zero components in the vector representation of the data. The sparsity of the vectors directly affects the cost of computing resources, covering both system memory and computation time.

Secondly, we introduce a feature selection method based on the weights of the normal vector of the SVM model: a linear SVM is first trained on a subset of the training data to create initial classifiers, and the weights of the normal vector of the resulting model are then taken as the measure by which features are ranked.

Thirdly, when computing resources are limited, we compare the text classification performance of two strategies: eliminating part of the features through feature selection in order to retain as much training data as possible, versus eliminating part of the training data in order to retain as many features as possible.

Fourthly, for the linear SVM classifier, we explore the performance of the normal-based feature selection method by comparing it with two traditional feature selection methods: odds ratio and information gain.

Experimental results show that, for the linear SVM classifier, eliminating part of the features through feature selection to retain as much training data as possible performs better than eliminating part of the training data to retain as many features as possible, which provides strong empirical evidence for performing feature selection when computing resources are limited. At the same time, compared with the traditional odds ratio and information gain methods, the normal-based method yields better classification performance.
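As a minimal illustration of the sparsity measure defined above (the average number of non-zero components per document vector), the following Python sketch computes it over a sparse document-term matrix. The function name and toy data are illustrative, not taken from the thesis.

```python
import numpy as np
from scipy.sparse import csr_matrix

def average_sparsity(X: csr_matrix) -> float:
    # "Sparsity" as defined in the abstract: the average number of
    # non-zero components per document vector (per row of X).
    return X.nnz / X.shape[0]

# Toy example: three documents over a five-term vocabulary.
X = csr_matrix(np.array([
    [1, 0, 2, 0, 0],   # 2 non-zero components
    [0, 3, 0, 0, 1],   # 2 non-zero components
    [0, 0, 0, 4, 0],   # 1 non-zero component
]))
print(average_sparsity(X))  # (2 + 2 + 1) / 3 = 1.67
```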
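The normal-weight ranking described in the second point can be sketched as follows. This is an assumed rendering using scikit-learn's LinearSVC as the linear SVM (the thesis does not name an implementation): each feature j is scored by |w_j|, the magnitude of its component in the normal vector w of the separating hyperplane f(x) = w·x + b.

```python
import numpy as np
from sklearn.svm import LinearSVC

def normal_weight_ranking(X_subset, y_subset):
    # Train a linear SVM on a subset of the training data to create an
    # initial classifier, then rank features by |w_j|, the weight of
    # feature j in the normal vector of the separating hyperplane.
    svm = LinearSVC().fit(X_subset, y_subset)
    scores = np.abs(svm.coef_).max(axis=0)  # max over classes if multi-class
    return np.argsort(scores)[::-1]         # feature indices, best first

# Usage sketch: keep only the top-k ranked features, then retrain on the
# reduced feature space with as much training data as resources allow.
# ranking = normal_weight_ranking(X_small, y_small)
# X_reduced = X_full[:, ranking[:k]]
```

Scoring by |w_j| is natural for a linear SVM: features with small normal-vector weights contribute little to the decision function, so discarding them reduces the dimension (and thus the sparsity-driven resource cost) with limited impact on classification performance.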
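For reference, the two baseline criteria named in the fourth point are commonly defined as follows for a term t and a category c. These are the standard formulations from the text categorization literature, not reproduced from the thesis itself:

```latex
% Odds ratio of term t for category c:
\mathrm{OR}(t, c) = \log \frac{P(t \mid c)\,\bigl(1 - P(t \mid \bar{c})\bigr)}
                              {\bigl(1 - P(t \mid c)\bigr)\,P(t \mid \bar{c})}

% Information gain of term t over the categories \{c_i\}:
\mathrm{IG}(t) = -\sum_i P(c_i)\log P(c_i)
               + P(t)\sum_i P(c_i \mid t)\log P(c_i \mid t)
               + P(\bar{t})\sum_i P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
```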
Keywords/Search Tags: Text categorization, Feature selection, Support Vector Machine, Resource Constraint