
Normal Weight Based Feature Selection Method In SVM Text Categorization

Posted on: 2011-08-06  Degree: Master  Type: Thesis
Country: China  Candidate: H Jiang  Full Text: PDF
GTID: 2178360308952518  Subject: Communication and Information System
Abstract/Summary:
With the rapid growth of the Internet, text classification has become one of the key tasks in organizing online information and a core component of many applications. Compared with other learning algorithms, the SVM learning algorithm performs better in text classification. However, SVM-based text classification usually involves an abundance of training data, and the training process consumes substantial computing resources, so classifiers often cannot be trained over the full data set when resources are limited. Under these circumstances, introducing feature selection methods becomes significant.

This paper introduces a feature selection method based on the weights of the normal vector of the SVM model and applies it to SVM-based text classification. The method provides an effective way to maintain classification performance while reducing the dimension of the feature space, thereby significantly improving the efficiency of computing resource usage. The paper carries out research on the following points:

Firstly, in order to describe the cost of computing resources in the SVM training process, we introduce the concept of "sparsity". Sparsity is here defined as the average number of non-zero components in the vector representation of the data. The sparsity of the vectors directly affects the cost of computing resources, covering both system memory and computation time.

Secondly, we introduce a feature selection method based on the weights of the normal vector of the SVM model: a linear SVM is first trained on a subset of the training data to create initial classifiers, and the weights of the normal vector of the resulting model are then taken as the measure by which features are ranked.

Thirdly, when computing resources are limited, we compare the text classification performance of two strategies: eliminating part of the features through feature selection in order to retain as much training data as possible, versus eliminating part of the training data in order to retain as many features as possible.

Fourthly, for the linear SVM classifier, we explore the performance of the normal-based feature selection method by comparing it with two traditional feature selection methods: odds ratio and information gain.

Experimental results show that, for the linear SVM classifier, eliminating part of the features through feature selection to retain as much training data as possible performs better than eliminating part of the training data to retain as many features as possible, which provides strong empirical evidence for performing feature selection when computing resources are limited. At the same time, compared with the traditional odds ratio and information gain methods, the normal-based method yields better classification performance.
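As a minimal illustration of the sparsity measure defined above (the average number of non-zero components per document vector), the following Python sketch computes it over a sparse document-term matrix. The function name and toy data are illustrative, not taken from the thesis.

```python
import numpy as np
from scipy.sparse import csr_matrix

def average_sparsity(X: csr_matrix) -> float:
    # "Sparsity" as defined in the abstract: the average number of
    # non-zero components per document vector (per row of X).
    return X.nnz / X.shape[0]

# Toy example: three documents over a five-term vocabulary.
X = csr_matrix(np.array([
    [1, 0, 2, 0, 0],   # 2 non-zero components
    [0, 3, 0, 0, 1],   # 2 non-zero components
    [0, 0, 0, 4, 0],   # 1 non-zero component
]))
print(average_sparsity(X))  # (2 + 2 + 1) / 3 = 1.67
```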
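The normal-weight ranking described in the second point can be sketched as follows. This is an assumed rendering using scikit-learn's LinearSVC as the linear SVM (the thesis does not name an implementation): each feature j is scored by |w_j|, the magnitude of its component in the normal vector w of the separating hyperplane f(x) = w·x + b.

```python
import numpy as np
from sklearn.svm import LinearSVC

def normal_weight_ranking(X_subset, y_subset):
    # Train a linear SVM on a subset of the training data to create an
    # initial classifier, then rank features by |w_j|, the weight of
    # feature j in the normal vector of the separating hyperplane.
    svm = LinearSVC().fit(X_subset, y_subset)
    scores = np.abs(svm.coef_).max(axis=0)  # max over classes if multi-class
    return np.argsort(scores)[::-1]         # feature indices, best first

# Usage sketch: keep only the top-k ranked features, then retrain on the
# reduced feature space with as much training data as resources allow.
# ranking = normal_weight_ranking(X_small, y_small)
# X_reduced = X_full[:, ranking[:k]]
```

Scoring by |w_j| is natural for a linear SVM: features with small normal-vector weights contribute little to the decision function, so discarding them reduces the dimension (and thus the sparsity-driven resource cost) with limited impact on classification performance.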
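For reference, the two baseline criteria named in the fourth point are commonly defined as follows for a term t and a category c. These are the standard formulations from the text categorization literature, not reproduced from the thesis itself:

```latex
% Odds ratio of term t for category c:
\mathrm{OR}(t, c) = \log \frac{P(t \mid c)\,\bigl(1 - P(t \mid \bar{c})\bigr)}
                              {\bigl(1 - P(t \mid c)\bigr)\,P(t \mid \bar{c})}

% Information gain of term t over the categories \{c_i\}:
\mathrm{IG}(t) = -\sum_i P(c_i)\log P(c_i)
               + P(t)\sum_i P(c_i \mid t)\log P(c_i \mid t)
               + P(\bar{t})\sum_i P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
```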
Keywords/Search Tags: Text categorization, Feature selection, Support Vector Machine, Resource Constraint