Research On SVM And Text Classification

Posted on: 2007-07-19
Degree: Master
Type: Thesis
Country: China
Candidate: X X Niu
Full Text: PDF
GTID: 2178360182480726
Subject: Computer application technology
Abstract/Summary:
With the rapid growth of the World Wide Web, classifying natural language documents into a predefined set of semantic categories has become one of the key methods for organizing online information. This task is commonly referred to as text classification. The exponential growth in the number of online documents, together with the increasing pace at which information must be distributed, has created the need for automatic document classification.

The approach presented in this thesis is based on the key insight that the margin, the complexity measure used in the support vector machine (SVM), is well suited to text classification. The learning algorithm is given access to labeled training documents and produces a classification rule automatically. The main work is as follows:

1. After introducing text representations, feature selection, and criteria for evaluating predictive performance, we implement the processing steps (stemming, removal of high- and low-frequency words, and term weighting) that generate the feature dictionary and transform the training and test documents into numerical vectors. An experimental text classification system based on SVM is then designed. Tested on the Reuters-21578 corpus, the system demonstrates that SVM can efficiently, effectively, and provably solve the problem of learning text classifiers from examples for a large and well-defined class of problems.

2. To address overfitting and long training times in SVM, this thesis proposes combining SVM with subtractive clustering. Subtractive clustering selects a set of cluster centers, which are themselves data samples, to represent the original large training set. The reduced training set is then used to construct the support vector machines. Two benchmarks, one two-class recognition problem and one multi-class problem, are tested. The results show that the SVM based on subtractive clustering achieves equal or better classification accuracy and generalization ability than conventional support vector machines, while using a smaller training set and less optimization time.
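The sample-selection step described in item 2 can be illustrated with a minimal sketch of subtractive clustering. The abstract does not give the thesis's actual formulation, so this follows the commonly cited potential-based form of the method with a single simplified stopping threshold. The function name, the radius `ra`, the factor `rb_factor`, and `stop_ratio` are all illustrative assumptions, not values taken from the thesis.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def subtractive_clustering(points, ra=0.5, rb_factor=1.5, stop_ratio=0.15,
                           max_centers=None):
    """Return indices of the samples chosen as cluster centers.

    Assumed formulation (not confirmed by the abstract):
      potential P_i = sum_j exp(-alpha * ||x_i - x_j||^2), alpha = 4 / ra^2
      after picking center c with potential P_c:
      P_i -= P_c * exp(-beta * ||x_i - x_c||^2), beta = 4 / (rb_factor*ra)^2
    Selection stops once the best remaining potential drops below
    stop_ratio times the first peak (a simplified stopping rule).
    """
    alpha = 4.0 / ra ** 2
    beta = 4.0 / (rb_factor * ra) ** 2

    # Initial potential: a density estimate at every sample.
    potential = [sum(math.exp(-alpha * sq_dist(p, q)) for q in points)
                 for p in points]
    first_peak = max(potential)

    centers = []
    limit = max_centers or len(points)
    while len(centers) < limit:
        k = max(range(len(points)), key=potential.__getitem__)
        peak = potential[k]
        if peak < stop_ratio * first_peak:
            break
        centers.append(k)
        # Suppress the potential of samples near the new center.
        for i, p in enumerate(points):
            potential[i] -= peak * math.exp(-beta * sq_dist(p, points[k]))
    return centers


# Toy data: two tight groups of four samples each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.05, 0.05),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.05, 5.05)]
centers = subtractive_clustering(points)
print(sorted(centers))  # [3, 7]: the middle sample of each group
```

Because every center is an original sample, its label carries over unchanged, so the selected samples and their labels can be passed directly to any SVM trainer as the reduced training set. Whether the thesis runs the clustering on the whole training set or separately per class is not stated in the abstract.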
Keywords/Search Tags:Text classification, Feature selection, Support vector machine, Subtractive clustering