Research On Support Vector Machines Classification Algorithm In Text Categorization

Posted on: 2008-10-22

Degree: Master

Type: Thesis

Country: China

Candidate: S R Zhang

Full Text: PDF

GTID: 2178360242467573

Subject: Software engineering

Abstract/Summary:

Support vector machine (SVM), a new machine learning method based on statistical learning theory, has attracted growing attention and has become an active topic in the field of machine learning. It not only handles practical problems such as non-linearity, high dimensionality, small sample sets and local minima well, but also generalizes better than artificial neural networks. Text categorization is a key technique in content-based automatic information management. Text vectors are high dimensional and extremely sparse. SVM is particularly well suited to text categorization and has great potential there, because it is not sensitive to sparse data and has advantages in dealing with high-dimensional problems. However, SVM still has problems to solve: for example, when samples of different classes overlap heavily, they can greatly increase the computational burden and may lead to overfitting, which reduces generalization ability. This thesis focuses on the drawbacks of SVM, especially those that arise in text categorization. The main work is as follows.

Firstly, the classification accuracy of SVM in a two-class problem is reduced by samples that are intermixed with samples of the other class. KCNN-SVM is proposed in this thesis as an improved NN-SVM algorithm. It prunes a sample according to its nearest neighbor's class label as well as the average distance in kernel space between the sample and its k nearest neighbors of the same class. Experimental results show that the KCNN-SVM algorithm achieves higher classification accuracy than both SVM and NN-SVM, and that its total training and testing time is comparable to that of NN-SVM.

Secondly, although SVM performs well when classifying with all dimensions of the text vectors, in some situations the dimension of the text vectors should be reduced, for example to improve the speed and accuracy of classification. Latent Semantic Indexing (LSI) is a popular dimension reduction method. In this thesis, a new classification model is proposed. First, the dimension of the text vectors is reduced by LSI, and the KCNN-SVM algorithm is used to prune the reduced training set. Then an SVM is trained on the resulting sample set. Experimental results show that the new classification model achieves higher classification accuracy than SVM and is not sensitive to the dimension of the original samples or to the penalty factor of the SVM.
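
The pruning idea behind KCNN-SVM can be illustrated with a minimal sketch. The abstract does not give the exact decision rule, so the rule below is an assumption: a sample is dropped when its nearest neighbor belongs to the other class and its average kernel-space distance to its k nearest same-class neighbors exceeds a threshold. The function names, the RBF kernel, and the values of `gamma`, `k` and `threshold` are all hypothetical choices made for illustration. The only fixed ingredient is the standard kernel-space distance, d(x, y)^2 = K(x, x) - 2K(x, y) + K(y, y).

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an illustrative value, not from the thesis."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_distance(x, y, kernel=rbf_kernel):
    """Distance in the kernel-induced feature space:
    d(x, y)^2 = K(x, x) - 2 K(x, y) + K(y, y)."""
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors


def prune_training_set(samples, labels, k=2, threshold=1.0, kernel=rbf_kernel):
    """Return the indices of the samples that are kept.

    Assumed rule (not confirmed by the abstract): drop sample i when
    (1) its nearest neighbor has a different label, and
    (2) the average kernel distance to its k nearest same-class
        neighbors is larger than `threshold`.
    """
    kept = []
    for i in range(len(samples)):
        # Kernel distances from sample i to every other sample, nearest first.
        dists = sorted(
            (kernel_distance(samples[i], samples[j], kernel), j)
            for j in range(len(samples))
            if j != i
        )
        nearest_label = labels[dists[0][1]]
        same_class = [d for d, j in dists if labels[j] == labels[i]][:k]
        avg_same = sum(same_class) / len(same_class) if same_class else float("inf")
        if nearest_label != labels[i] and avg_same > threshold:
            continue  # sample looks like an intruder in the other class: prune it
        kept.append(i)
    return kept


# Toy data: two well-separated clusters plus one +1 sample inside the -1 cluster.
samples = [
    (0.0, 0.0), (0.2, 0.0), (0.0, 0.2),   # class +1
    (3.0, 3.0), (3.2, 3.0), (3.0, 3.2),   # class -1
    (3.1, 3.1),                           # class +1, intermixed with class -1
]
labels = [1, 1, 1, -1, -1, -1, 1]

kept = prune_training_set(samples, labels, k=2, threshold=1.0)
print(kept)
```

In this toy data the last sample is removed, while the three class -1 samples are kept even though their nearest neighbor is the intruder, because they remain close to their own class in kernel space. That is the motivation for combining the nearest-neighbor label with a same-class distance check rather than using the label alone. In the pipeline described in the abstract, pruning of this kind would be applied after LSI dimension reduction and before SVM training.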

Keywords/Search Tags:

Support Vector Machine, kernel space, Latent Semantic Indexing, text categorization

Related items

1	Research On Text Classification Based On Ontology And Latent Semantic Indexing Algorithm
2	Research On Text Classification Filtering Technology Based On Latent Semantic Indexing And Support Vector Machine
3	Research On The Method Of Text Categorization Based On Semantic Similarity
4	Research On Web Text Categorization Based On Latent Semantic Analysis
5	Research On Chinese Text Categorization Based On Support Vector Machine
6	Text Classification Research Based On Support Vector Machine
7	Automatic Classification Research On Chinese Web Document Orientation
8	The Research Of Optimization Technology In Latent Semantic Indexing Based On Pseudo Text
9	A Latent Semantic Indexing Differences Model And Its Application
10	The Implementation And Research Of The Probabilistic Latent Semantic Analysis Model In The Search Engine's Business Text Classification System