Research On Support Vector Machines Classification Algorithm In Text Categorization

Posted on: 2008-10-22
Degree: Master
Type: Thesis
Country: China
Candidate: S R Zhang
GTID: 2178360242467573
Subject: Software engineering
Abstract/Summary:
The support vector machine (SVM), a machine learning method based on statistical learning theory, has attracted growing attention and become a hot topic in machine learning. It handles practical problems such as non-linearity, high dimensionality, small sample sets and local minima well, and it generalizes better than artificial neural networks. Text categorization is a key technique in content-based automatic information management. Text vectors are high dimensional and extremely sparse, and SVM is particularly well suited to text categorization because it is not sensitive to sparse data and copes well with high dimensional problems. However, SVM still has open problems. For example, when samples from different classes overlap heavily, they can greatly increase the computational burden, lead to overfitting and reduce generalization ability. This thesis focuses on the drawbacks of SVM, especially in text categorization. The main work is as follows.

Firstly, the classification accuracy of SVM in a two-class problem is reduced by samples that are mixed in with samples of the other class. This thesis proposes KCNN-SVM, an improved NN-SVM algorithm, which decides whether to prune a sample according to the class label of its nearest neighbor as well as the average distance in kernel space between the sample and its k nearest neighbors of the same class. Experimental results show that KCNN-SVM achieves higher classification accuracy than both SVM and NN-SVM, and that its total training and testing time is comparable to that of NN-SVM.

Secondly, although SVM performs well when it classifies using all dimensions of the text vectors, in some situations, for example to improve classification speed and accuracy, the dimension of the text vectors should be reduced. Latent Semantic Indexing (LSI) is a popular dimension reduction method. This thesis proposes a new classification model: first, the dimension of the text vectors is reduced by LSI; next, the KCNN-SVM algorithm prunes the reduced training set; finally, an SVM is trained on the resulting sample set. Experimental results show that the new classification model achieves higher classification accuracy than SVM and that it is not sensitive to the dimension of the original samples or to the penalty factor of the SVM.
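
The abstract does not give the exact KCNN-SVM pruning rule, so the following is only a minimal sketch of the general idea under stated assumptions. It computes distances in kernel space through the identity ||phi(x) - phi(y)||^2 = K(x, x) - 2K(x, y) + K(y, y). It then prunes a sample only when its nearest neighbor belongs to the other class and its average kernel-space distance to its k nearest same-class neighbors exceeds a threshold. The RBF kernel, the `threshold` parameter, the combination of the two conditions and all function names are assumptions made for illustration, not details taken from the thesis.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    # Gaussian (RBF) kernel; the kernel choice and gamma are assumptions.
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)

def kernel_distance(x, y, kernel=rbf_kernel):
    # Distance in kernel space, computed without an explicit feature map:
    # ||phi(x) - phi(y)||^2 = K(x, x) - 2 K(x, y) + K(y, y)
    sq = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # clamp tiny negative rounding errors

def kcnn_prune(samples, labels, k=2, threshold=1.0, kernel=rbf_kernel):
    # Hypothetical pruning rule (an assumption, not the thesis's exact rule).
    # Returns the indices of the samples that are kept.
    kept = []
    n = len(samples)
    for i in range(n):
        # Kernel-space distances from sample i to every other sample.
        dists = sorted(
            (kernel_distance(samples[i], samples[j], kernel), j)
            for j in range(n) if j != i
        )
        nearest_label = labels[dists[0][1]]
        if nearest_label == labels[i]:
            kept.append(i)  # nearest neighbor agrees: keep the sample
            continue
        # Nearest neighbor disagrees: check how close the sample is to its
        # k nearest neighbors of the same class.
        same = [d for d, j in dists if labels[j] == labels[i]][:k]
        avg = sum(same) / len(same) if same else float("inf")
        if avg <= threshold:
            kept.append(i)  # still close to its own class: keep it
    return kept

# Toy data: two clusters plus one +1 sample sitting inside the -1 cluster.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (3.0, 3.0), (3.1, 3.0), (3.0, 3.1),
           (3.05, 3.05)]
labels = [1, 1, 1, -1, -1, -1, 1]
kept = kcnn_prune(samples, labels, k=2, threshold=1.0)
```

On this toy data only the sample at index 6 is dropped. The three -1 samples also have an opposite-class nearest neighbor (the stray +1 sample), but they stay because they remain close to their own class in kernel space. In the pipeline described above, a step like this would run on the LSI-reduced training vectors before the final SVM is trained.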
Keywords: Support Vector Machine, kernel space, Latent Semantic Indexing, text categorization