
Research On Chinese Text Categorization Based On Support Vector Machine

Posted on: 2010-08-11
Degree: Master
Type: Thesis
Country: China
Candidate: F Jiang
Full Text: PDF
GTID: 2178360278962391
Subject: Computer system architecture
Abstract/Summary:
With the rapid development and growing popularity of the Internet, more and more information exists online in the form of electronic documents. Extracting valuable knowledge from this massive volume of documents has become a major goal of information processing, and text categorization, as an important part of that field, has become a major research direction. With text categorization techniques, documents can be processed automatically according to an organization's classification scheme, which helps people locate the information they need accurately. Text categorization is also an underlying technology for information filtering, information retrieval, search engines and other areas, so it has broad application prospects.

The categorization algorithm is the most critical factor in the performance of a text categorization system. The support vector machine (SVM) is a machine learning technique developed by Vapnik from statistical learning theory. It is widely studied and used for text categorization because of its good generalization performance, its globally optimal solution and its simple structure.

This thesis studies the text categorization problem and investigates SVM kernel functions in depth. After analyzing the traditional polynomial kernel function and its poor learning ability, we combine it with a conditionally positive definite kernel, which has strong learning ability, to obtain an improved polynomial SVM classifier for text categorization. The work consists of the following parts:

① Discuss some of the key techniques in text categorization: feature selection algorithms, feature weighting and categorization algorithms. Compare the advantages and disadvantages of the feature selection and categorization algorithms commonly used in text categorization.

② Introduce a class of kernels that do not satisfy Mercer's condition but can still be used for kernel-based learning. Analyze the advantages and disadvantages of these conditionally positive definite kernels and apply them to text categorization.

③ Analyze the characteristics of the polynomial kernel function. To address its poor learning ability, combine it with a conditionally positive definite kernel, which has good learning ability, into a mixed kernel function. The resulting improved polynomial kernel SVM text classifier has both good generalization performance and good learning performance, and its structure is intrinsically related to the similarity measures used for text vectors.

④ To verify the improvement, compare the improved polynomial kernel function with the original polynomial kernel function on the same data sets. The experimental results show that the improved polynomial kernel SVM text classifier outperforms the polynomial kernel SVM text classifier.

⑤ During the experiments, the first-order polynomial kernel function and the second-order conditionally positive definite kernel function always produced the same classification results on three different data sets. Based on this observation, the thesis proposes a conjecture: using the first-order polynomial kernel function as the SVM kernel is equivalent to using the second-order conditionally positive definite kernel function as the SVM kernel.
Keywords/Search Tags: Support Vector Machines, Polynomial Kernel, Conditionally Positive Definite Kernel, Text Categorization, Feature Selection
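
The abstract does not give the kernel formulas, so the following is only a minimal sketch under stated assumptions. It assumes that the conditionally positive definite kernel is the negative distance power kernel, K(x, y) = -||x - y||^beta, and that the mixed kernel is a weighted sum of the two kernels; the weight `lam` and all function names are hypothetical and may differ from the thesis. Under these assumptions, the sketch checks numerically that for coefficients summing to zero (as the SVM dual constraint requires of alpha_i * y_i), the quadratic form of the kernel with beta = 2 equals twice that of the degree-1 polynomial kernel, which is one possible explanation for the conjecture in ⑤.

```python
import random


def dot(x, y):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, degree=1, coef0=0.0):
    """Polynomial kernel (x . y + coef0) ** degree."""
    return (dot(x, y) + coef0) ** degree


def cpd_kernel(x, y, beta=2.0):
    """ASSUMED form of the CPD kernel: -||x - y|| ** beta, with 0 < beta <= 2."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return -(sq_dist ** (beta / 2.0))


def mixed_kernel(x, y, lam=0.5, degree=2, beta=1.0):
    """HYPOTHETICAL mix: lam * polynomial + (1 - lam) * CPD.

    The abstract does not state how the thesis combines the two kernels.
    """
    return lam * poly_kernel(x, y, degree, 1.0) + (1.0 - lam) * cpd_kernel(x, y, beta)


def quad_form(kernel, xs, c):
    """Sum over i, j of c_i * c_j * kernel(x_i, x_j), as in the SVM dual objective."""
    n = len(xs)
    return sum(c[i] * c[j] * kernel(xs[i], xs[j]) for i in range(n) for j in range(n))


def check_equivalence(n=6, dim=4, seed=0):
    """Return both quadratic forms for random points and zero-sum coefficients."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    mean = sum(c) / n
    c = [ci - mean for ci in c]  # enforce sum(c) == 0 up to rounding
    q_cpd = quad_form(lambda x, y: cpd_kernel(x, y, beta=2.0), xs, c)
    q_lin = quad_form(lambda x, y: poly_kernel(x, y, degree=1), xs, c)
    return q_cpd, q_lin


if __name__ == "__main__":
    q_cpd, q_lin = check_equivalence()
    print(q_cpd, 2.0 * q_lin)
```

The two printed values agree up to rounding because -||x - y||^2 = -||x||^2 - ||y||^2 + 2 x . y, and the first two terms cancel when the coefficients sum to zero. The remaining factor of 2 only rescales the kernel, which corresponds to rescaling the penalty parameter C, so this check does not by itself prove that two classifiers trained with the same C are identical.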