Text categorization is a hotspot of the data mining research field. Its task is to assign a Boolean value to each pair ⟨dj, ci⟩ ∈ D × C, where D is a domain of documents and C = {c1, …, c|C|} is a set of predefined categories. Among the main text categorization algorithms, SVM has been shown to perform well in both accuracy and efficiency.

SVM was originally proposed for binary classification, whereas multi-class classification is far more common in practical applications. How to apply SVM to multi-class classification while retaining its good performance remains a critical problem; an effective SVM-based multi-class classification algorithm is still needed, especially when the number of classes is large.

At present, two methods are used for SVM multi-class classification. One is to unite the parameter calculations of several optimal hyperplanes into a single optimization problem, so that the multi-class classification is done by solving that problem. The other is to partition the multi-class classification problem into several binary classification problems and combine the resulting binary classifiers according to some strategy.

Based on the second method, the thesis proposes a new tree-structured SVMs algorithm using a concept vector space model. Experiments show that it does improve both accuracy and efficiency.

Chapter 1 introduces the background of the topic and the current research results in this field. Better multi-class text categorization is a practical need of the real world as well as a hotspot among researchers. The thesis studies the existing text categorization algorithms and introduces ontology into the traditional algorithm, proposing a new algorithm, CVSM-SVMs.
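The second (decomposition) method can be sketched as follows. This is a minimal illustration, not the thesis's implementation: a toy nearest-centroid scorer stands in for each trained binary SVM, and all function names are illustrative.

```python
# Sketch of the decomposition strategy: one binary "classifier" per class,
# combined by maximum output. A toy nearest-centroid scorer stands in for
# a trained binary SVM; a real system would train one SVM per class.

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train_one_vs_rest(samples):
    """samples: list of (vector, label) pairs. Returns one scorer per class,
    here simply the class centroid."""
    by_class = {}
    for vec, label in samples:
        by_class.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_class.items()}

def predict(classifiers, vec):
    """Maximum-output combination: the class whose scorer fires strongest
    (here: whose centroid is nearest) wins."""
    def score(label):
        c = classifiers[label]
        return -sum((x - y) ** 2 for x, y in zip(vec, c))  # negative squared distance
    return max(classifiers, key=score)
```

The same decomposition skeleton applies whatever the binary learner is; only the per-class training step and the scoring function change.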
The algorithm is then implemented and compared with other algorithms.

Chapter 2 introduces the relevant background on text categorization and SVM, including the concept, procedure, and evaluation criteria of text categorization, as well as the theoretical foundation, related definitions, and solving procedure of Support Vector Machines. The current approaches to SVM-based multi-class text categorization are highlighted.

Chapter 3 is the focus of the thesis; its main idea and the algorithm's process are introduced in this chapter. There are two strategies for applying SVMs to multi-class classification. One is to unite the parameter calculations of several optimal hyperplanes into a single optimization problem, so that the multi-class classification is done by solving that problem. This seems simple and clear, but it involves so many variables that implementation is difficult and the computational complexity is high. When there are many classes, the training speed is low and the classification accuracy is poor as well. The other strategy is to partition the multi-class classification problem into several binary classification problems and combine the binary classifiers according to some strategy. The thesis is based on this second strategy.

The partition strategy must solve two problems: how to separate the training data into two parts, and how to combine the trained binary classifiers. For separating the data there are several possible ways, such as 1-a-1, 1-a-r, class average vectors, and clustering; the thesis adopts the clustering method. For combining, possible ways include maximum output, voting, directed acyclic graphs, and tree structures. The same partition method can be paired with different combination methods; the 1-a-1 algorithm and the DAG algorithm are one example.
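The separate-by-clustering, combine-by-tree idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the thesis's algorithm: a crude two-group split of class average vectors stands in for a real clustering step, centroid distance stands in for the binary SVM decision at each node, class labels are assumed to be strings, and all names are illustrative.

```python
import itertools

def dist(u, v):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def build_tree(averages):
    """averages: {class_label: class_average_vector}. Returns a leaf (a label
    string) or a node (left_centroid, right_centroid, left_subtree, right_subtree).
    A real implementation would train a binary SVM at every internal node."""
    labels = list(averages)
    if len(labels) == 1:
        return labels[0]                       # leaf: a single class
    # crude split: seed the two groups with the most distant pair of averages
    a, b = max(itertools.combinations(labels, 2),
               key=lambda p: dist(averages[p[0]], averages[p[1]]))
    left, right = [a], [b]
    for c in labels:
        if c not in (a, b):
            closer_left = dist(averages[c], averages[a]) <= dist(averages[c], averages[b])
            (left if closer_left else right).append(c)
    lcen = mean([averages[c] for c in left])
    rcen = mean([averages[c] for c in right])
    return (lcen, rcen,
            build_tree({c: averages[c] for c in left}),
            build_tree({c: averages[c] for c in right}))

def classify(tree, vec):
    """Route a sample down the tree; each node plays the role of one binary SVM."""
    while not isinstance(tree, str):
        lcen, rcen, lsub, rsub = tree
        tree = lsub if dist(vec, lcen) <= dist(vec, rcen) else rsub
    return tree
```

Classifying a sample then costs one binary decision per tree level rather than one per class pair, which is where the efficiency of the tree structure comes from; it is also where the error accumulation discussed below originates, since a wrong decision near the root cannot be recovered further down.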
We use a tree structure in the thesis.

A drawback of tree-structured SVMs is the error accumulation problem: if a sample is misclassified by a higher-level classifier, the error is passed on to the lower-level classifiers, which pushes the sample further and further away from its correct category.

To improve the effect of clustering, we introduce ontology into tree-structured SVMs. Traditional SVMs use terms as features and assume the features are linearly independent; however, words in natural language are closely related, so this assumption is seldom satisfied. The new algorithm proposed in the thesis adopts a concept vector space, in which synonyms are mapped to the same concept, so that similar words are drawn closer together and the class borders become clearer. Besides, it avoids the reduction of a feature's weight caused by the dispersion of the terms used. Consequently, the concept vectors help to improve the accuracy of both the clustering and the training procedures.

Chapter 4 gives the experimental environment and the results, which show that CVSM-SVMs does improve classification accuracy and efficiency. The experimental corpus is drawn from Reuters RCV1-v2, and the concept vectors are extracted using WordNet. Experiments on datasets of different sizes show that CVSM-SVMs performs better than traditional term-based algorithms: accuracy is improved by 2%, and efficiency is also improved owing to the decrease in the number of support vectors.

How to use concepts of different granularity at different clustering levels and how to build better-structured trees will be considered in future research.
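The synonym-folding step at the heart of the concept vector space can be sketched as follows. The tiny synonym table is a hypothetical stand-in for a real WordNet synset lookup; the terms and concept ids are illustrative, not taken from the thesis.

```python
# Sketch of mapping a bag of terms into concept space: synonymous terms are
# folded into one shared concept dimension, so a feature's weight is not
# dispersed across synonyms. CONCEPTS is a toy stand-in for WordNet lookups.

# term -> concept id (in practice derived from WordNet synsets)
CONCEPTS = {
    "car": "vehicle", "automobile": "vehicle", "auto": "vehicle",
    "profit": "gain", "earnings": "gain",
}

def term_counts(tokens):
    """Plain term vector: one dimension per distinct term."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

def to_concept_vector(tokens):
    """Concept vector: synonyms share a dimension; unknown terms keep
    their own dimension."""
    counts = {}
    for t in tokens:
        key = CONCEPTS.get(t, t)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

For a document containing both "car" and "automobile", the term vector spreads the weight over two dimensions while the concept vector concentrates it in one, which is exactly the dispersion effect the thesis aims to avoid.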