Study On Multi-class Text Classification Based On Support Vector Machines

Posted on: 2011-09-10
Degree: Master
Type: Thesis
Country: China
Candidate: L Chen
Full Text: PDF
GTID: 2178360308459019
Subject: Computer system architecture
Abstract/Summary:
With the rapid development of computer and information technology, the number of electronic documents on the Internet is growing quickly. Mining this abundant text information, and in particular classifying documents correctly into predefined semantic categories, has become an important problem in organizing and managing text. This task is called text classification and is commonly considered a key task of text mining. The support vector machine (SVM) is a pattern recognition method developed by Vapnik in the mid-1990s on the basis of statistical learning theory. It approaches machine learning through optimization. SVM is characterized by the use of convex optimization, the sparseness of the solution, a maximal-margin hyperplane, Mercer's theorem, the theory of kernels, and the absence of local minima. Because SVM possesses the advantages of a simple structure, global optimization, and good generalization ability, it has been widely studied and applied to pattern recognition and text classification. SVM was originally designed for binary classification, but practical problems such as text classification require multi-category classification. How to extend SVM effectively to multi-category classification and apply it to text classification is the key research issue of this paper.

In this paper, the concepts, process, and methods of text mining are introduced first. Representative existing multi-class SVM algorithms are then surveyed and analyzed in depth, and their advantages and disadvantages are compared. DDAGSVM (decision directed acyclic graph support vector machine) is an important and widely used multi-class SVM method. We propose strategies for improving the DDAGSVM algorithm and, as the central task, apply the improved method to text classification in text mining.
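As a rough illustration of how a decision DAG turns binary classifiers into a multi-class decision, the sketch below walks a candidate list: the first and last remaining classes are compared, the loser is eliminated, and the process repeats until one class is left. With k classes this uses k(k-1)/2 pairwise classifiers but evaluates only k-1 of them per sample. The names `ddag_predict`, `nearest_of`, and the toy class labels are assumptions made for this example, and the one-dimensional "nearest center" rule merely stands in for a trained binary SVM; it is not the implementation used in the thesis.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

# A pairwise decider takes a sample and returns the winning class of its pair.
PairwiseDecider = Callable[[Sequence[float]], Hashable]


def ddag_predict(
    x: Sequence[float],
    class_order: Sequence[Hashable],
    deciders: Dict[Tuple[Hashable, Hashable], PairwiseDecider],
) -> Hashable:
    """Walk a DDAG: test first vs. last candidate, drop the loser, repeat."""
    candidates = list(class_order)
    while len(candidates) > 1:
        first, last = candidates[0], candidates[-1]
        # Each unordered pair is stored once; accept either key orientation.
        decide = deciders.get((first, last)) or deciders[(last, first)]
        if decide(x) == first:
            candidates.pop()     # 'last' is eliminated
        else:
            candidates.pop(0)    # 'first' is eliminated
    return candidates[0]


def nearest_of(pair_centers: Dict[Hashable, float]) -> PairwiseDecider:
    """Hypothetical stand-in for a binary SVM: pick the closer 1-D center."""
    def decide(x: Sequence[float]) -> Hashable:
        return min(pair_centers, key=lambda c: abs(x[0] - pair_centers[c]))
    return decide


# Toy setup (assumed data): three classes with one-dimensional centers.
centers = {"sports": 0.0, "finance": 5.0, "science": 10.0}
labels = list(centers)
toy_deciders = {
    (a, b): nearest_of({a: centers[a], b: centers[b]})
    for i, a in enumerate(labels)
    for b in labels[i + 1:]
}

prediction = ddag_predict([4.2], labels, toy_deciders)
```

The order of `class_order` determines which pairwise classifier sits at the root of the graph, which is why the choice of ordering can affect the result.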
The main work and contributions are as follows:
① An overview of the theories and techniques of text mining is presented. The paper focuses on text classification and analyzes in depth the basic theory and algorithms of SVM, along with the main advances in its development. Methods for applying SVM to multi-category problems are then reviewed, including "one-against-one", DDAGSVM, "one-against-all", M-ary SVM, and binary-tree multi-category SVM. Their properties are analyzed, and their merits and demerits are compared.
② DDAGSVM for multi-category classification is studied further. In the original DDAGSVM, the decision sequence is formed at random. To address this, the paper proposes an improvement strategy for DDAGSVM based on introducing an intra-class degree of dispersion. An inter-class inseparability measure is defined from the distribution of the training samples and used to form the separating sequence of the classes. This yields an improved DDAGSVM algorithm with a greater classification distance.
③ Multi-category data sets from UCI are selected for the text classification experiments. The experimental results show that the improved algorithm achieves higher multi-category classification accuracy than the original DDAGSVM.
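The abstract does not give the formulas behind the dispersion and inseparability measures, so the following is only a hypothetical sketch of the general idea. It assumes dispersion is the mean Euclidean distance of a class's samples to their centroid, and that the inseparability of two classes is the sum of their dispersions divided by the distance between their centroids. The function names, the formulas, and the rule of placing the most separable pair at the two ends of the class order (so that it is tested first in a DDAG) are all assumptions for illustration, not the thesis's definitions.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def dispersion(points):
    """Assumed within-class dispersion: mean distance to the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def inseparability(a, b):
    """Assumed between-class measure: larger means harder to separate."""
    gap = math.dist(centroid(a), centroid(b))
    return (dispersion(a) + dispersion(b)) / gap if gap > 0 else math.inf


def root_first_order(samples):
    """Put the most separable pair at the two ends of the class order."""
    root = min(
        combinations(samples, 2),
        key=lambda pair: inseparability(samples[pair[0]], samples[pair[1]]),
    )
    middle = [c for c in samples if c not in root]
    return [root[0], *middle, root[1]]


# Assumed toy data: classes "a" and "c" are far apart, "b" sits next to "a".
samples = {
    "a": [(0.0, 0.0), (0.0, 2.0)],
    "c": [(10.0, 0.0), (10.0, 2.0)],
    "b": [(1.0, 0.0), (1.0, 2.0)],
}
order = root_first_order(samples)
```

Here the pair ("a", "c") has the lowest inseparability, so the two classes end up at opposite ends of `order` and would be compared first; the remaining classes keep their input order in this simplified version.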
Keywords/Search Tags: Text Mining, Text Classifier, Support Vector Machine, Decision Directed Acyclic Graph, Multi-category Classification