The Study Of Text Classification Based On Support Vector Machine

Posted on: 2012-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: G J Wu
GTID: 2218330338970787
Subject: Computer software and theory
Abstract/Summary:
As the Information Age advances, more and more people depend on the Internet. Millions of pieces of electronic information are generated every day, and classifying that information effectively has become a major challenge. Data mining techniques offer a way to address this problem, and automatic text categorization is one of their important branches. As understanding of text classification has grown, more and more methods have been applied to it, such as naive Bayes, k-nearest neighbor, and maximum entropy. The support vector machine (SVM), a statistical learning method first proposed by Vapnik, is currently attracting growing research interest. SVM inherits many of the strengths of statistical methods in machine learning and performs strongly on small-sample, nonlinear problems. The traditional SVM, however, mainly solves two-class problems. Many scholars are studying how to extend it to multi-class classification, which is also the focus of this thesis.

After introducing the relevant techniques of text mining and text classification, the thesis reviews several methods for constructing multi-class SVMs, analyzes their merits and demerits, and then proposes an improved multi-class SVM, which experiments show to perform well. The thesis focuses on the following aspects:

① An introduction to the relevant techniques of text mining and text classification, including text preprocessing, text representation, feature extraction, and feature weight computation. The thesis represents texts with the vector space model (VSM) and computes feature weights with the TF-IDF formula.
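The VSM representation with TF-IDF weights can be illustrated with a minimal sketch. The abstract does not state which TF-IDF variant the thesis uses, so the formula below (relative term frequency times unsmoothed log inverse document frequency) is an assumption, and the function name `tf_idf` and the toy corpus are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / len(doc)              # relative term frequency
            idf = math.log(n_docs / df[term])  # unsmoothed IDF (assumed variant)
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


# Toy corpus, invented for illustration only.
docs = [
    ["svm", "text", "classification"],
    ["text", "mining", "text"],
    ["binary", "tree", "svm"],
]
vectors = tf_idf(docs)
for vec in vectors:
    print(vec)
```

Each document becomes a sparse vector over the vocabulary. A term that appears in fewer documents, such as "classification" here, receives a higher weight than an equally frequent term that is spread across the corpus, such as "svm".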
After covering these basics, the thesis presents several common text classification methods: naive Bayes, class-centroid vector, k-nearest neighbor, maximum entropy, and support vector machine (SVM). It also analyzes and compares their advantages and disadvantages.

② An introduction to the theory of support vector machines, analyzing and explaining the advantages of SVM as a statistical method applied to machine learning. The thesis then introduces the basic principles and common techniques of applying SVM to classification, including vector space mapping and kernel function selection. Finally, it concentrates on four common methods for constructing multi-class SVMs: one-against-one, one-against-rest, decision directed acyclic graph, and binary tree. It analyzes the classification performance of each and argues that the binary tree method performs best of the four.

③ The core of the thesis is an improved binary-tree multi-class SVM method. Before presenting the improved algorithm, the thesis introduces two typical binary tree generation algorithms: the partial (skewed) binary tree and the full or approximately full binary tree. A comparison shows that each has its own strengths in model training, classification accuracy, and classification efficiency. After analyzing these two structures, the thesis introduces the improved binary tree generation algorithm, which constructs a tree whose structure is close to the sample distribution.
The improved algorithm not only raises classification accuracy but also improves classification efficiency, because the overall structure of the resulting tree is close to a full binary tree. Finally, a worked example shows that the improved algorithm outperforms the other two.

④ The other core part of the thesis is the experiments, which fall into two parts. The first uses numerical data from the UCI database and compares three algorithms (the improved algorithm, the skewed binary tree algorithm, and the full binary tree algorithm) on classification accuracy and model training time; the results confirm that the improved algorithm achieves the expected effect. The second applies the improved algorithm to text classification, using data collected from a web portal with predefined categories, and compares it with naive Bayes, k-nearest neighbor, Rocchio, the skewed binary tree, and the full binary tree algorithms. The thesis concludes that the improved algorithm improves classification performance.
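The efficiency argument for binary-tree multi-class SVMs can be made concrete with a small sketch. It builds only the tree shapes over class labels; each internal node stands for one binary SVM, which is not trained here. This is not the thesis's improved algorithm, whose details the abstract does not give: the two builders follow the generic skewed and approximately full layouts, and the class names, `Node` type, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Tree node; an internal node stands for one binary SVM (not trained here)."""
    classes: List[str]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def skewed_tree(classes: List[str]) -> Node:
    """Skewed (partial) tree: each node separates one class from all the rest."""
    node = Node(list(classes))
    if len(classes) > 1:
        node.left = Node(classes[:1])
        node.right = skewed_tree(classes[1:])
    return node


def balanced_tree(classes: List[str]) -> Node:
    """Approximately full tree: each node splits its classes into two halves."""
    node = Node(list(classes))
    if len(classes) > 1:
        mid = len(classes) // 2
        node.left = balanced_tree(classes[:mid])
        node.right = balanced_tree(classes[mid:])
    return node


def count_classifiers(node: Node) -> int:
    """Number of internal nodes, i.e. binary classifiers to train."""
    if node.left is None:
        return 0
    return 1 + count_classifiers(node.left) + count_classifiers(node.right)


def max_decisions(node: Node) -> int:
    """Worst-case number of binary decisions needed to reach a single class."""
    if node.left is None:
        return 0
    return 1 + max(max_decisions(node.left), max_decisions(node.right))


# Hypothetical category names, for illustration only.
classes = ["sports", "finance", "tech", "health", "travel", "auto", "games", "news"]
skewed = skewed_tree(classes)
balanced = balanced_tree(classes)
print(count_classifiers(skewed), max_decisions(skewed))      # 7 7
print(count_classifiers(balanced), max_decisions(balanced))  # 7 3
```

Both layouts need k - 1 binary classifiers for k classes, but the approximately full tree reaches a decision in about log2(k) steps, whereas the skewed tree can need k - 1. Which classes are grouped at each node also affects accuracy, and the halving rule used here ignores the sample distribution entirely.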
Keywords/Search Tags: Text classification, Support vector machine, Binary-tree multi-class SVM, Text preprocessing, Ball structure