Text Classification Research Based On Support Vector Machine

Posted on:2008-10-11

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

Full Text:PDF

GTID:2178360245956929

Subject:Computer software and theory

Abstract/Summary:

With the development of information technology, the volume of data and resources on the Internet has become massive, and more and more information exists in the form of electronic text. To manage and use this distributed mass of information effectively, content-based information retrieval and data mining have gradually become areas of intense interest. Text classification is an important foundation for both, and it has become an active research topic in recent years.

In this paper, we study every step of text classification in depth: preprocessing, text representation, feature selection, classification algorithms, and performance evaluation. No standard corpus is available so far for Chinese text classification, so we collect paper titles as a Chinese corpus and use it to run comparison experiments on typical feature selection algorithms and classification algorithms. The results show that the support vector machine performs best among the algorithms compared.

To further improve the accuracy of text classification, we use latent semantic indexing to obtain the latent semantic structure of the original word-document matrix. A comparison experiment with and without latent semantic indexing shows that its effect is not ideal, because class information is not fully taken into account during the singular value decomposition. To solve this problem, we propose an improved local latent semantic indexing method that uses a support vector machine to generate the local region. Such a local region better represents the latent semantic structure of documents belonging to the same category, thereby improving classification accuracy.

The standard support vector machine is designed for two-class problems and cannot be applied directly to multi-class problems, so the algorithm must be extended. The binary tree method is a common approach to multi-class classification; its key issue is how to construct a reasonable tree structure that maintains high generalization ability. To address this, we construct the binary tree bottom-up, following the construction process of a Huffman tree, so that the classes that are easy to separate lie in the upper nodes, which yields a reasonable binary tree structure.
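
The bottom-up, Huffman-style tree construction can be illustrated with a minimal sketch. The abstract does not state which separability measure the thesis uses, so this sketch assumes one: the Euclidean distance between class centroids, with the two closest groups (assumed hardest to separate) merged first so that well-separated classes end up near the root. The names `centroid` and `build_tree` are hypothetical, and no SVM is trained here; in a full system, each internal node would hold a binary SVM that separates its left leaves from its right leaves.

```python
from itertools import combinations
from math import dist


def centroid(points):
    """Mean vector of a non-empty list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def build_tree(class_points):
    """Build a binary class tree bottom-up, in the manner of Huffman coding.

    class_points maps each class label to a list of feature vectors.
    ASSUMPTION: separability is approximated by centroid distance; the
    thesis may use a different measure.
    Returns a nested tuple whose leaves are class labels.
    """
    # Each working node is (subtree, all points under that subtree).
    nodes = [(label, list(pts)) for label, pts in class_points.items()]
    while len(nodes) > 1:
        # Merge the pair with the closest centroids first, so the groups
        # that are easiest to separate are joined last, near the root.
        i, j = min(
            combinations(range(len(nodes)), 2),
            key=lambda ij: dist(centroid(nodes[ij[0]][1]),
                                centroid(nodes[ij[1]][1])),
        )
        (left, left_pts), (right, right_pts) = nodes[i], nodes[j]
        merged = ((left, right), left_pts + right_pts)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]
```

For three toy classes where A and B lie close together and C lies far away, `build_tree` returns `('C', ('A', 'B'))`: the root decision splits off the easily separated class C, and the harder A-versus-B decision is pushed one level down.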

Keywords/Search Tags:

machine learning, text classification, support vector machine, latent semantic indexing, multi class classification, binary tree, Huffman tree

Related items

1	The Binary Tree Of Multi-Class Support Vector Machine And The Application Of It In Image Semantic Classification
2	Research On Support Vector Machine Classification Algorithm For Multi-class Texts
3	Research On Text Classification Filtering Technology Based On Latent Semantic Indexing And Support Vector Machine
4	Research On Multi-class Classification Method Of Improved SVM Based On FBT
5	Research On Multi-classification Method Based On Support Vector Machine
6	Analysis And Application For Web Text Classification Based On Support Vector Machine
7	Twin Binary Tree Support Vector Machine Classifiers
8	Research On Text Classification Method Based On Support Vector Machine
9	Research Of Support Vector Machine Based On Sample Cluster: Application To Multi-class SVM Based On Binary Tree
10	Active Learning Based Intelligent Algorithm And Their Application On Pattern Classification