Text Classification Research Based On Support Vector Machine

Posted on: 2008-10-11
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2178360245956929
Subject: Computer software and theory
Abstract/Summary:
With the development of information technology, the data and resources on the Internet have grown massively, and more and more information exists in the form of electronic text. To manage and use this distributed mass of information effectively, content-based information retrieval and data mining have gradually become areas of great interest. Text classification is an important foundation for both, and it has become an active research topic in recent years.

In this thesis, we study each step of text classification in depth: preprocessing, text representation, feature selection, classification algorithms and performance evaluation. No standard corpus is available so far for Chinese text classification, so we collect paper titles as a Chinese corpus and use it to run comparison experiments on typical feature selection algorithms and classification algorithms. The results show that the support vector machine gives the best performance among the algorithms compared.

To further improve classification accuracy, we use latent semantic indexing to obtain the latent semantic structure of the original word-document matrix. Comparing experiments with and without latent semantic indexing, we find that the effect of latent semantic indexing is not ideal, because class information is not fully taken into account during the singular value decomposition. To solve this problem, we propose an improved local latent semantic indexing method that uses a support vector machine to generate the local region. Such a local region better represents the latent semantic structure of documents belonging to the same category, thereby improving classification accuracy.

The standard support vector machine is designed for two-class problems and cannot be applied directly to multi-class problems, so the algorithm must be extended. The binary tree method is a common approach to multi-class classification; its key issue is how to construct a reasonable tree structure that maintains high generalization ability. To address this, we construct the binary tree bottom-up, following the construction process of a Huffman tree, so that classes that are easy to separate lie in the upper nodes and a reasonable binary tree structure is obtained.
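The abstract does not give the details of the Huffman-style construction, so the following is only a minimal sketch under stated assumptions, not the thesis's actual algorithm. It assumes each class is summarized by the centroid of its feature vectors and that "hard to separate" can be approximated by a small Euclidean distance between centroids. Like Huffman's procedure, it repeatedly merges two nodes from the bottom up; here the two closest clusters are merged, so well-separated classes end up near the root. The function names `centroid` and `build_class_tree` are hypothetical. In a full classifier, each internal node of the resulting tree would hold one binary support vector machine separating its left and right subtrees; that training step is omitted.

```python
import itertools
import math


def centroid(points):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dim))


def build_class_tree(class_points):
    """Build a binary tree over class labels, bottom-up.

    class_points maps a class label to a list of feature vectors.
    Returns nested 2-tuples; leaves are class labels.
    Assumption: centroid distance stands in for separability.
    """
    # Each cluster is (subtree, centroid, number_of_points).
    clusters = [
        (label, centroid(pts), len(pts))
        for label, pts in sorted(class_points.items())
    ]
    while len(clusters) > 1:
        # Pick the two clusters whose centroids are closest, i.e. the
        # pair assumed hardest to separate; they are merged first and
        # therefore sit deepest in the tree.
        i, j = min(
            itertools.combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        (tree_a, cen_a, n_a), (tree_b, cen_b, n_b) = clusters[i], clusters[j]
        # Size-weighted centroid of the merged cluster.
        merged_centroid = tuple(
            (n_a * x + n_b * y) / (n_a + n_b) for x, y in zip(cen_a, cen_b)
        )
        merged = ((tree_a, tree_b), merged_centroid, n_a + n_b)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy data: classes "a" and "b" are close together, "c" is far away.
toy = {
    "a": [(0.0, 0.0), (0.0, 1.0)],
    "b": [(1.0, 0.0), (1.0, 1.0)],
    "c": [(10.0, 10.0), (10.0, 11.0)],
}
tree = build_class_tree(toy)
print(tree)
```

On the toy data, the distant class "c" is split off at the root and the two nearby classes are separated one level lower, which matches the stated goal of placing easy-to-separate classes in the upper nodes.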
Keywords/Search Tags: machine learning, text classification, support vector machine, latent semantic indexing, multi-class classification, binary tree, Huffman tree