Multi-class Scientific Literature Automatic Categorization System

Posted on: 2009-03-26    Degree: Master    Type: Thesis
Country: China    Candidate: Y Q Chen    Full Text: PDF
GTID: 2178360275972392    Subject: Computer system architecture
Abstract/Summary:
With the development of computer and communication technology, and especially the global popularization of the Internet, text information of all kinds is growing explosively. Modern information society faces the challenge of handling massive volumes of documents (including papers and technical reports), news, email, and so on. With such a huge amount of data available on the Internet, there is a growing need for text categorization to help users manage and utilize it. Text categorization, which assigns text documents to pre-specified categories, plays a key role in organizing massive sources of unstructured text information, for example in filtering spam email, classifying news, and organizing documents.

Many popular algorithms have been applied to text categorization, such as Naïve Bayes, k-Nearest Neighbor (kNN), and Support Vector Machine (SVM). However, these approaches do not perform well in every case. SVM is inherently a binary classifier, so it cannot be applied to multi-class categorization directly, and training it on large datasets takes a long time. kNN tends to misclassify categories with fewer samples, and the value of K is difficult to determine; moreover, the problem of overlapping category borders is not effectively solved. In this paper, we propose an approach named Multi-class SVM-kNN (MSVM-kNN), which combines SVM and kNN: SVM is first used to identify the category borders, and then kNN classifies the documents that fall in the indivisible area near those borders. MSVM-kNN overcomes the shortcomings of SVM and kNN and improves the performance of multi-class text categorization. In addition, we studied dimension reduction and term weighting, and then designed and implemented the Multi-class Automatic Literature Categorization System (MALC) based on these techniques.

We carried out experiments on 20-Newsgroups and on a dataset collected from ACM. The experimental results show that MSVM-kNN performs better than SVM or kNN alone. On the ACM dataset, the precision, recall, and F-measure of MSVM-kNN are 90.18%, 88.79%, and 0.89, compared with 81.64%, 77.78%, and 0.80 for kNN alone and 86.11%, 84.44%, and 0.85 for SVM alone. These results show that the MSVM-kNN approach outperforms the traditional methods.
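The two-stage combination described above can be illustrated with a short sketch. This is a minimal, hypothetical rendering in Python using scikit-learn on 20-Newsgroups, not the MALC implementation itself: the linear SVM, the TF-IDF weighting, the choice of K = 5, and the margin threshold used to delimit the "indivisible area" are all illustrative assumptions.

# A minimal sketch of the MSVM-kNN idea, assuming a scikit-learn environment;
# the margin threshold, K = 5, and max_features are illustrative, not the
# values used in the thesis.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# TF-IDF term weighting with a crude dimension reduction via max_features.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Step 1: a one-vs-rest linear SVM identifies the category borders.
svm = LinearSVC()
svm.fit(X_train, train.target)
scores = svm.decision_function(X_test)          # shape: (n_docs, n_classes)

# Step 2: documents whose top two SVM scores are nearly tied lie in the
# "indivisible area" near a border; those are handed over to kNN.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, train.target)

top_two = np.sort(scores, axis=1)[:, -2:]       # second-best and best score
ambiguous = (top_two[:, 1] - top_two[:, 0]) < 0.1

predictions = svm.predict(X_test)
if ambiguous.any():
    predictions[ambiguous] = knn.predict(X_test[ambiguous])

In this sketch the SVM alone decides the clear-cut documents, so the extra cost of kNN is paid only for the small ambiguous fraction near the borders; the same idea applies to the ACM dataset once its documents are vectorized in the same way.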
Keywords/Search Tags: Automatic Text Categorization, Text Representation, Feature Selection, Support Vector Machine, k-Nearest Neighbor