Multi-class Scientific Literature Automatic Categorization System

Posted on: 2009-03-26    Degree: Master    Type: Thesis
Country: China    Candidate: Y Q Chen    Full Text: PDF
GTID: 2178360275972392    Subject: Computer system architecture
Abstract/Summary:
With the development of computer and communication technology, and especially the global popularization of the Internet, text information of all kinds is growing explosively. Modern information society faces the challenge of handling massive volumes of documents (including papers and technical reports), news, email, and so on. With such a huge amount of data available on the Internet, there is a growing need for text categorization to help users manage and utilize it. Text categorization, which assigns text documents to pre-specified categories, plays a key role in organizing massive sources of unstructured text information, for example in filtering spam email, classifying news, and organizing documents.

Many popular algorithms have been applied to text categorization, such as Naïve Bayes, k-Nearest Neighbor (kNN), and Support Vector Machine (SVM). However, these approaches do not perform well in every case. SVM is inherently a binary classifier, so it cannot be applied to multi-class categorization directly, and training it on large datasets takes a long time. kNN tends to misclassify categories with fewer samples, and the value of K is difficult to determine; moreover, the problem of overlapping category borders is not effectively solved. In this paper, we propose an approach named Multi-class SVM-kNN (MSVM-kNN), which combines SVM and kNN: SVM is first used to identify the category borders, and then kNN classifies the documents that fall in the indivisible area near those borders. MSVM-kNN overcomes the shortcomings of SVM and kNN and improves the performance of multi-class text categorization. In addition, we studied dimension reduction and term weighting, and then designed and implemented the Multi-class Automatic Literature Categorization System (MALC) based on these techniques.

We carried out experiments on 20-Newsgroups and on a dataset collected from ACM. The experimental results show that MSVM-kNN performs better than SVM or kNN alone. On the ACM dataset, the precision, recall, and F-measure of MSVM-kNN are 90.18%, 88.79%, and 0.89, compared with 81.64%, 77.78%, and 0.80 for kNN alone and 86.11%, 84.44%, and 0.85 for SVM alone. These results show that the MSVM-kNN approach outperforms the traditional methods.
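The two-stage combination described above can be illustrated with a short sketch. This is a minimal, hypothetical rendering in Python using scikit-learn on 20-Newsgroups, not the MALC implementation itself: the linear SVM, the TF-IDF weighting, the choice of K = 5, and the margin threshold used to delimit the "indivisible area" are all illustrative assumptions.

# A minimal sketch of the MSVM-kNN idea, assuming a scikit-learn environment;
# the margin threshold, K = 5, and max_features are illustrative, not the
# values used in the thesis.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# TF-IDF term weighting with a crude dimension reduction via max_features.
train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Step 1: a one-vs-rest linear SVM identifies the category borders.
svm = LinearSVC()
svm.fit(X_train, train.target)
scores = svm.decision_function(X_test)          # shape: (n_docs, n_classes)

# Step 2: documents whose top two SVM scores are nearly tied lie in the
# "indivisible area" near a border; those are handed over to kNN.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, train.target)

top_two = np.sort(scores, axis=1)[:, -2:]       # second-best and best score
ambiguous = (top_two[:, 1] - top_two[:, 0]) < 0.1

predictions = svm.predict(X_test)
if ambiguous.any():
    predictions[ambiguous] = knn.predict(X_test[ambiguous])

In this sketch the SVM alone decides the clear-cut documents, so the extra cost of kNN is paid only for the small ambiguous fraction near the borders; the same idea applies to the ACM dataset once its documents are vectorized in the same way.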
Keywords/Search Tags: Automatic Text Categorization, Text Representation, Feature Selection, Support Vector Machine, k-Nearest Neighbor