
Research Of Text Categorization Algorithm Based On Rough Set Theory

Posted on: 2008-10-16, Degree: Master, Type: Thesis
Country: China, Candidate: H J Li, Full Text: PDF
GTID: 2178360215979364, Subject: Computer application technology
Abstract/Summary:
With the rapid development of Internet technology, information processing has become an indispensable tool for obtaining useful information. Text categorization is an important research field whose goal is to assign one or more suitable classes to a text by analyzing its content. Many methods have been applied to this field, such as SVM, KNN, Naive Bayes, and Decision Tree. Compared with these methods, the method based on rough sets has the following advantages. It needs no prior-probability information beyond the data set used to solve the problem. It provides a formal model that gives knowledge an explicit meaning in terms of data, so that knowledge can be analyzed and processed by mathematical methods. It can find a minimal feature set and so reduce the dimensionality of the feature vector without affecting categorization accuracy. It can also produce the simplest rules. Among the other methods, some cannot produce explicitly expressed rules, such as KNN and Naive Bayes, while others, such as Decision Tree, produce many redundant rules.

This thesis addresses the text categorization task using rough set theory. First, texts are preprocessed, including word segmentation, word-frequency counting, and stop-word removal, and characteristic words are then selected with the TF-IDF function. Second, the classification knowledge is represented as a decision table, with the characteristic words as condition attributes, their weights as the attribute values, and the text classes as the decision attribute. Third, decision rules are generated through attribute reduction. Finally, test texts are categorized with the resulting rules to validate the approach.

The experimental results indicate the effectiveness of the approach. It not only reduces the dimensionality of the feature vectors but also improves precision and recall.
Keywords/Search Tags: Text classification, Feature selection, Rough set, Attribute reduction, Decision rules
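
The decision-table, attribute-reduction, and rule-generation steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm. The four documents, the words, and the "high"/"low" discretization of the weights are invented for illustration, the function names are hypothetical, and the greedy search on the rough-set dependency degree is one common heuristic for finding a reduct, which the abstract does not specify.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Share of rows in the positive region: blocks whose rows share one class."""
    consistent = sum(len(b) for b in partition(rows, attrs)
                     if len({labels[i] for i in b}) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, labels, attrs):
    """Greedy heuristic (an assumption): add the attribute that raises the
    dependency degree most, then drop attributes that turn out redundant."""
    target = dependency(rows, labels, attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max((a for a in attrs if a not in chosen),
                   key=lambda a: dependency(rows, labels, chosen + [a]))
        chosen.append(best)
    for a in list(chosen):
        rest = [x for x in chosen if x != a]
        if dependency(rows, labels, rest) == target:
            chosen = rest
    return chosen

def extract_rules(rows, labels, attrs):
    """One rule per consistent equivalence class: attribute values -> class."""
    rules = {}
    for block in partition(rows, attrs):
        classes = {labels[i] for i in block}
        if len(classes) == 1:
            rules[tuple(rows[block[0]][a] for a in attrs)] = classes.pop()
    return rules

def classify(doc, attrs, rules):
    """Return the class of the matching rule, or None if no rule applies."""
    return rules.get(tuple(doc[a] for a in attrs))

# Toy decision table: characteristic words are the condition attributes,
# discretized TF-IDF weights are the values (all data here is hypothetical).
words = ["goal", "match", "stock", "market"]
table = [
    {"goal": "high", "match": "high", "stock": "low",  "market": "low"},
    {"goal": "high", "match": "low",  "stock": "low",  "market": "low"},
    {"goal": "low",  "match": "low",  "stock": "high", "market": "high"},
    {"goal": "low",  "match": "high", "stock": "high", "market": "low"},
]
classes = ["sports", "sports", "finance", "finance"]

reduct = greedy_reduct(table, classes, words)
learned = extract_rules(table, classes, reduct)
test_doc = {"goal": "high", "match": "low", "stock": "low", "market": "high"}
predicted = classify(test_doc, reduct, learned)
```

On this toy table a single word is enough to separate the two classes, so the reduct shrinks four attributes to one and the rules read directly as "if goal is high then sports". A greedy search is not guaranteed to return the smallest possible reduct on larger tables.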