The Research Of Text Categorization Based On Rough Set

Posted on: 2006-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: J L Lu
Full Text: PDF
GTID: 2168360155956973
Subject: Computer applications
Abstract/Summary:
With the rapid development of Internet technology, information processing has become an indispensable tool for obtaining useful information. Text categorization is an important research field whose goal is to assign one or more suitable classes to a text based on an analysis of its content. Many methods have been applied to this field, such as SVM, KNN, Naive Bayes, and Decision Tree. Compared with these methods, the method based on rough sets has the following advantages: it needs no prior-probability information beyond the data sets used for solving the problem; it includes a formal model, which gives knowledge a clear meaning in terms of data and allows it to be analyzed and processed mathematically; it can obtain minimal feature sets; it can reduce the dimension of the feature vector without affecting text categorization accuracy; and it can produce the simplest rules. Among the other methods, some cannot produce explicitly expressed rules, such as KNN and Naive Bayes, and some produce too many redundant rules, such as Decision Tree.

This thesis carries out the text categorization task using the well-developed reduction theory of rough sets. The main work is as follows:

I. Preprocessed the documents, including word segmentation, part-of-speech tagging, frequency statistics, and position marking;
II. Employed the double comparing method, which is widely used in the policy-making area, to extract features. The double comparing method simplified the feature extraction algorithm and increased its precision. It is also an innovation of this thesis;
III. Took into account the influence of term position and of inverse document frequency within the same class and across different classes, improved the Okapi term weighting formula, and separated the term weights;
IV. Used attribute reduction and relative reduction, according to the boundary between categories, to reduce the dimension of the feature vectors, which is the key task of this thesis;...
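The attribute reduction idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it assumes features have already been discretized to 0/1, uses an exhaustive search that is only practical for a handful of attributes, and all names (`partition`, `positive_region`, `minimal_reducts`, `TERMS`, `ROWS`, `LABELS`) and the toy data are hypothetical. A subset of attributes counts as a reduct here if it preserves the positive region of the full attribute set, that is, the set of documents whose class is determined unambiguously by their attribute values.

```python
from itertools import combinations


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = {}
    for i, row in enumerate(rows):
        blocks.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return list(blocks.values())


def positive_region(rows, labels, attrs):
    """Indices of rows whose equivalence class carries a single label."""
    pos = set()
    for block in partition(rows, attrs):
        if len({labels[i] for i in block}) == 1:
            pos.update(block)
    return pos


def minimal_reducts(rows, labels):
    """Smallest attribute subsets that keep the full positive region.

    Exhaustive search, for illustration only (exponential in attribute count).
    """
    all_attrs = list(range(len(rows[0])))
    target = positive_region(rows, labels, all_attrs)
    for size in range(1, len(all_attrs) + 1):
        found = [set(c) for c in combinations(all_attrs, size)
                 if positive_region(rows, labels, c) == target]
        if found:
            return found
    return [set(all_attrs)]


# Hypothetical toy decision table: one row per document, one 0/1 column per term.
TERMS = ["goal", "match", "stock", "bank"]
ROWS = [
    (1, 1, 0, 0),
    (1, 0, 0, 0),
    (0, 1, 0, 1),
    (0, 0, 1, 1),
    (0, 0, 1, 0),
    (1, 0, 1, 1),
]
LABELS = ["sport", "sport", "sport", "finance", "finance", "finance"]

if __name__ == "__main__":
    for reduct in minimal_reducts(ROWS, LABELS):
        print([TERMS[a] for a in sorted(reduct)])
```

On this toy table the single column for "stock" already separates the two classes, so the four-dimensional feature vector reduces to one dimension without changing which documents can be classified consistently.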
Keywords/Search Tags: Text Categorization, Double Comparing, Rough Set, Attributes Reduction, Relative Reduction