The Research Of Text Categorization With Rough Set Based On Extracting Double Features And Heuristic Algorithm Reduction

Posted on: 2009-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: J Liu
GTID: 2178360272455344
Subject: Computer software and theory
Abstract/Summary:
With the development of Internet technology, the information resources available online are growing quickly, and information processing is becoming more and more important. How to organize and manage this vast amount of information efficiently, classify it automatically according to text content, help users obtain useful knowledge rapidly and accurately, and bring order to otherwise disorganized information has become a focus of computer science. Research in this field also has broad application prospects and practical value.

Rough set theory offers the following advantages for text classification. First, it needs no prior information beyond the data set used to solve the problem. Second, it analyzes and processes the problem with mathematical methods. Third, it can obtain the minimal feature set that text classification needs. Fourth, it can reduce the dimensionality of the feature vector without harming classification accuracy. Last but not least, it can produce the simplest rules.

This thesis studies text classification based on rough set theory. The main work is as follows:

1. Feature selection methods are studied. We improve the traditional TF-IDF method by adding a weighting term, and combine the weighted TF-IDF with the CHI (chi-square) method to extract features from the word database. Taking the intersection of the features selected by the two methods as the final result filters out weakly representative features.

2. Heuristic attribute reduction based on rough set theory is studied, and the heuristic reduction algorithm based on the core of the discernibility matrix is improved. The experimental results indicate that using the improved attribute reduction algorithm to reduce the decision table yields better classification results.

3. A text categorization system based on rough set theory is implemented. We categorize test texts with it, and the results are good.
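The feature selection step in item 1 can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not specify the added weighting, so plain TF-IDF stands in for the weighted variant, and the function names (`tfidf_scores`, `chi_scores`, `top_k`, `select_features`), the cutoff `k`, and the toy corpus are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Sum each term's TF-IDF over all documents (plain TF-IDF; the
    thesis's extra weighting is not described, so it is omitted here)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:  # documents are assumed to be non-empty token lists
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf
    return scores


def chi_scores(docs, labels):
    """Chi-square statistic of each term, maximised over the classes."""
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls in set(labels):
            pairs = list(zip(doc_sets, labels))
            a = sum(1 for s, y in pairs if term in s and y == cls)
            b = sum(1 for s, y in pairs if term in s and y != cls)
            c = sum(1 for s, y in pairs if term not in s and y == cls)
            d = n - a - b - c
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - c * b) ** 2 / denom)
        scores[term] = best
    return scores


def top_k(scores, k):
    """The k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:k]}


def select_features(docs, labels, k):
    """Keep only terms ranked in the top k by BOTH TF-IDF and CHI."""
    return top_k(tfidf_scores(docs), k) & top_k(chi_scores(docs, labels), k)


# Hypothetical toy corpus: two "sport" and two "finance" documents.
docs = [
    ["goal", "match", "team", "the"],
    ["team", "goal", "the", "score"],
    ["stock", "market", "the", "price"],
    ["market", "price", "the", "trade"],
]
labels = ["sport", "sport", "finance", "finance"]
print(sorted(select_features(docs, labels, 8)))
```

In this toy corpus the word "the" occurs in every document, so both its IDF and its chi-square statistic are zero and it falls out of the intersection. A term that scores well under only one of the two criteria is dropped in the same way.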
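The attribute reduction in item 2 can be illustrated with a generic core-plus-greedy scheme on the discernibility matrix. The abstract does not describe the thesis's specific improvement, so the heuristic below (repeatedly add the attribute that occurs most often in the still-uncovered matrix entries) is an assumed, commonly used stand-in; the function names and the toy decision table are likewise hypothetical.

```python
from collections import Counter
from itertools import combinations


def discernibility_matrix(rows, decisions):
    """For every pair of objects with different decisions, collect the set
    of condition attributes on which the two objects differ."""
    entries = []
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] != decisions[j]:
            diff = frozenset(a for a in rows[i] if rows[i][a] != rows[j][a])
            if diff:  # an empty entry marks an inconsistent pair; skip it
                entries.append(diff)
    return entries


def core(entries):
    """Attributes that appear alone in some entry cannot be removed."""
    return {next(iter(e)) for e in entries if len(e) == 1}


def heuristic_reduct(rows, decisions):
    """Start from the core, then greedily add the attribute that occurs
    most often in the entries not yet covered (assumed heuristic)."""
    entries = discernibility_matrix(rows, decisions)
    reduct = core(entries)
    remaining = [e for e in entries if not (e & reduct)]
    while remaining:
        freq = Counter(a for e in remaining for a in e)
        best = min(freq, key=lambda a: (-freq[a], a))  # ties: alphabetical
        reduct.add(best)
        remaining = [e for e in remaining if best not in e]
    return reduct


# Hypothetical decision table: binary term-presence features per document.
rows = [
    {"goal": 1, "market": 0, "the": 1},
    {"goal": 1, "market": 0, "the": 1},
    {"goal": 0, "market": 1, "the": 1},
    {"goal": 0, "market": 0, "the": 1},
]
decisions = ["sport", "sport", "finance", "finance"]
print(heuristic_reduct(rows, decisions))
```

Here the first and last documents differ only on "goal", so "goal" is in the core and already separates every pair of documents with different classes; "market" and "the" are redundant and are left out of the reduct. A greedy reduct of this kind covers every matrix entry but is not guaranteed to be the smallest possible one.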
Keywords/Search Tags: Text categorization, Rough set, Feature extraction, Heuristic, Attribute approximation