
Text Categorization Based On Rough Set Theory

Posted on: 2012-05-16 | Degree: Master | Type: Thesis
Country: China | Candidate: X Xu | Full Text: PDF
GTID: 2218330368497577 | Subject: Signal and Information Processing
Abstract/Summary:
In recent years, with the development of the Internet and information technology, the amount of available information has grown ceaselessly, making it increasingly urgent to find ways to manage and access information effectively and easily. Text categorization is a good solution to this problem and is a key topic in areas such as information retrieval and data mining. Many methods have been applied to text categorization, for example KNN, Naïve Bayes, decision trees, and SVM.

Rough set theory, proposed by Pawlak in 1982, is a powerful tool for dealing with imprecise or incomplete information in attribute dependence analysis, knowledge reduction, and decision rule extraction. It has the following advantages in text categorization: first, it requires no prior probability information beyond the data set used to solve the problem; second, it can reduce the dimensionality of the feature vector and produce classification rules in explicit form without affecting categorization accuracy.

Feature weighting is an important problem in text categorization. To compute feature weights, we analyze the characteristics of rough set theory and TFIDF and propose a feature weighting scheme for text categorization based on rough set theory in this thesis. In rough set theory, approximation quality and approximation accuracy reflect the importance of a feature from a global perspective, so they can be introduced into the weight of a feature word. However, if the weighting formula contains only these two parameters, information about the feature within a single text is ignored. TFIDF accounts for the frequency of a feature word and its distribution over the whole example space, so the frequency of the feature is also introduced into the feature weight.
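
The abstract does not state the weighting formula itself, so the following Python sketch only illustrates the idea under stated assumptions. `approximation_quality` uses the standard rough set definition (the share of objects in the positive region) for a single condition attribute. The function names and the multiplicative combination `tfidf * (1 + gamma)` are hypothetical choices made for illustration, not the thesis's scheme.

```python
import math
from collections import defaultdict


def approximation_quality(values, labels):
    """Return gamma = |POS| / |U| for one condition attribute.

    values[i] is the attribute value of document i (e.g. 1 if the feature
    word occurs, 0 otherwise); labels[i] is its category.
    """
    # Group documents into equivalence classes by attribute value and
    # record which categories appear in each class.
    classes = defaultdict(set)
    for value, label in zip(values, labels):
        classes[value].add(label)
    # A document belongs to the positive region when every document that
    # shares its attribute value also shares its category.
    in_positive_region = sum(1 for v in values if len(classes[v]) == 1)
    return in_positive_region / len(values)


def tf_idf(tf, df, n_docs):
    """Plain TF-IDF: term frequency times log inverse document frequency."""
    return tf * math.log(n_docs / df)


def combined_weight(tf, df, n_docs, gamma):
    """ASSUMPTION: scale the local TF-IDF weight by the global rough set
    measure. The thesis's actual formula is not given in the abstract."""
    return tf_idf(tf, df, n_docs) * (1.0 + gamma)


labels = ["sports", "sports", "politics", "politics"]
gamma_goal = approximation_quality([1, 1, 0, 0], labels)  # separates the classes
gamma_the = approximation_quality([1, 1, 1, 1], labels)   # occurs everywhere
```

In this toy data a word that occurs only in the sports documents has approximation quality 1, while a word that occurs in every document has approximation quality 0, which is the global signal the abstract refers to.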
The weighting formula combines the advantages of TFIDF and rough set theory.

In most cases, the rules induced by rough set reduction are not adequate as rules for classifying test texts. There are many reasons for this; the main one is that test texts vary widely, so it is difficult to obtain a comprehensive rule set. After analyzing complete matching and partial matching, we propose a new partial matching method based on feature weight, which combines the idea of partial matching with feature weighting. Experiments show that this method improves both the likelihood that a basic decision rule matches and the correctness of the match.

Finally, we summarize the contributions and limitations of this work and outline future research.
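
The weight-based partial matching described above might be sketched as follows. The abstract gives no details of the matching rule, so everything here is an assumption: a rule is modeled as a set of feature words plus a category, the score is the weighted fraction of the rule's conditions found in the text, and the names `match_score`, `classify`, and the `threshold` parameter are hypothetical.

```python
def match_score(rule_features, doc_features, weights):
    """Weighted fraction of a rule's conditions that the document satisfies.

    Complete matching corresponds to a score of 1.0; partial matching
    accepts lower scores.
    """
    total = sum(weights.get(f, 0.0) for f in rule_features)
    if total == 0.0:
        return 0.0
    matched = sum(weights.get(f, 0.0) for f in rule_features if f in doc_features)
    return matched / total


def classify(doc_features, rules, weights, threshold=0.5):
    """Return the category of the best-scoring rule, or None if no rule
    reaches the (assumed) threshold."""
    best_label, best_score = None, 0.0
    for rule_features, label in rules:
        score = match_score(rule_features, doc_features, weights)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None


# Toy feature weights and rules, invented for illustration only.
weights = {"goal": 0.9, "match": 0.6, "referee": 0.3, "vote": 0.8, "senate": 0.7}
rules = [
    ({"goal", "match", "referee"}, "sports"),
    ({"vote", "senate"}, "politics"),
]

# "referee" is missing, so complete matching would fail for this text,
# but the two heavier conditions still carry most of the rule's weight.
label = classify({"goal", "match", "crowd"}, rules, weights)
```

Weighting the conditions means that missing a low-weight feature word costs a rule less than missing a high-weight one, which is one plausible way to combine partial matching with feature weight.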
Keywords/Search Tags: Text Categorization, Rough set theory, Feature weight, Matching rule