Research On Text Classification Based On Rough Set

Posted on: 2008-10-23
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2178360215459210
Subject: Computer application technology
Abstract/Summary:
With the widespread use and development of the Internet, electronic texts, which carry over 80 percent of Internet information, have appeared at remarkable speed. As a result, much useful information is buried under useless information, producing the phenomenon of "more abundant information but poorer knowledge". How to organize and manage this overwhelming volume of electronic text, classify it automatically according to its content, and help users obtain useful knowledge and information accurately has become a focus of computer science research. The research also has broad application prospects and practical value.

Rough set theory, proposed by the Polish scientist Z. Pawlak in 1982, is a set theory for dealing with vague and uncertain problems that also establishes the relationship between knowledge and classification. Its main idea is to derive decision or classification rules through knowledge reduction while keeping the classification capability unchanged.
After being introduced into machine learning, artificial intelligence, and related areas in the 1990s, the theory has been applied successfully in fields such as knowledge acquisition, rule extraction, decision analysis, pattern recognition, and data mining.

This thesis studies text classification based on rough set theory. The main work is as follows:

The feature selection methods used in text classification and the TF-IDF weighting formula of the text vector model are studied. The performance of an improved weighting formula, called TF-EDF, which is based on text feature selection methods, is then compared with that of TF-IDF to determine which formula is better suited to weight calculation.

Using a simple equal-distance (equal-width) data discretization method, text classification based on rough set theory yields a series of comprehensible text classification rules.

In studying attribute reduction algorithms based on rough set theory, two kinds of attribute importance measures are integrated: those from text feature selection methods and those from rough set theory. The performance of the various attribute measures is compared in detail in both precise reduction and concise reduction, in order to find an appropriate measure of attribute importance for heuristic attribute reduction.

By using the information from both reductions and increasing the number of reduction passes from one to two, an improved heuristic attribute reduction algorithm is proposed.

The experimental results indicate that attribute reduction based on rough set theory, applied to text classification, can significantly reduce the dimensionality of the text representation and effectively address the "curse of dimensionality" problem, and that the classification rules generated from the reduced attributes achieve an adequate classification accuracy.
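
To make the discretization and reduction steps above concrete, here is a minimal Python sketch. It is not the thesis's algorithm: the abstract gives neither the improved reduction algorithm nor the attribute measures it compares. The sketch assumes the standard rough-set dependency degree (the share of objects in the positive region) as the attribute importance measure, plain greedy forward selection, and two equal-width bins per term. The function names, term weights, and class labels are all hypothetical.

```python
from collections import defaultdict


def equal_width_bins(values, n_bins):
    """Map each value to a bin index in [0, n_bins - 1] using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def dependency(rows, labels, attrs):
    """Dependency degree: fraction of objects whose equivalence class
    (objects with identical values on `attrs`) carries a single label."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    consistent = sum(
        len(block) for block in blocks.values()
        if len({labels[i] for i in block}) == 1
    )
    return consistent / len(rows)


def greedy_reduct(rows, labels):
    """Add the attribute that raises the dependency degree most, until the
    selected subset classifies as well as the full attribute set."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max(
            (a for a in all_attrs if a not in chosen),
            key=lambda a: dependency(rows, labels, chosen + [a]),
        )
        chosen.append(best)
    return chosen


# Hypothetical term weights (one row per document, one column per term).
weights = [
    [0.90, 0.10, 0.30],
    [0.80, 0.20, 0.10],
    [0.10, 0.90, 0.30],
    [0.20, 0.80, 0.10],
    [0.85, 0.90, 0.90],
    [0.15, 0.85, 0.90],
]
labels = ["sports", "sports", "finance", "finance", "finance", "finance"]

# Discretize each term (column) separately, then rebuild the document rows.
binned_columns = [equal_width_bins(list(col), 2) for col in zip(*weights)]
rows = [list(row) for row in zip(*binned_columns)]

print(greedy_reduct(rows, labels))  # -> [1]
```

In this toy table the second term alone separates the two classes, so the three-term description shrinks to one attribute with no loss of classification ability on these six documents.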
Keywords: Text classification, Feature selection, Weight calculation, Rough set, Attribute reduction