Research On Text Classification Based On Rough Set

Posted on:2008-10-23

Degree:Master

Type:Thesis

Country:China

Candidate:Y Liu

Full Text:PDF

GTID:2178360215459210

Subject:Computer application technology

Abstract/Summary:

With the widespread use and development of the Internet, electronic texts, which carry over 80 percent of Internet information, have appeared at a remarkable speed. As a result, much useful information is buried under useless information, producing the phenomenon of "more abundant information but poorer knowledge". How to organize and manage this very large volume of electronic text, classify it automatically according to its content, and help users obtain useful knowledge and information accurately has become a focus of computer science research; the research also has broad application prospects and practical value.

Rough set theory, proposed by the Polish scientist Z. Pawlak in 1982, is a set theory that not only deals with vague and uncertain problems but also establishes the relationship between knowledge and classification. Its main idea is to derive decision or classification rules through knowledge reduction while keeping the classification capability unchanged.
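The idea of reduction under unchanged classification capability can be illustrated with a minimal sketch. It assumes the usual rough set notions of indiscernibility classes and the dependency degree (size of the positive region divided by the number of objects); the toy decision table, the attribute names, and the function names are hypothetical and are not taken from the thesis.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into indiscernibility classes over the given attributes."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, attrs, decision):
    """Size of the positive region divided by the number of rows.

    A class belongs to the positive region when all of its rows
    share a single decision value.
    """
    positive = 0
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

# Hypothetical toy decision table: term-presence attributes plus a class label.
rows = [
    {"goal": 1, "match": 1, "stock": 0, "label": "sport"},
    {"goal": 1, "match": 0, "stock": 0, "label": "sport"},
    {"goal": 0, "match": 0, "stock": 1, "label": "finance"},
    {"goal": 0, "match": 1, "stock": 1, "label": "finance"},
]

full = dependency(rows, ["goal", "match", "stock"], "label")
reduced = dependency(rows, ["goal"], "label")
match_only = dependency(rows, ["match"], "label")
```

Here `full` and `reduced` are equal, so the attribute `goal` alone classifies the toy table as well as all three attributes do, while `match` alone cannot separate the two classes.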
After being introduced into machine learning, artificial intelligence, and related areas in the 1990s, the theory has been applied successfully in fields such as knowledge acquisition, rule extraction, decision analysis, pattern recognition, and data mining.

This thesis studies text classification based on rough set theory. The main work is as follows:

The feature selection methods used in text classification and the TF-IDF weighting formula of the text vector model are studied. The thesis then compares the performance of an improved weighting formula, called TF-EDF, which is based on text feature selection methods, in order to choose the formula better suited to weight calculation.

Using a simple equal-distance data discretization method, text classification based on rough set theory yields a series of comprehensible text classification rules.

In studying attribute reduction algorithms based on rough set theory, two kinds of attribute importance measures are integrated: those from text feature selection methods and those from rough set theory. The thesis compares in detail the performance of the various attribute measures in both precision reduction and concise reduction, in order to find an appropriate measure of attribute importance for heuristic attribute reduction.

By using reduction information and changing the number of reduction passes from one to two, an improved heuristic attribute reduction algorithm is put forward.

The experimental results indicate that attribute reduction based on rough set theory, applied to text classification, can significantly reduce the dimensionality of the text representation, effectively solve the "curse of dimensionality" ("higher dimensional disaster") problem, and give an appropriate classification accuracy for the text classification rules generated from the reduced attributes.
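The pipeline described above (term weighting, equal-distance discretization, heuristic attribute reduction) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the classic TF-IDF formula rather than TF-EDF, whose definition is not given in the abstract; it uses the rough set dependency degree as the only attribute importance measure; and it performs a plain single-pass greedy reduction, not the improved two-pass algorithm of the thesis. The documents, labels, and function names are hypothetical.

```python
import math
from collections import defaultdict

def tf_idf(docs):
    """Classic TF-IDF: raw term count times log(N / document frequency)."""
    n = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for term in set(doc):
            df[term] += 1
    vocab = sorted(df)
    weights = [[doc.count(t) * math.log(n / df[t]) for t in vocab] for doc in docs]
    return vocab, weights

def equal_width(column, bins=2):
    """Map each value to the index of one of `bins` equal-width intervals."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0] * len(column)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in column]

def dependency(table, attrs, labels):
    """Share of rows whose attribute pattern determines the label uniquely."""
    seen = defaultdict(set)
    for row, label in zip(table, labels):
        seen[tuple(row[a] for a in attrs)].add(label)
    consistent = sum(1 for row in table
                     if len(seen[tuple(row[a] for a in attrs)]) == 1)
    return consistent / len(table)

def greedy_reduct(table, labels):
    """Add the attribute with the highest resulting dependency until the
    dependency of the full attribute set is reached."""
    all_attrs = list(range(len(table[0])))
    target = dependency(table, all_attrs, labels)
    chosen = []
    while dependency(table, chosen, labels) < target:
        best = max((a for a in all_attrs if a not in chosen),
                   key=lambda a: dependency(table, chosen + [a], labels))
        chosen.append(best)
    return chosen

# Hypothetical tokenized documents and class labels.
docs = [
    ["goal", "match", "goal"],
    ["match", "goal", "team"],
    ["stock", "market", "stock"],
    ["market", "stock", "team"],
]
labels = ["sport", "sport", "finance", "finance"]

vocab, weights = tf_idf(docs)
# Discretize each term column, then turn the columns back into rows.
columns = [equal_width([w[j] for w in weights]) for j in range(len(vocab))]
table = [list(row) for row in zip(*columns)]
reduct = greedy_reduct(table, labels)
selected_terms = [vocab[a] for a in reduct]
```

On this toy data a single term is enough to keep the dependency of the full term set, which mirrors on a very small scale the dimensionality reduction reported in the abstract. Each surviving attribute pattern can then be read as a rule of the form "if the term falls in interval k, then the class is c".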

Keywords/Search Tags:

Text classification, Feature selection, Weight calculation, Rough set, Attributes reduction

Related items

1	Study On Text Classification Based On Rough Set And Support Vector Machine
2	Research On Text Emotion Classification Based On Rough Set
3	The Research On Text Classification Technology Based On The Rough Set Theory
4	Research On Optimization Of Text Classification Based On Improved Rough Set Model
5	Research On Integrated Classification Algorithm Based On Rough Set Attribute Reduction
6	Study Of Chinese Text Classification
7	Study Of Web Text Mining Based On Rough Set Theory
8	The Research Of Chinese Text Categorization Based On Rough Set In Spam Filtering
9	Text Sentiment Analysis Based On Text Classification
10	The Design And Implementation Of Text Classification System Based On SVM-KNN