
Study on a Chinese Text Classification Algorithm Based on Rough Sets and Its Application

Posted on: 2011-11-24
Degree: Master
Type: Thesis
Country: China
Candidate: B F Zhang
Full Text: PDF
GTID: 2178360302993792
Subject: Computer application technology
Abstract/Summary:
With the rapid development of network technology, the amount of information available online is increasing dramatically, and organizing and managing these online documents effectively has become an urgent problem. Text classification is a key solution to this problem. It is an important branch of text mining and has been studied in increasing depth because of its knowledge discovery function. Text classification is already used in a wide range of fields, such as information filtering, information retrieval, and digital library services, and it has broad application prospects.

Rough set theory can deal with fuzzy and uncertain knowledge. It can effectively analyze and process incomplete, inconsistent, and inaccurate data without any prior information. Knowledge can thus be analyzed with a mathematical method, implicit knowledge can be discovered, and potential rules can be revealed. The main idea of rough set theory is to reduce the dimension of feature vectors without affecting classification accuracy, and to obtain the simplest classification rules.

This thesis studies a text classification system based on rough set theory systematically and in depth, and the resulting algorithms were applied to the classification system of the Public Security Intelligence System. The main work of the thesis is as follows:

(1) The relevant text classification techniques are described, and several commonly used text classification algorithms are analyzed and compared in detail.

(2) The traditional TFIDF method treats the document set as a whole and does not take into account the distribution of features among and within classes. To address this problem, an improved TFIDF method combined with information entropy is proposed. The method modifies the TFIDF feature weight calculation by incorporating the information entropy of features among and within classes. This overcomes the defect that features contributing little to categorization are given large weights, and it calculates the weights of text features more efficiently.

(3) The traditional feature selection method, which filters features using a frequency threshold, can lose information and reduce classification precision. To address this problem, a novel automatic text categorization method based on rough sets is proposed. In the proposed method, the weighted attribute features are first discretized to form a decision table; then, conditional attributes of the decision table are selected on the basis of attribute significance, which is defined by dependency degree; finally, the text attribute features are reduced by a heuristic algorithm based on conditional information entropy.

(4) The improved TFIDF method and the rough-set-based automatic text categorization method proposed in this thesis were applied to the public security intelligence classification subsystem. Practical application shows that the system obtains better text classification results.
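
The attribute selection step in item (3) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes the term weights have already been discretized into a small decision table, uses the standard rough-set dependency degree (the fraction of documents whose equivalence class under the chosen attributes carries a single class label), and substitutes a plain greedy search for the conditional-entropy heuristic named in the abstract. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, labels, attrs):
    """Dependency degree: share of rows whose equivalence class is label-pure."""
    if not rows:
        return 0.0
    positive = sum(len(block) for block in partition(rows, attrs)
                   if len({labels[i] for i in block}) == 1)
    return positive / len(rows)

def greedy_reduct(rows, labels):
    """Add the attribute with the largest dependency gain until the
    dependency of the full attribute set is reached (assumed strategy)."""
    all_attrs = list(range(len(rows[0])))
    target = dependency(rows, labels, all_attrs)
    chosen = []
    while dependency(rows, labels, chosen) < target:
        best = max((a for a in all_attrs if a not in chosen),
                   key=lambda a: dependency(rows, labels, chosen + [a]))
        chosen.append(best)
    return chosen

# Toy decision table: each row is a document, each column a discretized
# term weight (0 = low, 1 = high). Column 2 is constant and carries no information.
rows = [(1, 0, 1), (1, 1, 1), (0, 1, 1), (0, 0, 1)]
labels = ["A", "A", "B", "C"]

print(greedy_reduct(rows, labels))
```

On this toy table the search keeps columns 0 and 1 and drops the constant column 2, which is the kind of dimension reduction without loss of classification ability that the abstract describes.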
Keywords/Search Tags: text classification, TFIDF, vector space model, rough sets, attribute reduction