
The Research And Application Of Rough Set In Text Categorization System

Posted on: 2008-02-11
Degree: Master
Type: Thesis
Country: China
Candidate: S M Yang
Full Text: PDF
GTID: 2178360215972132
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of computer and communication technology, people can acquire more and more text information. How to organize and process such a large amount of document information, and how to find the information a user wants quickly, accurately, and completely, is a great challenge for information science and technology. As a key technology for organizing and processing large amounts of document data, text categorization assigns one or more suitable classes to a text based on an analysis of its content. Moreover, text categorization has broad application prospects as the technical basis of information filtering, search engines, text databases, digital libraries, and so on. This paper systematically and deeply studies a text categorization system based on rough set theory. The research results are described in detail as follows.

Rough set theory, proposed in 1982 by the Polish mathematician Z. Pawlak, is a powerful mathematical tool for analyzing uncertain and fuzzy knowledge. As a new hotspot in the field of artificial intelligence, rough sets can effectively handle the representation and inference of incomplete and uncertain knowledge. The theory requires no prior probability information beyond the data set used to solve the problem; it provides a formal model that can be analyzed and processed by mathematical methods; it can obtain a minimum feature set, reducing the dimensionality of the feature vector without affecting categorization accuracy; and it can derive the simplest rules. By contrast, some other methods, such as Naïve Bayes and KNN, cannot produce explicitly expressed rules.

(1) This paper introduces rough sets together with their related theory and methods and the basic content of text categorization; analyzes their research background and current state; and discusses future development trends and hot research fields. All of the above form the basis of the paper.

(2) On the basis of common relative reduction algorithms for rough sets and the tabu search algorithm, an improved attribute reduction algorithm is presented after studying the advantages and disadvantages of existing attribute reduction algorithms. The improved algorithm uses attribute importance as its heuristic information and can obtain a minimal reduct.

(3) In order to avoid word segmentation of the text, this paper presents a text representation method and an algorithm for extracting key words. The algorithm overcomes a problem of the GF/GL method presented by Zhang Xueying, which fails when the frequency of identical characters is 1, especially in literary texts.

This paper presents a rough-set-based text categorization system model that mainly includes a text preprocessing module, an attribute reduction module, and a rule matching module, with attribute reduction and rule matching studied in depth. Finally, simulation experiments show that rough-set-based text categorization is feasible.

The drawbacks of this paper lie in two aspects: the word dictionary and stop word list are limited, and the computation of knowledge granularity is handled only at the current research stage and has not yet formed a unified knowledge structure. Knowledge granularity importance, used as heuristic information in attribute reduction and text representation, is so far applied only to a small extent, and the study of soft computing remains an open question. Many problems are still worth further discussion. The algorithms of this paper are feasible, but the related algorithms and the simulation test system need further development.
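The abstract leans on the standard rough set machinery of indiscernibility and approximation without spelling it out. As a minimal Python sketch of those textbook definitions (not code from the thesis; the decision table, attribute names, and values below are hypothetical):

    from collections import defaultdict

    def indiscernibility_classes(table, attrs):
        """Partition objects into classes that agree on every attribute
        in `attrs` (the indiscernibility relation IND(attrs))."""
        classes = defaultdict(set)
        for obj, row in table.items():
            classes[tuple(row[a] for a in attrs)].add(obj)
        return list(classes.values())

    def approximations(table, attrs, target):
        """Lower approximation: classes wholly inside `target` (certain
        members). Upper approximation: classes intersecting `target`
        (possible members)."""
        lower, upper = set(), set()
        for cls in indiscernibility_classes(table, attrs):
            if cls <= target:
                lower |= cls
            if cls & target:
                upper |= cls
        return lower, upper

    # Hypothetical toy table: document id -> term features
    table = {1: {"freq": "high", "pos": "title"},
             2: {"freq": "high", "pos": "body"},
             3: {"freq": "low",  "pos": "body"},
             4: {"freq": "high", "pos": "body"}}
    target = {1, 2}  # documents known to belong to some class
    print(approximations(table, ["freq", "pos"], target))
    # ({1}, {1, 2, 4}): 1 certainly belongs; 2 and 4 are indiscernible

The gap between the lower and upper approximations is exactly the uncertainty rough sets model without any prior probabilities, which is the property the abstract emphasizes.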
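The improved attribute reduction algorithm of point (2) is named but not detailed in the abstract. For context, here is a minimal sketch of the standard significance-guided greedy reduction that such algorithms build on, using growth of the positive region as the attribute importance measure; the thesis's version additionally combines tabu search to escape local optima, which this sketch omits, and the toy table is hypothetical:

    from collections import defaultdict

    def positive_region_size(table, attrs, decision):
        """Count objects whose indiscernibility class under `attrs` is
        consistent, i.e. all its members share one decision value."""
        classes = defaultdict(list)
        for obj, row in table.items():
            classes[tuple(row[a] for a in attrs)].append(obj)
        return sum(len(objs) for objs in classes.values()
                   if len({decision[o] for o in objs}) == 1)

    def greedy_reduct(table, attrs, decision):
        """Forward selection guided by attribute significance: add the
        attribute that most enlarges the positive region, stopping when
        the reduct preserves the full attribute set's dependency."""
        full = positive_region_size(table, list(attrs), decision)
        reduct = []
        while positive_region_size(table, reduct, decision) < full:
            best = max((a for a in attrs if a not in reduct),
                       key=lambda a: positive_region_size(table, reduct + [a], decision))
            reduct.append(best)
        return reduct

    # Hypothetical decision table: object -> features, plus class labels
    table = {1: {"f1": 1, "f2": 0, "f3": 1},
             2: {"f1": 1, "f2": 1, "f3": 1},
             3: {"f1": 0, "f2": 0, "f3": 0},
             4: {"f1": 0, "f2": 1, "f3": 0}}
    decision = {1: "A", 2: "A", 3: "B", 4: "B"}
    print(greedy_reduct(table, ["f1", "f2", "f3"], decision))  # ['f1']

In a text categorization setting, dropping the attributes outside the reduct is what delivers the dimensionality reduction without loss of categorization accuracy that the abstract claims.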
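The rule matching module is likewise only named. A plausible minimal form, assuming rules are attribute-value conditions mined from the reduced decision table (the rule set and features below are invented for illustration), might look like this:

    def match_rules(rules, sample):
        """Return the classes of every rule whose conditions the sample
        satisfies; a fuller system would rank hits by rule confidence."""
        return [cls for cond, cls in rules
                if all(sample.get(a) == v for a, v in cond.items())]

    # Hypothetical rules mined from a reduced decision table
    rules = [({"freq": "high", "pos": "title"}, "sports"),
             ({"freq": "low"}, "finance")]
    print(match_rules(rules, {"freq": "high", "pos": "title"}))  # ['sports']

Because rough set reduction yields explicit rules of this form, the resulting classifier stays human-readable, unlike the Naïve Bayes and KNN models the abstract contrasts it with.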
Keywords/Search Tags: Rough sets, knowledge granularity, importance, text categorization, attribute reduction, matching rules