A Study of Automatic Text Classification Based on Rough Sets

Posted on: 2007-11-18 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Zhang | Full Text: PDF
GTID: 2208360185464695 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of the World Wide Web, the network has become an effective platform for exchanging and processing information, and more and more of that information is expressed as text. To organize and analyze massive Web information resources effectively and to help users obtain knowledge and information promptly, it is increasingly important to find effective text classification algorithms for classifying and organizing large-scale document collections on the Web.

Rough set theory was proposed by Z. Pawlak in 1982 as a new computational tool. It can effectively analyze and process imprecise, inconsistent, and uncertain information without any prior information. Since its introduction into machine learning and artificial intelligence, it has been applied successfully in knowledge acquisition, rule generation, decision analysis, pattern recognition, and data mining. This thesis carries out in-depth research on text mining based on rough set theory. The main work is as follows:

1. Chinese word segmentation is both the prerequisite for and a major difficulty in analyzing Chinese text. Building on previous methods, we design a new Chinese word segmentation algorithm that tags lexicon entries as useful or useless words and takes ambiguous words into account. With this method, a small number of synthetic features can be extracted that represent the original information well, which reduces both the dimensionality and the time complexity.

2. A new term weighting algorithm is applied to automated text categorization. The algorithm considers the distribution of terms both among classes and within each class.

3. A rough-set-based reduction algorithm is improved and then applied to extract text categorization rules. First, a decision table is created in which the discretized weights of the text's characteristic terms serve as the condition attributes of the rules. Then the categorization rules are extracted through rough set (RS) knowledge reduction. The number of extracted rules is reduced, and the accuracy and speed of text categorization are improved.
Keywords/Search Tags: Text categorization, Rough set theory, Textual feature extraction, Word segmentation, Reduction algorithm, Text clustering
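
The decision-table and reduction step described in the abstract can be illustrated with a minimal sketch. This is not the thesis's improved algorithm, which the abstract does not specify; it is a generic greedy attribute reduction based on the rough set positive region. The terms, categories, and discretized weights below are hypothetical toy data, and all function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical decision table: each row is one document, each column is the
# discretized weight (0 = low, 1 = medium, 2 = high) of one feature term.
TERMS = ["goal", "match", "stock", "price"]   # assumed condition attributes
ROWS = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [2, 0, 0, 1],
    [0, 0, 2, 1],
    [0, 1, 1, 2],
    [0, 0, 1, 2],
]
LABELS = ["sports", "sports", "sports", "finance", "finance", "finance"]


def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())


def positive_region_size(rows, labels, attrs):
    """Count rows whose equivalence class belongs to exactly one category."""
    return sum(
        len(block)
        for block in partition(rows, attrs)
        if len({labels[i] for i in block}) == 1
    )


def greedy_reduct(rows, labels):
    """Add attributes one at a time until the positive region is as large
    as the one obtained with the full attribute set."""
    all_attrs = list(range(len(rows[0])))
    target = positive_region_size(rows, labels, all_attrs)
    reduct = []
    while positive_region_size(rows, labels, reduct) < target:
        candidates = [a for a in all_attrs if a not in reduct]
        best = max(
            candidates,
            key=lambda a: positive_region_size(rows, labels, reduct + [a]),
        )
        reduct.append(best)
    return reduct


def extract_rules(rows, labels, reduct):
    """Turn each consistent equivalence class into one 'values -> category' rule."""
    rules = {}
    for block in partition(rows, reduct):
        categories = {labels[i] for i in block}
        if len(categories) == 1:
            key = tuple(rows[block[0]][a] for a in reduct)
            rules[key] = categories.pop()
    return rules


reduct = greedy_reduct(ROWS, LABELS)
rules = extract_rules(ROWS, LABELS, reduct)
print("reduct:", [TERMS[a] for a in reduct])
for values, category in rules.items():
    print(dict(zip((TERMS[a] for a in reduct), values)), "->", category)
```

On this toy table a single attribute already separates the two categories, so six document rows collapse into three rules. That mirrors, in a simplified way, the reduction in rule count that the abstract reports; a real system would need a discretization step and handling for inconsistent rows.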