Study On Web-Pages Classification Based On Rough Set And "Rule+Exception"

Posted on: 2008-03-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Liu
Full Text: PDF
GTID: 2178360242458963
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology, the volume of information on the network has grown explosively, and making that information easier and more efficient to use has become an active research topic. Information on the Internet is poorly organized and spread across a huge number of pages, yet people want to retrieve it quickly and accurately. Automatic web page classification is regarded as a good approach to this problem.

To organize and analyze massive web information resources effectively, and to help users promptly obtain the knowledge and information they need, this thesis extracts different rules according to users' different requirements and analyzes the exceptions that remain, with the aim of accurate classification based on the learning theory that rules and exceptions are complementary. The thesis studies Chinese web text mining in depth, in both theory and application. It proposes applying rough sets and the "rule + exception" learning theory in natural language processing to Chinese web text mining, and it implements a classifier for Chinese web page text. The key techniques of Chinese web page classification and the main theory of rough sets, rule induction and exception analysis are introduced systematically. Finally, a Chinese web page classifier is designed under the guidance of this theory.

The achievements of this thesis are as follows. Unlike general text classification, web page classification requires collecting Chinese web pages, preprocessing them, and saving the weights of the text information. First, a preemptive multi-threaded web text collector is implemented, which collects the web pages of a specified catalog using a depth-first algorithm. Second, a web text preprocessor is implemented, which removes meaningless HTML tags and extracts the web text by a recursive matching method. Third, a weight-computing algorithm is improved to take into account the characteristics of both the text information and the web page information. Most importantly, an attribute reduction algorithm oriented to users' requirements is proposed and shown to be highly effective in the text classification system, and a Reduct exception analysis method based on rough set theory is proposed by analyzing why rules and exceptions arise in web page text classification.

Finally, the design process for Chinese web page text classification is laid out, and a Chinese web page text classifier based on rough set theory and "rule + exception" is implemented according to that process. To evaluate the classifier, two experiments were carried out and their results compared. The results show that both the efficiency and the correctness of the web page text classification system are higher, and that this work is a useful reference for the field of text classification.
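The abstract does not give the thesis's algorithms, so the following is only a minimal, generic sketch of the two ideas it names: rough-set attribute reduction, and classification by rules with stored exceptions. The toy decision table, the attribute names, and every function name (`partition`, `dependency`, `greedy_reduct`, `rules_with_exceptions`, `classify`) are assumptions made for illustration, and the greedy reduct is a common textbook heuristic rather than the user-oriented reduction algorithm proposed in the thesis.

```python
from collections import Counter, defaultdict

# Hypothetical decision table: each row is a page described by binary
# term-presence attributes, paired with its class label.
ROWS = [
    ({"goal": 1, "match": 1, "stock": 0}, "sports"),
    ({"goal": 1, "match": 0, "stock": 0}, "sports"),
    ({"goal": 0, "match": 0, "stock": 1}, "finance"),
    ({"goal": 0, "match": 1, "stock": 1}, "finance"),
    ({"goal": 1, "match": 1, "stock": 1}, "finance"),
    ({"goal": 0, "match": 1, "stock": 0}, "sports"),
]
ATTRS = ["goal", "match", "stock"]


def partition(rows, attrs):
    """Group row indices into equivalence classes that agree on attrs."""
    blocks = defaultdict(list)
    for i, (features, _) in enumerate(rows):
        blocks[tuple(features[a] for a in attrs)].append(i)
    return blocks


def dependency(rows, attrs):
    """Fraction of rows in the positive region (blocks with one label)."""
    consistent = 0
    for members in partition(rows, attrs).values():
        if len({rows[i][1] for i in members}) == 1:
            consistent += len(members)
    return consistent / len(rows)


def greedy_reduct(rows, attrs):
    """Add the attribute that raises dependency most until the
    dependency of the full attribute set is reached."""
    target = dependency(rows, attrs)
    chosen = []
    while dependency(rows, chosen) < target:
        best = max((a for a in attrs if a not in chosen),
                   key=lambda a: dependency(rows, chosen + [a]))
        chosen.append(best)
    return chosen


def rules_with_exceptions(rows, attrs):
    """The majority label of each block becomes a rule; rows that
    disagree with it are stored as exceptions on all attributes."""
    rules, exceptions = {}, {}
    for key, members in partition(rows, attrs).items():
        majority = Counter(rows[i][1] for i in members).most_common(1)[0][0]
        rules[key] = majority
        for i in members:
            features, label = rows[i]
            if label != majority:
                exceptions[tuple(sorted(features.items()))] = label
    return rules, exceptions


def classify(features, attrs, rules, exceptions):
    """An exact-match exception overrides the general rule."""
    full_key = tuple(sorted(features.items()))
    if full_key in exceptions:
        return exceptions[full_key]
    return rules.get(tuple(features[a] for a in attrs))


if __name__ == "__main__":
    print(greedy_reduct(ROWS, ATTRS))
    rules, exceptions = rules_with_exceptions(ROWS, ["goal"])
    print(rules, exceptions)
```

In this toy table the single attribute `stock` already separates the two classes, so the greedy reduct keeps only that attribute. Building rules from `goal` alone instead yields two general rules plus two stored exceptions, which shows how a coarse rule set and its exceptions can complement each other.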
Keywords/Search Tags: text classification, feature extraction, rough sets, rule induction, exception analysis