Study On Web-Pages Classification Based On Rough Set And "Rule+Exception"

Posted on:2008-03-18

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Liu

Full Text:PDF

GTID:2178360242458963

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of information technology, the volume of information on the network has grown explosively, and making that information easier and more efficient to use has become an active research topic. Information on the Internet is poorly organized and spread over a huge number of pages, yet people want to retrieve it quickly and accurately. Automatic web page classification is regarded as a promising way to address this problem.

To organize and analyze massive web information resources effectively and to help users promptly obtain the knowledge and information they need, this thesis extracts different rules according to users' different requirements and analyzes the exceptions that remain, aiming at accurate classification on the basis of the learning theory that rules and exceptions are complementary. The thesis studies Chinese web text mining in depth, in both theory and application, proposes applying rough sets and the "rule + exception" learning theory in natural language processing to Chinese web text mining, and implements a classifier for Chinese web page text. It systematically introduces the key techniques of Chinese web page classification and the main theory of rough sets, rule induction, and exception analysis. Finally, a Chinese web page classifier is designed under the guidance of this theory.

The contributions of this thesis are as follows.

Unlike general text classification, web page classification requires collecting Chinese web pages, preprocessing them, and saving the weights of the text information. First, a preemptive multi-threaded web text collector is implemented, which gathers the web pages of a specified catalog using a depth-first algorithm. In addition, a web text preprocessor is implemented, which removes meaningless HTML tags and extracts the web text by a recursive matching method. Furthermore, a weight computing algorithm is improved by taking into account the characteristics of both text information and web page information.

Most importantly, an attribute reduction algorithm oriented to users' requirements is proposed and shown to be highly effective in the text classification system, and a Reduct exception analysis method based on rough set theory is proposed by analyzing why rules and exceptions arise in web page text classification.

Finally, the design process of Chinese web page text classification is laid out, and a Chinese web page text classifier based on rough set theory and "rule + exception" is implemented according to that process. To evaluate the performance of the classifier, two experiments were conducted and their results compared. The results show that both the efficiency and the correctness of the web page text classification system are improved, and that the work can serve as a reference in the field of text classification.
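The complementary roles of rules and exceptions described in the abstract can be illustrated with a small sketch. It is not the thesis's algorithm: the toy page features (`goal`, `match`, `stock`, `price`), the labels, and the function names `partition`, `dependency`, `learn_rules`, and `classify` are all assumptions made for illustration. It groups pages into rough-set equivalence classes over a chosen attribute subset, measures how much of the data that subset classifies consistently, takes the majority label of each class as a rule, and stores the pages that contradict their rule as exceptions.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (binary term features, category label) per page.
rows = [
    ({"goal": 1, "match": 1, "stock": 0, "price": 0}, "sports"),
    ({"goal": 1, "match": 0, "stock": 0, "price": 0}, "sports"),
    ({"goal": 0, "match": 0, "stock": 1, "price": 1}, "finance"),
    ({"goal": 0, "match": 0, "stock": 1, "price": 0}, "finance"),
    ({"goal": 1, "match": 1, "stock": 1, "price": 1}, "finance"),
]

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, (features, _label) in enumerate(rows):
        blocks[tuple(features[a] for a in attrs)].append(i)
    return blocks

def dependency(rows, attrs):
    """Share of rows in the positive region: classes whose rows agree on the label."""
    consistent = 0
    for idxs in partition(rows, attrs).values():
        if len({rows[i][1] for i in idxs}) == 1:
            consistent += len(idxs)
    return consistent / len(rows)

def learn_rules(rows, attrs):
    """Majority label per class becomes a rule; contradicting rows become exceptions."""
    rules, exceptions = {}, {}
    for key, idxs in partition(rows, attrs).items():
        majority = Counter(rows[i][1] for i in idxs).most_common(1)[0][0]
        rules[key] = majority
        for i in idxs:
            features, label = rows[i]
            if label != majority:
                # Exceptions are keyed by the full feature vector.
                exceptions[tuple(sorted(features.items()))] = label
    return rules, exceptions

def classify(features, attrs, rules, exceptions, default=None):
    """Check stored exceptions first, then fall back to the general rule."""
    full_key = tuple(sorted(features.items()))
    if full_key in exceptions:
        return exceptions[full_key]
    return rules.get(tuple(features[a] for a in attrs), default)

# A user-chosen attribute subset (an assumption standing in for a reduct).
attrs = ["goal"]
rules, exceptions = learn_rules(rows, attrs)
```

In this toy table the single attribute `goal` classifies only two of the five pages consistently, so the rule "goal present means sports" needs one stored exception: the page that mentions both goals and stock prices is labelled finance. Keeping such cases separate lets the general rules stay short while the exceptions carry the residual accuracy.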

Keywords/Search Tags:

text classification, feature extracting, rough sets, rule induction, exception analysis

Related items

1	The Methods Of Classification Rule Extraction Based On Rough Sets Technique
2	Based On Rough Set Text Automatic Classification Study
3	Research On Multi-label Text Classification Methods Based On Rough Sets
4	Text Classification Model Based On Fuzzy-Rough Sets Theory
5	Study On Chinese Text Classification Algorithm Based On Rough Set And It's Application
6	The Research Of Chinese Text Categorization Based On Rough Set In Spam Filtering
7	Research On Data Stream Classification Based On Granular Computing And F-Rough Sets Extension
8	Research Of Medical Image Classification Approach Based On Rough Sets And Association Rule
9	Research And Application Of Text Feature Reduction And Classification Rule Extraction
10	The Study On The Neural Network Classification And The Extracting Rule Technology