
Genre-Based Classification of Chinese Web Pages

Posted on: 2008-12-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Huang
Full Text: PDF
GTID: 2208360215484774
Subject: Computer application technology
Abstract/Summary:
With the development of communication and Internet technology, society has moved from an age of information scarcity into one in which digitized information is extremely abundant. The rapid growth of information resources stored as text creates an urgent demand for fast, automatic text categorization. At present, however, most research on automatic text categorization focuses on content and does not consider the function, form, and structure of a text, so in some respects it does not meet users' needs. Research on genre-based text categorization can, to some extent, contribute to effective information management and retrieval. This thesis presents research on genre-based categorization of Chinese web pages. Its main points are as follows:

(1) Feature selection for genre-based categorization of Chinese web pages. Traditional text categorization considers only words as features. In addition, because of differences in linguistic expression between Chinese and English, research on feature selection for English text categorization is not fully applicable to Chinese text categorization. This thesis studies the different kinds of features that distinguish genre classes. A new feature representation, the fuzzy character pattern, is proposed to express the linguistic characteristics of different genre classes. Fuzzy character pattern features are obtained by combining automatic extraction with manual induction. In the implementation, candidate features are extracted by sequence mining over a modified storage structure of the PAT tree (Patricia tree), so that the classifier is freed from word segmentation procedures and large dictionaries. The method also avoids the shortcomings that word segmentation introduces in traditional methods: the dictionary-updating problem and the poor extraction of new words and English phrases.

(2) Term weight computation. Because genre categorization draws on multiple feature sets, this thesis examines how well each feature space distinguishes the genre classes and how that ability can be evaluated. On this basis, a term weight adjustment approach is presented that modifies the weights of terms in each feature space according to the distinguishing ability of that feature space.

(3) Categorization algorithm. This thesis introduces the mining of association classification rules for text and improves the performance of an SVM (Support Vector Machine) classifier by combining it with association rule categorization. It also discusses association rule mining, rule optimization, and the combination of classifiers, and proposes an improved rule optimization approach and algorithm.

The experimental results show that the proposed feature selection method is feasible overall and that introducing fuzzy character pattern features helps to improve the classification results for some genre classes. Furthermore, the proposed strategy for evaluating how well each feature space distinguishes the genre classes agrees with empirical knowledge. The weight adjustment method also improves the overall performance of the classifier. Association rules help the SVM classifier to some extent overall, but the improvement is not marked.
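The weight adjustment idea in point (2) can be illustrated with a small example. The abstract does not give the actual formula, so the following is only a minimal sketch under stated assumptions: each term belongs to exactly one feature space, each feature space has a precomputed score for its distinguishing ability, and a term's base weight is multiplied by its space's score after the scores are normalized to average 1.0. The function name `adjust_weights`, its parameters, and the normalization are all hypothetical and are not taken from the thesis.

```python
from typing import Dict


def adjust_weights(
    base_weights: Dict[str, float],
    term_space: Dict[str, str],
    space_score: Dict[str, float],
) -> Dict[str, float]:
    """Scale each term's base weight by the score of its feature space.

    Assumption (not from the thesis): scores are normalized so that their
    mean is 1.0, which keeps the adjusted weights on the same overall scale.
    """
    total = sum(space_score.values())
    count = len(space_score)
    # Normalized multiplier per feature space (mean of multipliers is 1.0).
    factor = {space: score * count / total for space, score in space_score.items()}
    # Apply the multiplier of the feature space each term belongs to.
    return {
        term: weight * factor[term_space[term]]
        for term, weight in base_weights.items()
    }


if __name__ == "__main__":
    # Hypothetical base weights (for example, TF-IDF values) and feature spaces.
    base = {"term_a": 1.0, "term_b": 2.0}
    spaces = {"term_a": "word", "term_b": "punctuation"}
    scores = {"word": 3.0, "punctuation": 1.0}
    print(adjust_weights(base, spaces, scores))
```

In this sketch, terms from a feature space with an above-average score gain weight and terms from a weaker space lose weight. How the thesis actually measures distinguishing ability is not stated in the abstract.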
Keywords/Search Tags: web page categorization, genre, fuzzy character pattern, sequence mining, weight adjustment