
Chinese Web Page Classification Based On Web Page Features

Posted on: 2010-01-25
Degree: Master
Type: Thesis
Country: China
Candidate: Z Zhu
Full Text: PDF
GTID: 2178360275977645
Subject: Computer application technology
Abstract/Summary:
With the rapid development and growing popularity of the World Wide Web, we have moved from an age in which useful information was scarce to a digital age in which information is extremely abundant. Given the vast amount of information online, it is difficult to find what is useful quickly and effectively, so organizing and managing this information has become an important research topic. Traditionally, Web documents have been classified manually, which is time-consuming and labor-intensive. For these reasons, automatic Web page categorization has been studied as a way to cope with the exponential growth of online information. Combined with technologies from information retrieval, search engines, and information filtering, Web information categorization has become one of the important tools for acquiring information on the Internet.

The main contributions of the thesis are as follows.

(1) We discuss and study the key techniques for Web page categorization, including document representation, feature selection, and classification, together with the difficulties of each.

(2) We propose an attribute set for the automatic recognition of news Web pages, based on the characteristics of such pages. The attribute set combines URL attributes with structure attributes and content attributes. Three classifiers are constructed using three different classification algorithms. The experimental results demonstrate that a classifier built on the proposed attribute set achieves high accuracy in recognizing news Web pages.

(3) We propose a Chinese Web page classification method that uses hyperlink information to improve classification accuracy, based on the characteristics of Chinese Web pages. The hyperlink classification is used to improve the accuracy of the classifier for the automatic recognition and classification of news Web pages. The experimental results demonstrate that the classifier achieves high accuracy on Chinese Web pages.

(4) We design and implement a topic-oriented Web search engine. Its crawler collects only news Web pages, using the automatic news Web page recognition method described above.
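As a rough illustration of what an attribute set combining URL, structure, and content attributes might look like, the following Python sketch extracts a handful of such features from a page. The abstract does not list the actual attributes or name the three classification algorithms, so every feature here (`url_depth`, `url_has_date`, `link_text_ratio`, and so on) is a hypothetical stand-in chosen for illustration, not the thesis's attribute set.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse


class _PageStats(HTMLParser):
    """Counts links and measures visible text in an HTML page."""

    def __init__(self):
        super().__init__()
        self.link_count = 0       # number of <a> tags
        self.link_text_len = 0    # characters of text inside <a> tags
        self.text_len = 0         # characters of all visible text
        self._in_link = False
        self._skip = 0            # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_count += 1
            self._in_link = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        n = len(data.strip())
        self.text_len += n
        if self._in_link:
            self.link_text_len += n


def extract_features(url, html):
    """Return a dict of hypothetical URL, structure, and content attributes."""
    path = urlparse(url).path
    stats = _PageStats()
    stats.feed(html)
    stats.close()
    text_len = stats.text_len
    return {
        # URL attributes (assumed): article pages often sit deep in the
        # path and carry a date or a long numeric identifier.
        "url_depth": len([p for p in path.split("/") if p]),
        "url_has_date": bool(re.search(r"(19|20)\d{2}[-/_]?\d{2}", path)),
        "url_has_numeric_id": bool(re.search(r"\d{5,}", path)),
        # Structure attributes (assumed): index pages tend to be link-heavy.
        "link_count": stats.link_count,
        "link_text_ratio": stats.link_text_len / text_len if text_len else 0.0,
        # Content attribute (assumed): amount of visible text.
        "text_length": text_len,
    }
```

A feature dictionary of this kind could be converted to a numeric vector and passed to any standard classifier; which algorithms the thesis actually used is not stated in the abstract.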
Keywords/Search Tags:Data Mining, Feature Selection, Web Page Classification, Hyperlink