
Implementation Of Automatic Classification Of Chinese Web Pages

Posted on: 2003-11-26 | Degree: Master | Type: Thesis
Country: China | Candidate: K L Wang | Full Text: PDF
GTID: 2208360065456067 | Subject: Computer applications
Abstract/Summary:
Search engines are a principal tool for retrieving information on the Internet, and automatic categorization of Chinese web pages is an important research direction in building Chinese search engines. With automatic categorization, web pages are stored in separate databases according to their category information, which improves the recall and precision of a Chinese search engine. At the same time, the categorization results form a resource that provides users with a category catalog. In addition, the quality of automatic categorization has, to some extent, a positive effect on the subsequent relevance-ranking process.

This thesis analyzes the structural components of a web page that contribute to categorization. Taking into account the characteristics of Chinese web pages and the word-segmentation quality required for web page analysis, it simplifies and adjusts the existing longer/longest-match word segmentation algorithm and applies it to automatic categorization. The IDF (Inverse Document Frequency) formula, used in information retrieval to calculate the relevance weight of terms between keywords and relevant documents, is combined with the results of Chinese web page analysis to obtain a formula with an adjustable parameter for calculating the degree of correlation. A categorization correlation-degree vector library, which stores the results of categorization training, is designed and built to meet the needs of this formula. Using the corpus training results and the VSM (vector space model), a practical method for automatic categorization of Chinese web pages is implemented.

Closed and open tests show that this method raises the correct recognition rate for relevant web pages to above 90% with little loss of efficiency, which is superior to the earlier Probability Distributing Algorithm. The method is expected to have good application prospects.
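
The pipeline the abstract describes (dictionary-based word segmentation, IDF term weighting, and vector-space comparison against trained category vectors) can be illustrated with a minimal Python sketch. It is not the thesis's method: the abstract does not give the adjustable-parameter formula or the segmentation adjustments, so the sketch assumes plain forward maximum matching, the textbook weight tf * log(N / df), and cosine similarity. The function names, the tiny lexicon, and the training samples are all hypothetical.

```python
import math
from collections import Counter

def segment(text, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def idf_weights(docs):
    """IDF per term: log(N / df), df = number of documents containing the term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def weighted_vector(tokens, idf):
    """Sparse tf * idf vector; terms unseen in training are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train(labeled_docs):
    """Sum the weighted vectors of each category into one category vector."""
    idf = idf_weights([tokens for _, tokens in labeled_docs])
    categories = {}
    for category, tokens in labeled_docs:
        vec = categories.setdefault(category, Counter())
        for term, weight in weighted_vector(tokens, idf).items():
            vec[term] += weight
    return idf, categories

def classify(tokens, idf, categories):
    """Return the category whose vector is most similar to the page vector."""
    vec = weighted_vector(tokens, idf)
    return max(categories, key=lambda c: cosine(vec, categories[c]))

# Hypothetical lexicon and training corpus, for illustration only.
lexicon = {"足球", "篮球", "比赛", "球队", "球员", "股票", "市场", "投资", "银行", "利率"}
training = [
    ("sports", segment("足球比赛球队", lexicon)),
    ("sports", segment("篮球比赛球员", lexicon)),
    ("finance", segment("股票市场投资", lexicon)),
    ("finance", segment("银行利率市场", lexicon)),
]
idf, categories = train(training)
label = classify(segment("足球球员", lexicon), idf, categories)
```

In this sketch the per-category vectors play the role of the "correlation-degree vector library" mentioned in the abstract, although the real library's layout and weighting are not specified there.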
Keywords/Search Tags: Automatic categorization, Search engine, IDF, VSM