
Research On Chinese Web Page Categorization And Implementation Of Pre-classification Algorithm

Posted on: 2010-03-02
Degree: Master
Type: Thesis
Country: China
Candidate: S M Xu
Full Text: PDF
GTID: 2198330332988628
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of information technology and the growing popularity of the Internet, the number of web pages has increased enormously. Classifying web pages automatically by their content has therefore become an important research subject for organizing and processing such a large amount of data.

This thesis first introduces the technologies involved in Chinese web page categorization and examines three machine-learning-based automatic classification algorithms: Category Centroid, Naive Bayes, and Support Vector Machine (SVM). We then implement an automatic classification system for Chinese web pages based on the vector space model and use it to carry out four experiments. The main conclusions of the experiments are as follows: the linear kernel function of SVM is the more suitable choice for Chinese web page categorization; document frequency is a fast and effective feature selection method for Chinese web pages; and the optimal number of features depends on the size of the training set and on the classification algorithm used.

Finally, we propose a pre-classification algorithm that relies on a given keyword list and exploits the characteristics of Chinese web pages, and we combine it with Category Centroid, Naive Bayes, and Support Vector Machine respectively. The experimental results show that this algorithm not only improves precision and recall but also greatly reduces the time required.
Keywords/Search Tags: Chinese Web Page Categorization, Category Centroid, Naive Bayes, Support Vector Machine
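
The abstract does not spell out how the keyword-based pre-classification is combined with a classifier, so the following is only a minimal sketch of one plausible arrangement, not the thesis's actual algorithm. It assumes that pages have already been segmented into word tokens (English stand-in tokens are used here), that a page is assigned directly when exactly one category reaches a keyword-hit threshold, and that all other pages fall back to a category centroid classifier with cosine similarity over term-frequency vectors. The function names, the `min_hits` threshold, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(labeled_docs):
    """Average the term-frequency vectors of each category's documents."""
    sums, counts = {}, Counter()
    for tokens, cat in labeled_docs:
        sums.setdefault(cat, Counter()).update(tokens)
        counts[cat] += 1
    return {
        cat: {t: w / counts[cat] for t, w in vec.items()}
        for cat, vec in sums.items()
    }


def pre_classify(tokens, keyword_lists, min_hits=2):
    """Return a category only if exactly one category matches at least
    min_hits distinct keywords (assumed rule); otherwise return None."""
    token_set = set(tokens)
    winners = [
        cat for cat, kws in keyword_lists.items()
        if len(token_set & set(kws)) >= min_hits
    ]
    return winners[0] if len(winners) == 1 else None


def classify(tokens, keyword_lists, centroids):
    """Try the cheap keyword step first, then fall back to the centroids."""
    cat = pre_classify(tokens, keyword_lists)
    if cat is not None:
        return cat, "keyword"
    vec = Counter(tokens)
    best = max(centroids, key=lambda c: cosine(vec, centroids[c]))
    return best, "centroid"


# Hypothetical toy data; real input would be segmented Chinese web pages.
train = [
    (["football", "match", "goal"], "sports"),
    (["match", "team", "coach"], "sports"),
    (["stock", "bank", "market"], "finance"),
    (["bank", "loan", "rate"], "finance"),
]
keyword_lists = {
    "sports": ["football", "goal", "coach"],
    "finance": ["stock", "bank", "loan"],
}
centroids = train_centroids(train)

print(classify(["football", "goal", "fans"], keyword_lists, centroids))
print(classify(["market", "rate", "news"], keyword_lists, centroids))
```

In this sketch, pages decided by the keyword step skip the similarity computation entirely, which is one way such a step could save time; whether the thesis obtains its reported speedup in this way is an assumption.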