The Ensembling Chinese Web Pages Classifier Based On Bayes And Outlinks

Posted on: 2010-12-28
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Ge
Full Text: PDF
GTID: 2178360275488979
Subject: Computer software and theory
Abstract/Summary:
Chinese web page classification is mainly grounded in text classification theory. However, because web pages are numerous and noisy, many text classification algorithms are not well suited to web pages, and their precision falls short of expectations. Text classification algorithms therefore cannot be applied to web page classification directly. Drawing on theories and algorithms from the web mining field, this thesis introduces several techniques for web page classification. The main work I completed is as follows:
(1) Following the outlinks of the target web page, I append the content of the outlinked pages in the same domain or directory to the target page and feed the combined text into the classification system.
(2) I design a web crawler that downloads only the target web page and its outlinked pages in the same domain.
(3) I introduce a different way to clean noise from the target page. If the number of topic words in the classifier's input is increased, that is, if words from topic-related pages are added, the proportion of noise words in the page decreases.
(4) I build a new dictionary for word segmentation. Unlike the dictionaries used by search engines, it contains only topic words and can be used for feature selection.
(5) Using the bagging method, I train several naive Bayes classification models on different training texts. Different models may give different results, and the final result is decided by voting.
(6) Based on the ideas above, I design an ensemble naive Bayes classification model and carry out experiments to refine the approach.
(7) No open training and test set is available to researchers in China, so I collected a training and test set for web page classification for these experiments and decided to release it for other researchers to use.
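
The same-domain outlink filter in (1) and (2) and the bagged naive Bayes vote in (5) and (6) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function and class names (`same_domain_outlinks`, `NaiveBayes`, `train_bagged`, `vote`) are hypothetical, the classifier is assumed to be a multinomial naive Bayes with add-one smoothing, "same domain" is assumed to mean an identical host name, and documents are assumed to be already segmented into token lists.

```python
import math
import random
from collections import Counter
from urllib.parse import urljoin, urlparse


def same_domain_outlinks(page_url, hrefs):
    """Resolve hrefs against page_url and keep those on the same host (assumption)."""
    host = urlparse(page_url).netloc
    kept = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == host and absolute != page_url:
            kept.append(absolute)
    return kept


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over token lists."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.class_counts[label] / total_docs)
            for w in tokens:
                if w in self.vocab:  # skip words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_bagged(docs, labels, n_models=5, seed=0):
    """Train each model on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(docs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(
            NaiveBayes().fit([docs[i] for i in idx], [labels[i] for i in idx])
        )
    return models


def vote(models, tokens):
    """Return the label predicted by the largest number of models."""
    ballots = Counter(m.predict(tokens) for m in models)
    return ballots.most_common(1)[0][0]
```

In this sketch, the token list passed to `vote` would be the target page's tokens concatenated with the tokens of the pages returned by `same_domain_outlinks`, which is one way to raise the share of topic words as described in (3).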
Keywords/Search Tags: Chinese web page classification, Outlinks in its own domain, Web noise content, Naive Bayes classifier, Ensemble classifier