
Bayesian Classifier And Web Document Classification

Posted on: 2006-04-05
Degree: Master
Type: Thesis
Country: China
Candidate: X J Hou
Full Text: PDF
GTID: 2208360155969500
Subject: Computer software and theory
Abstract/Summary:
The information resources on the Web are massive, dynamic, semi-structured, and heterogeneous. They are also disordered, because there is no uniform organization or management, and this makes information retrieval difficult. Automatic Web document categorization can be used to organize Web information resources effectively and to improve the efficiency of Web search. It has become an active research area in Web mining.

Bayesian classifiers are among the most important classification algorithms in data mining; they perform well and are easy to implement. Taking the Bayesian network as the framework, we study how these classifiers work and divide them into three categories according to the dependencies among the attribute nodes in the network. We discuss how to learn several representative classifier models and then use Bayesian classifiers to classify Web documents.

Most Web pages are written in HTML. We first analyze the characteristics of HTML documents and then discuss the key techniques of automatic text classification, including the Vector Space Model, Chinese word segmentation, and text feature selection. We implement a multinomial Naive Bayes classifier to classify Chinese Web pages. Through experiments, we evaluate the performance of seven feature selection algorithms: Document Frequency, Information Gain, Expected Cross Entropy, Mutual Information, Weight of Evidence for Text, the chi-square statistic (CHI), and Odds Ratio. To address the weaknesses of a single classifier, we combine several text classifiers using two different ensemble techniques, Boosting and Bagging. The experiments show that both combination schemes are effective.

Finally, we discuss a Bayesian network model that can be used to classify semi-structured documents.
Keywords/Search Tags:Bayesian classifiers, Web document, feature selection, Boosting, Bagging
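
For readers unfamiliar with the multinomial Naive Bayes classifier named in the abstract, the following is a minimal sketch of the general technique, not the thesis's implementation. It assumes documents have already been segmented into tokens (for Chinese pages, by a word segmenter that is not shown) and uses Laplace smoothing with a parameter `alpha`; the class name `MultinomialNaiveBayes`, the toy documents, and the labels are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Illustrative multinomial Naive Bayes over pre-tokenized documents."""

    def __init__(self, alpha=1.0):
        # alpha is the Laplace (add-alpha) smoothing constant; an assumption here.
        self.alpha = alpha

    def fit(self, docs, labels):
        self.class_doc_counts = Counter(labels)   # documents per class
        self.term_counts = defaultdict(Counter)   # term frequencies per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        self.total_terms = {
            c: sum(self.term_counts[c].values()) for c in self.class_doc_counts
        }
        return self

    def log_scores(self, tokens):
        """Return log P(c) + sum over tokens of log P(t | c) for each class c."""
        vocab_size = len(self.vocab)
        scores = {}
        for c, n_docs in self.class_doc_counts.items():
            score = math.log(n_docs / self.total_docs)
            denom = self.total_terms[c] + self.alpha * vocab_size
            for t in tokens:
                if t in self.vocab:  # terms never seen in training are skipped
                    score += math.log((self.term_counts[c][t] + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy, made-up training data: each document is a list of tokens.
train_docs = [
    ["football", "match", "goal"],
    ["team", "goal", "win"],
    ["stock", "market", "price"],
    ["market", "bank", "stock"],
]
train_labels = ["sports", "sports", "finance", "finance"]

model = MultinomialNaiveBayes(alpha=1.0).fit(train_docs, train_labels)
print(model.predict(["goal", "match"]))   # prints "sports"
print(model.predict(["stock", "bank"]))   # prints "finance"
```

In a pipeline like the one the abstract describes, a feature selection step would presumably restrict the vocabulary before training; this sketch keeps every observed term for simplicity.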