
Classification System Based On The Theme Of Information Acquisition In The Pages

Posted on: 2007-05-01
Degree: Master
Type: Thesis
Country: China
Candidate: X R Wan
Full Text: PDF
GTID: 2208360185453699
Subject: Computer application technology
Abstract/Summary:
With information available in abundance on the Internet, the Internet is becoming a vital source of knowledge. However, the sheer volume of information makes it hard to find valuable content efficiently, so organizing the information on the Internet is very important. Our research focuses on automatic text categorization of Chinese Web documents during information collection by focused crawling of the Web.

First, this paper discusses the background of the task and introduces the primary technologies used for information collection in focused crawling. We then design an information collection model for focused crawling, comprising topic selection, initial URL selection, spider crawling, page parsing, Chinese text splitting, and text classification. Next, the main functions and algorithms are discussed together with their Java source code. The text categorization method used in this system, the Naive Bayes classifier, is then introduced. Finally, the Naive Bayes categorization method is evaluated experimentally.

The Naive Bayes model is a classifier based on probability statistics. Although it rests on the independence assumption, it is still a very efficient classifier. Experiments show that its categorization accuracy can reach 90%.
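As a rough illustration of the classification step, the following is a minimal sketch of a multinomial Naive Bayes text classifier in Java. It is not the thesis's own source code: the class name `NaiveBayesSketch`, its method names, the add-one (Laplace) smoothing, and the toy training data are all assumptions made for this example. It also assumes each document has already been turned into a list of words by a Chinese text splitter; English placeholder tokens stand in for those words here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: multinomial Naive Bayes with add-one (Laplace) smoothing.
// Names and structure are illustrative, not taken from the thesis.
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Record one labelled training document, given as pre-segmented tokens.
    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Return the label with the highest log posterior, or null if untrained.
    String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // log P(class): fraction of training documents with this label
            double score = Math.log((double) docsPerClass.get(label) / totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int classTokens = tokensPerClass.getOrDefault(label, 0);
            for (String token : tokens) {
                // Independence assumption: add log P(token | class) per token,
                // smoothed so unseen tokens do not zero out the product.
                int count = counts.getOrDefault(token, 0);
                score += Math.log((count + 1.0) / (classTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        // Toy data; placeholder tokens stand in for segmented Chinese words.
        nb.train("sports", Arrays.asList("football", "match", "team"));
        nb.train("finance", Arrays.asList("stock", "market", "bank"));
        System.out.println(nb.classify(Arrays.asList("football", "team")));
    }
}
```

Summing log probabilities instead of multiplying raw probabilities avoids numeric underflow on long documents, and the smoothing term keeps a single unseen word from ruling out a class entirely.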
Keywords/Search Tags: Focused crawling, Spider crawling, Chinese text splitter, Chinese text categorization, Naive Bayes Classifier