Research and Application of Web Page Classification and Filtering Based on Web Content Mining

Posted on: 2004-06-07; Degree: Master; Type: Thesis
Country: China; Candidate: X H Peng; Full Text: PDF
GTID: 2208360182968579; Subject: Computer application technology
Abstract/Summary:
The World Wide Web has become a vast global information service center, covering news, finance and economics, advertising, commerce, culture, education, and other information services. Faced with a Web this large and complex, many users find that their ability falls short of their ambitions. Helping users find the resources that interest them has become an urgent problem.

The author designed and developed the CSUIHWD system in line with the construction goals of the Central South University (CSU) campus information harbor. CSUIHWD gathers web pages from the web sites that users are interested in, filters these pages, classifies them automatically according to predefined topics, and then publishes the classified pages on the CSU web portal. In this way CSUIHWD supplies the CSU web portal with additional resources, makes fuller use of resources on the Internet, and lays a stable foundation for the further construction of a Chinese intelligent search engine.

This thesis first introduces the basic concepts, methods, and techniques of data mining and web mining: what data mining and web mining are, why mining is needed, and what advantages it offers. It also introduces web page classification and filtering techniques and the CSUIHWD system prototype.

It then studies the key techniques of web page content classification mining, whose core techniques are web page gathering, word segmentation, and classifier construction. Web page gathering is performed by CsuRobot, a robot-style program that gathers web page data automatically. CsuRobot is multithreaded and can run multiple gathering tasks at the same time. For segmentation, the author improved the reverse maximum matching segmentation algorithm and designed a reverse segmentation dictionary; the improved algorithm increases segmentation speed. A statistical method based on high-frequency words partly solves the problem of words that are not in the dictionary. For classification, the Naive Bayes classifier does not take the semi-structured nature of web pages into account and treats all words equally. This thesis improves the Naive Bayes classifier by giving greater weight to words that make an additional contribution. Experiments show that the improvement is helpful.

Finally, the thesis summarizes the work and points out directions for further research.
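
The abstract names reverse maximum matching as the basis of the segmentation step but does not give the algorithm or the author's improvements. As a rough illustration only, the textbook form of the technique can be sketched in Python as follows. The function name `reverse_max_match`, the `max_len` parameter, and the toy dictionary are assumptions made for this example and are not taken from the thesis.

```python
def reverse_max_match(text, dictionary, max_len=4):
    """Segment text by scanning from the end and taking the longest dictionary word.

    This is the plain textbook algorithm, not the improved version described
    in the thesis. `max_len` is an assumed upper bound on word length.
    """
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shorter ones.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback,
            # so characters missing from the dictionary never stall the scan.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    # Words were collected right to left, so restore reading order.
    words.reverse()
    return words


# Toy dictionary, invented for this example.
toy_dictionary = {"研究", "研究生", "生命", "命", "起源"}
print(reverse_max_match("研究生命起源", toy_dictionary))
```

On this input the backward scan yields 研究 / 生命 / 起源, whereas a forward scan with the same dictionary would first take 研究生. This is the usual argument for scanning Chinese text from the end.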
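
The abstract says the improved Naive Bayes classifier gives extra weight to words that contribute more, but it does not state which words or which weights. One plausible reading, sketched below, is a multinomial Naive Bayes in which a word's count is scaled by the page field it appears in. The class `WeightedNaiveBayes`, the field names, and the values in `FIELD_WEIGHTS` are all hypothetical and are not the author's implementation.

```python
import math
from collections import defaultdict

# Hypothetical field weights; the abstract does not give the values used.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


class WeightedNaiveBayes:
    """Multinomial Naive Bayes whose term counts are scaled by page field."""

    def __init__(self, field_weights=None):
        self.field_weights = FIELD_WEIGHTS if field_weights is None else field_weights
        self.term_weight = defaultdict(lambda: defaultdict(float))  # label -> term -> weight
        self.total_weight = defaultdict(float)                      # label -> summed weight
        self.doc_count = defaultdict(int)                           # label -> training pages
        self.vocabulary = set()

    def _weighted_terms(self, page):
        # `page` maps a field name to its list of segmented words.
        for field, terms in page.items():
            weight = self.field_weights.get(field, 1.0)
            for term in terms:
                yield term, weight

    def train(self, page, label):
        self.doc_count[label] += 1
        for term, weight in self._weighted_terms(page):
            self.term_weight[label][term] += weight
            self.total_weight[label] += weight
            self.vocabulary.add(term)

    def classify(self, page):
        total_docs = sum(self.doc_count.values())
        # Add-one smoothing; the extra 1 reserves mass for unseen terms.
        vocab_size = len(self.vocabulary) + 1
        best_label, best_score = None, -math.inf
        for label, count in self.doc_count.items():
            score = math.log(count / total_docs)
            denominator = self.total_weight[label] + vocab_size
            for term, weight in self._weighted_terms(page):
                seen = self.term_weight[label].get(term, 0.0)
                score += weight * math.log((seen + 1.0) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
sports_page = {"title": ["football", "news"], "body": ["match", "goal", "team"]}
finance_page = {"title": ["stock", "news"], "body": ["market", "bank", "price"]}
query_page = {"title": ["football"], "body": ["market", "price"]}

weighted = WeightedNaiveBayes()
unweighted = WeightedNaiveBayes(field_weights={})  # every field counts as 1.0
for model in (weighted, unweighted):
    model.train(sports_page, "sports")
    model.train(finance_page, "finance")

print(weighted.classify(query_page), unweighted.classify(query_page))
```

On this toy data the weighted model labels the query page "sports" because of its title word, while the unweighted model labels it "finance" because of the two body words. This shows how field weighting can change a decision. It is not evidence about the thesis results.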
Keywords/Search Tags:data mining, web mining, segmentation, class, robot