The Research And Implementation Of Web Page Classification In Enterprise Search Engine

Posted on: 2009-05-08
Degree: Master
Type: Thesis
Country: China
Candidate: L Z Liu
GTID: 2178360308978306
Subject: Computer application technology
Abstract/Summary:
Over the past decade we have witnessed explosive growth of the Internet. Its development has let us break through the limits of locality and access online documents from all over the world. Along with the development of the Internet and the growth of enterprises, corporations hold more and more web pages. How to organize and manage this information efficiently has become an urgent problem. Tools such as search engines can indeed help users locate information on the Internet, but they are limited when it comes to organizing and managing the web pages and electronic documents inside a corporation. It is therefore necessary to design and implement a tool that can organize and classify web pages: a Web Page Classifier.

In this thesis, I study in depth the two kinds of technology involved in web page classification: feature selection and classification algorithms. First, I propose a feature selection algorithm based on part of speech, together with the SWT term weighting method. This method filters out some function ("empty") words, reduces dimensionality, and thereby improves classification efficiency; for term weighting, it uses the proposed SWT weighting method instead of TF-IDF. Second, I propose an improved KNN algorithm, which improves recall and classification efficiency. Third, I propose an algorithm that combines the vector space model with the structure of web pages, targeting the particular characteristics of web pages. Finally, I design and implement the Web Page Classifier based on these two methods.

The thesis first introduces the enterprise search engine. It then introduces web page classification technology, including web page representation models, common feature selection methods, and common classification methods (KNN, SVM, Rocchio, Naive Bayes). Finally, it studies the new feature selection method and the new classification methods.

Testing shows that the two methods improve the recall and precision of web page classification as well as classification efficiency, and that they can satisfy the automatic classification demands of an enterprise search engine.
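
The abstract gives no formula for the SWT weighting or for the improved KNN, so neither can be reproduced here. As a rough point of reference only, the baseline that such methods build on can be sketched as a plain KNN classifier over a vector space model with cosine similarity. This sketch makes several assumptions that are not from the thesis: it uses ordinary TF-IDF weights (the scheme the thesis replaces), a hypothetical `TITLE_WEIGHT` boost as one simple way to let page structure influence the vector, and similarity-weighted voting among the k nearest pages. All function names are illustrative.

```python
import math
import re
from collections import Counter

# Hypothetical boost for terms in the page title; not a value from the thesis.
TITLE_WEIGHT = 3


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(title, body):
    """Count terms, counting each title occurrence TITLE_WEIGHT times."""
    counts = Counter(tokenize(body))
    for term in tokenize(title):
        counts[term] += TITLE_WEIGHT
    return counts


def tfidf_vectors(count_list):
    """Turn per-page term counts into sparse TF-IDF vectors (plain TF-IDF, not SWT)."""
    n_docs = len(count_list)
    doc_freq = Counter()
    for counts in count_list:
        doc_freq.update(counts.keys())
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in counts.items()} for counts in count_list]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query_vec, train_vecs, train_labels, k=3):
    """Return the label with the largest summed similarity among the k nearest pages."""
    scored = sorted(
        ((cosine(query_vec, vec), label) for vec, label in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


def classify_page(title, body, training_pages, k=3):
    """Classify one page given training_pages as (title, body, label) tuples."""
    counts = [term_counts(t, b) for t, b, _ in training_pages]
    train_vecs, idf = tfidf_vectors(counts)
    labels = [label for _, _, label in training_pages]
    # Terms never seen in training get weight 0 and do not affect the similarity.
    query_vec = {t: c * idf.get(t, 0.0) for t, c in term_counts(title, body).items()}
    return knn_classify(query_vec, train_vecs, labels, k)
```

Calling `classify_page(title, body, training_pages)` with a handful of labeled `(title, body, label)` tuples returns the predicted label for the new page. A part-of-speech filter of the kind the abstract describes would slot in at `tokenize`, and a different weighting scheme would replace `tfidf_vectors`.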
Keywords/Search Tags: Enterprise Search Engine, web page classification, feature selection, KNN, vector space model