Web Page Categorization And Its Application To Search Engine

Posted on: 2009-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: X K Xu
GTID: 2178360242994196
Subject: Computer software and theory
Abstract/Summary:
Automatic web page categorization is a key technique for processing and organizing large-scale web text resources, and an essential approach to arranging web resources efficiently and appropriately. It is also a key technology in topic search engines, personalized information retrieval, and navigation services in search engines. This research is therefore important for developing novel, intelligent search engines.

This thesis mainly discusses web page categorization techniques and their application to search engines. Our work can be summarized as follows.

1. Several issues involved in automatic web page categorization are discussed, such as feature selection, feature extraction, classification methods, and web page processing. An automatic web page categorization subsystem is designed and implemented. It integrates many techniques, which makes its architecture flexible and easy to extend.

2. A categorization method based on ensemble learning and category indicators is proposed. An AdaBoost.MH based mechanism adaptively computes the category indicator in every iteration. The individual category indicators are then combined with weights to obtain an approximation to the expected category indicator. From the combined category indicators we obtain a classifier that has low computational cost, can be updated flexibly with new features, and is suitable for real-time applications.

3. Ensemble learning and the DragPushing strategy are combined to reduce the model bias of the Centroid Classifier. An AdaBoost.MR based mechanism, which employs the Centroid Classifier as its individual classifier, is developed to adaptively improve the classifier model by focusing, in every iteration, on examples with high weight (which thus tend to be labeled incorrectly).

4. The basic idea of feature extraction based on word clustering is discussed. A modified Tree-Structured Growing Self-Organizing Map (TGSOM) is used for word clustering, and a term weighting formula is employed that takes into account the distinction between clustered-word features and plain word features.

5. Several key issues of topic crawlers are discussed and corresponding new approaches are proposed. A topic crawler subsystem is designed and implemented. This subsystem employs topic-sensitive HITS to predict the priority of web pages to be fetched.

6. Finally, a topic search engine prototype is designed and implemented, and the application of web page categorization to the system is discussed.
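
To make the Centroid Classifier mentioned in the abstract more concrete, the following is a minimal Python sketch of a generic centroid classifier over sparse term-weight vectors. It is not the thesis's implementation: the function names, the cosine similarity measure, and the optional per-example `weights` argument (standing in for the example weights a boosting round might supply) are all assumptions made for illustration. The AdaBoost.MR loop and the DragPushing updates are not shown.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse term->weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else dict(vec)


def train_centroids(docs, labels, weights=None):
    """Build one centroid per class as a weighted sum of unit document vectors.

    `weights` is a hypothetical hook for per-example weights; it defaults
    to uniform weighting, which gives the plain centroid classifier.
    """
    if weights is None:
        weights = [1.0] * len(docs)
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label, w in zip(docs, labels, weights):
        for term, value in normalize(doc).items():
            sums[label][term] += w * value
    return {label: normalize(vec) for label, vec in sums.items()}


def classify(centroids, doc):
    """Return the class whose centroid has the highest cosine similarity."""
    unit_doc = normalize(doc)

    def cosine(centroid):
        return sum(v * centroid.get(t, 0.0) for t, v in unit_doc.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy data, invented for this example only.
docs = [
    {"goal": 2, "match": 1},
    {"stock": 3, "market": 1},
    {"match": 1, "team": 2},
]
labels = ["sports", "finance", "sports"]

centroids = train_centroids(docs, labels)
print(classify(centroids, {"team": 1, "goal": 1}))
print(classify(centroids, {"market": 2}))
```

Training and classification are both linear in the number of non-zero terms, which is one reason centroid-style classifiers are often considered for low-cost, real-time settings.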
Keywords/Search Tags: Web Page Categorization, Ensemble Learning, Search Engine, Feature Extraction, Topic Crawler