
Research And Application On Focused Crawling Search Engine Based On The Lucene

Posted on: 2012-01-06
Degree: Master
Type: Thesis
Country: China
Candidate: L J Zhang
Full Text: PDF
GTID: 2218330338973011
Subject: Computer application technology
Abstract/Summary:
A vertical search engine, also called a topic-specific search engine, is a search engine (SE) built on focused crawling. It differs from a general-purpose SE mainly in that the resources and services it provides relate to a particular topic, which makes it more specialized and, to some extent, more personalized. Because a topic-specific search engine collects only web resources within its topic field, its classification of that topic can be more accurate and comprehensive, its resources more complete and in-depth, and its updates more timely and frequent. The rapid growth of the Web has put search engines in a dilemma: general-purpose search engines can no longer satisfy users' requests, so topic-specific search engines are a more practical and feasible option.

This thesis mainly studies focused crawling, the core of a vertical search engine. The key technologies of focused crawling are web page classification, computing the priorities of the URLs to be crawled, and the focused crawling algorithm itself. The thesis introduces a web page classification method based on data mining and proposes a method for computing the priorities of TCURLs (to-be-crawled URLs) based on page segmentation. Using page segmentation, the topic relevance of TCURLs is computed block by block, and the priority of each TCURL is computed from the whole context of the block that contains it. This addresses the problem that the anchor text of a single URL cannot provide enough information to predict its topic relevance; it also helps filter out noise blocks and improves crawling efficiency.

Finally, a vertical search engine system for computer product information is designed and implemented. The topic crawling module is built on a binary SVM classifier that judges the topic relevance of the current web page, which improves the accuracy of the relevance judgement. The information processing module and the information retrieval module are built on Lucene, which greatly simplifies the construction of the whole search engine.
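
The block-based TCURL priority idea can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract gives no formulas, so the code below assumes that a page has already been segmented into blocks, swaps in a simple bag-of-words cosine similarity as a stand-in for the topic relevance measure, and combines anchor-text and block-context scores with a hypothetical weight. The names `prioritize`, `anchor_weight`, and `noise_threshold`, and the block dictionary layout, are all assumptions made for illustration.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relevance(text, topic_vec):
    """Stand-in topic relevance: similarity of the text to the topic terms."""
    return cosine(Counter(tokenize(text)), topic_vec)


def prioritize(blocks, topic_terms, anchor_weight=0.4, noise_threshold=0.05):
    """Rank to-be-crawled URLs using the context of their enclosing block.

    blocks: list of {"text": str, "links": [(url, anchor_text), ...]}
            (assumed output of some page segmentation step).
    Returns a list of (url, priority), highest priority first.
    """
    topic_vec = Counter(topic_terms)
    frontier = []
    for block in blocks:
        block_score = relevance(block["text"], topic_vec)
        # Treat blocks with almost no topical content as noise and skip them.
        if block_score < noise_threshold:
            continue
        for url, anchor in block["links"]:
            anchor_score = relevance(anchor, topic_vec)
            # Combine the weak anchor-text signal with the block context.
            score = anchor_weight * anchor_score + (1 - anchor_weight) * block_score
            # heapq is a min-heap, so push the negated score.
            heapq.heappush(frontier, (-score, url))
    ranked = []
    while frontier:
        neg_score, url = heapq.heappop(frontier)
        ranked.append((url, -neg_score))
    return ranked


if __name__ == "__main__":
    blocks = [
        {"text": "laptop notebook computer prices and reviews",
         "links": [("http://example.com/a", "more"),
                   ("http://example.com/b", "laptop deals")]},
        {"text": "copyright privacy terms contact",
         "links": [("http://example.com/c", "privacy")]},
    ]
    for url, priority in prioritize(blocks, ["laptop", "computer", "notebook"]):
        print(f"{priority:.3f}  {url}")
```

In this sketch the link whose anchor text is just "more" still receives a non-zero priority because its surrounding block is on topic, while the link in the footer-like block is dropped entirely. A real system along the lines of the abstract would replace the cosine stand-in with a trained classifier score.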
Keywords/Search Tags: Focused Crawling, Search Engine, Lucene, Data Mining