
Research On Topic Web Page Crawling Strategy For Vertical Search Engine

Posted on: 2013-06-03 | Degree: Master | Type: Thesis
Country: China | Candidate: Z J Xie | Full Text: PDF
GTID: 2248330395477141 | Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer network technology, the World Wide Web has become the carrier of a vast amount of information, and using that information efficiently is a major challenge. Since their emergence, search engines have developed rapidly as information retrieval tools and have become the user's guide and entry point to the World Wide Web. However, traditional search engines face many serious challenges, including the size of the web index, the speed of meeting individual needs, and inaccurate query results. To address these problems, the vertical search engine has become one of the hotspots of the new generation of search engines. Topic-specific web page crawling is the core technology for building a vertical search engine; its goal is to crawl as many web pages related to a specific topic as possible while avoiding, as far as possible, the download of irrelevant pages.

This thesis takes the topic web page crawling strategy of the vertical search engine as its research subject, with the aim of improving the rate and efficiency of topic web page crawling. It analyzes in detail the advantages and disadvantages of existing topic web page crawling strategies, and examines the implementation, advantages, and disadvantages of the crawling strategy based on the hidden Markov model. It then proposes an improved topic web page crawling strategy. First, the calculation of term weights in page preprocessing is improved by assigning different weights to terms in different locations, so that the term weights better represent the actual content of the web page. Second, to improve crawling accuracy, the calculation of the priority value of each URL in the crawl queue is improved by taking into account both the hidden Markov model and the relevant web content.

The improved method is compared with the hidden Markov model method and the Best-First method to verify the performance and efficiency of the improved algorithm. The experimental results show that the improved method can crawl a large number of high-quality pages related to the given topic, and that it performs better than the hidden Markov model and Best-First methods.
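The two improvements described in the abstract (location-dependent term weights and a URL priority that blends a model score with content relevance) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything specific here is an assumption made for illustration: the location weights, the linear blend with parameter `alpha`, the helper names, and the use of cosine similarity over a vector space model. The hidden Markov model itself is not implemented; `model_score` stands in for whatever score it would produce.

```python
import heapq
import math
from collections import Counter

# Hypothetical location weights; the abstract does not state the actual values.
LOCATION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def weighted_term_vector(fields):
    """Build a term vector where each occurrence counts by its location weight.

    `fields` maps a location name (e.g. "title") to the text found there.
    """
    vector = Counter()
    for location, text in fields.items():
        weight = LOCATION_WEIGHTS.get(location, 1.0)
        for term in text.lower().split():
            vector[term] += weight
    return vector


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def url_priority(model_score, content_score, alpha=0.5):
    """Assumed linear blend of a model-based score and content relevance."""
    return alpha * model_score + (1.0 - alpha) * content_score


class Frontier:
    """Crawl queue that always returns the highest-priority unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated priority.
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        negated, url = heapq.heappop(self._heap)
        return url, -negated
```

In a crawler loop, each fetched page would be turned into a weighted term vector, compared with the topic vector, and its outgoing links pushed onto the frontier with the blended priority. A Best-First baseline corresponds to using the content score alone (`alpha=0` in this sketch).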
Keywords/Search Tags: Topic Web Page Crawling, Hidden Markov Model, Vector Space Model, Topic Correlativity, Vertical Search Engine