A Study of Web Spider Search Strategies for Topic-Driven Search Engines

Posted on: 2007-03-17
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2208360182966668
Subject: Computer application technology
Abstract/Summary:
As Web information continues to grow explosively, traditional search engines cannot keep up with the increasingly demanding and varied search requirements of different users. Topic-driven search engines have recently been proposed to provide a new kind of search service: one that is better classified, holds deeper and more focused data, and is updated in a timely manner.

The search strategy of the web spider in a topic-driven search engine architecture is currently an active research topic. The dynamic, complex, and semi-structured nature of the Web requires the spider to gather data efficiently so that the collected information stays current and valid.

Based on a study of search strategies in topic-driven search engines and of topic relevance judging algorithms, this thesis presents a structural design model for a topic-oriented web spider and analyzes it in detail.

As the key component of the search strategy in a topic-oriented web spider, the topic relevance judging algorithms keep the spider's crawl focused. For judging the relevance between a URL and the topic, a novel URL pruning algorithm, the EPR algorithm, is presented, based on an analysis of anchor text and other properties. The widely used vector space model is used to classify HTML pages by topic.

A topic-driven search engine is expected to provide up-to-date Web information, so incremental crawling is also an important part of the search strategy of a topic-oriented web spider. This thesis presents a novel incremental crawling algorithm based on index pages that finds newly added web pages quickly.

The experimental results indicate that the work is effective, in particular the EPR algorithm and the index-page-based incremental crawling algorithm, both of which are novel and of practical value in real application environments.
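
The abstract names the vector space model as the page classification technique but gives no details. As a rough illustration only, the sketch below scores a page against a topic description using raw term-frequency vectors and cosine similarity. The function names (term_vector, cosine_similarity, is_on_topic), the tokenizer, and the 0.3 threshold are assumptions made for this example and are not taken from the thesis, which may use different weighting (for example TF-IDF) and a different decision rule.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a raw term-frequency vector from lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_text, threshold=0.3):
    """Hypothetical decision rule: a page is relevant if similarity >= threshold."""
    score = cosine_similarity(term_vector(page_text), term_vector(topic_text))
    return score >= threshold
```

In a focused crawler, a score of this kind would typically decide whether a fetched page is kept and whether its outgoing links are queued; how the thesis combines it with the EPR URL pruning step is not stated in the abstract.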
Keywords/Search Tags: Search Engine, Web Spider, Search Strategy, Topic Distillation, Index Page, Incremental Crawling