With the development of search engines, using the Internet has become increasingly convenient: through a search engine, we can easily gather a large amount of relevant information. The Internet contains massive amounts of data, but modern search engines index only part of these resources. The crawler, one component of a search engine, plays a vital role in it. How to crawl more of the pages that people are interested in with limited resources has become a popular topic in both industry and academic research. The focused crawler has emerged under these circumstances.

This paper focuses on the research and implementation of a focused crawler based on topic relevance prediction of anchor text and topic propagation in link structures. Because anchor text is short in most cases, existing focused crawlers that use anchor text introduce the concept of link context, which contains the text surrounding the given anchor text. However, the link context may attach relevant context to an irrelevant anchor text. We regard topic relevance prediction for a given anchor text as an encoding process in the noisy-channel coding theorem, and accordingly propose a prediction method based on statistical machine translation. Using only web content such as anchor text is likely to discard pages that are not themselves relevant to the topic but contain a large number of relevant links. Analyzing the web link structure is one approach to alleviating this problem. Building on previous work, we propose a focused crawling algorithm based on topic propagation. The topic relevance prediction of anchor text is also integrated into this framework.

Finally, we implement a focused crawler prototype and compare our proposed algorithm with several other focused crawling algorithms. Experimental results show that our proposed algorithm achieves some improvement.
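
To make the general idea concrete, the following is a minimal sketch of a best-first crawl frontier in which each link's priority combines an anchor-text relevance score with a share of the score propagated from the page that contains the link. It is an illustration under stated assumptions, not the algorithm proposed in this paper: the word-overlap scorer `anchor_relevance` is a toy stand-in for the translation-based predictor, and the names `crawl_order`, `decay`, and `EXAMPLE_GRAPH` are hypothetical.

```python
import heapq

def anchor_relevance(anchor_text, topic_terms):
    """Toy stand-in scorer: fraction of anchor words that are topic terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)

def crawl_order(graph, seeds, topic_terms, decay=0.5, limit=10):
    """Return the order in which pages would be visited.

    graph maps a URL to a list of (anchor_text, target_url) pairs and
    stands in for fetching and parsing real pages (an assumption).
    """
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    best = {url: 1.0 for url in seeds}
    seen = set()
    visited = []
    while frontier and len(visited) < limit:
        neg_score, url = heapq.heappop(frontier)
        if url in seen:
            continue  # stale entry left over from an earlier, lower score
        seen.add(url)
        visited.append(url)
        page_score = -neg_score
        for anchor, target in graph.get(url, []):
            if target in seen:
                continue
            # Link score = anchor relevance + decayed score of the parent page.
            score = anchor_relevance(anchor, topic_terms) + decay * page_score
            if score > best.get(target, float("-inf")):
                best[target] = score
                heapq.heappush(frontier, (-score, target))
    return visited

# Hypothetical toy link graph: "b" is reached through an off-topic anchor
# but links onward to an on-topic page "c".
EXAMPLE_GRAPH = {
    "seed": [("python crawler tutorial", "a"), ("cat pictures", "b")],
    "a": [],
    "b": [("focused crawler", "c")],
}
```

In this sketch, the propagated term keeps a page with an off-topic anchor in the frontier at a reduced priority instead of discarding it, so relevant pages linked from it can still be reached.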