| With the rapid development of Internet technology, network resources are increasing exponentially, and traditional search engines struggle to meet users' needs. How to obtain the required information quickly and accurately has become a research hotspot in recent years, and vertical search engines have emerged in response. As the core of a vertical search engine, a topic crawler retrieves only pages related to the topic and ignores irrelevant links, meeting the demands of users from different fields and backgrounds with professional, accurate, and in-depth results. Most traditional topic crawlers evaluate the priority of candidate links based solely on web content and ignore the link structure among pages on the same topic. In addition, when such a crawler encounters irrelevant pages, it cannot pass through the tunnel to find more relevant pages, and therefore discards many potentially valuable links. This paper analyzes the necessity of research on topic crawlers and focuses on search strategies and tunnel-crossing techniques in the crawling process. The main research work is as follows. Firstly, based on current domestic and international research progress on topic crawlers, this paper introduces the basic principles and key technologies of the different stages. Secondly, this paper examines the advantages and disadvantages of different search strategies and, on that basis, proposes a new topic crawling strategy that combines content and link value. This strategy divides the crawling process into an early crawling stage and a crawling stage. In the early crawling stage, it uses a content-based heuristic search strategy; in the crawling stage, it uses a search strategy based on comprehensive value, incorporating the HITS algorithm, to evaluate the priority of candidate links. 
At the same time, by considering both text content and link structure, the crawler selects links that are not only relevant to the topic but also valuable in the field. Thirdly, unlike previous practice, a distance measure formula is used to guide the crawler through tunnels. Under this formula, the lower the relevance, the faster the distance value converges. A path is abandoned completely only when its distance exceeds the threshold value, so that relevant pages are not missed. Finally, this paper proposes a measure for evaluating web page value, the average of information, which assesses the value of relevant pages. In the experimental part, precision and average of information are used as the main evaluation indices. The experimental results show that the proposed topic crawler achieves higher precision and average of information and is more effective in improving crawling quality, which confirms its advantages. |
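
The tunnel-crossing rule summarized above can be illustrated with a minimal Python sketch. The abstract does not give the actual distance measure formula, so everything here is an assumption made for illustration only: the update rule (add `1 - relevance` for each irrelevant page and reset to zero on a relevant one), the names `next_distance`, `should_abandon`, and `follow_path`, and the values of `RELEVANCE_CUTOFF` and `DISTANCE_THRESHOLD` are all hypothetical.

```python
# Hypothetical constants; the paper's actual values are not stated in the abstract.
RELEVANCE_CUTOFF = 0.5    # assumed: pages scoring at or above this count as relevant
DISTANCE_THRESHOLD = 2.0  # assumed: a path is dropped once its distance exceeds this


def next_distance(parent_distance: float, relevance: float) -> float:
    """Assumed update rule, not the paper's formula.

    A relevant page resets the distance to zero. An irrelevant page adds
    (1 - relevance), so less relevant pages push the distance toward the
    threshold faster.
    """
    if relevance >= RELEVANCE_CUTOFF:
        return 0.0
    return parent_distance + (1.0 - relevance)


def should_abandon(distance: float) -> bool:
    """A path is abandoned only when its distance exceeds the threshold."""
    return distance > DISTANCE_THRESHOLD


def follow_path(relevances: list[float]) -> int:
    """Walk a chain of pages and return how many were visited before abandonment."""
    distance = 0.0
    visited = 0
    for relevance in relevances:
        distance = next_distance(distance, relevance)
        visited += 1
        if should_abandon(distance):
            break
    return visited


if __name__ == "__main__":
    # Mildly irrelevant pages: the crawler gets through the tunnel to the relevant page.
    print(follow_path([0.4, 0.4, 0.4, 0.9]))
    # Strongly irrelevant pages: the path is abandoned before the relevant page.
    print(follow_path([0.1, 0.1, 0.1, 0.9]))
```

Under these assumptions, a chain of mildly irrelevant pages stays below the threshold long enough to reach a relevant page, while a chain of strongly irrelevant pages is cut off early.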