
Research On The Search Strategy Of Web Spider Based On Specific Topic

Posted on: 2010-04-01 | Degree: Master | Type: Thesis
Country: China | Candidate: C C Chen
GTID: 2178360278972375 | Subject: Computer software and theory
Abstract/Summary:
With the growth of diversified Web information, traditional general-purpose search engines can no longer satisfy users' demand for personalized information retrieval. In recent years, topic-oriented search engines have emerged to provide more comprehensive and accurate data and to lower the time complexity of Internet search services. For topic search engines, the question of which search strategy a Web spider should use to visit the Web efficiently has been one of the hot issues in search engine research. The dynamic, heterogeneous, and complex nature of the network demands that a Web spider crawl Web link information efficiently. First, drawing on domestic and international research progress and on an analysis and comparison of the advantages and disadvantages of existing Web spider search strategies, this paper discusses the accuracy and importance of predicting the topic value of Web documents. Second, as the core of a topic-specific search strategy for a Web spider, this paper describes in detail the representation of topic information and the algorithm for computing the relevance between a topic and a Web page.
For judging page relevance, the vector space model, currently the more commonly used approach, is adopted. Third, this paper presents an enhanced HITS algorithm, Topic-HITS, which incorporates topic characteristics into the HITS algorithm. It analyzes the link structure of Web pages at the finer granularity of topics, introduces a topic-based authority vector for each page, and further discusses the calculation of authority and hub vectors at the site level. Finally, to improve the adaptability of the Web spider and to address the problem of relying on a single evaluation criterion, this paper presents an integrated crawling strategy that changes according to the stage of the search. The improved HITS algorithm and the integrated strategy are combined to implement a search engine prototype based on a variety of search strategies. The experimental results show that the system can crawl topic-related pages accurately and automatically, while also saving network bandwidth and offering good stability.
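As an illustration only, the two building blocks named in the abstract can be sketched as follows: a vector space relevance score (cosine similarity over raw term frequencies) and the standard HITS iteration. This is not the thesis's Topic-HITS or its integrated strategy, whose details the abstract does not give. The function names `cosine_relevance` and `hits`, the term-frequency weighting, and the fixed iteration count are assumptions made for this sketch.

```python
import math
from collections import Counter


def cosine_relevance(topic_terms, page_terms):
    """Cosine similarity between raw term-frequency vectors.

    Assumption: plain term counts are used as weights; a real system
    might use TF-IDF or another weighting scheme instead.
    """
    topic_vec, page_vec = Counter(topic_terms), Counter(page_terms)
    dot = sum(topic_vec[w] * page_vec[w] for w in topic_vec)
    norm = (math.sqrt(sum(v * v for v in topic_vec.values()))
            * math.sqrt(sum(v * v for v in page_vec.values())))
    return dot / norm if norm else 0.0


def hits(links, iterations=50):
    """Standard HITS (not Topic-HITS).

    links maps each page to the list of pages it links to.
    Returns (authority, hub) dictionaries with L2-normalized scores.
    """
    pages = set(links) | {dst for targets in links.values() for dst in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub of a page: sum of authority scores of pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```

A focused crawler could, for example, use `cosine_relevance` to decide whether a fetched page is on topic and the authority scores from `hits` to prioritize which outgoing links to follow next; how the thesis actually combines the two is not stated in the abstract.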
Keywords/Search Tags:topic search engine, crawling strategy, crawling algorithms, content analysis, link analysis