Searching Strategy Research For Intelligent Web Crawler

Posted on: 2005-01-04
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
Full Text: PDF
GTID: 2168360155462527
Subject: Computer application technology
Abstract/Summary:
In recent years, a major focus of search engine research has been how to retrieve more of the Web pages that match users' interests. In this paper, we study searching strategies for topic-specific Web crawlers, aiming mainly to increase search efficiency by improving the crawler's self-adaptability.

First, we review current research on Web crawlers. After comparing the advantages and disadvantages of several existing searching strategies, we conclude that the key to increasing search efficiency lies in improving the crawler's self-adaptability and the accuracy with which it predicts the importance of links.

To improve the crawler's self-adaptability, we propose an algorithm based on combined link rewards, which combines a link's immediate reward and its future reward to evaluate the link's importance. We also use changes in the rewards to estimate how relevant the candidate page set is to the topic. Based on this estimate, the crawler dynamically adjusts the balance between the two rewards to obtain the searching strategy best suited to the current search state. Our experiments show that this algorithm performs better than some traditional algorithms.

To predict link values more accurately and to resolve the topic-drift problem in traditional PageRank, we propose an improved PageRank algorithm based on topical segments. This algorithm segments a Web page into blocks and passes the page's PageRank to the outlinks in each block in proportion to the block's relevance to the given topic. It also treats visited outlinks as feedback for adjusting the block's relevance.
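
The abstract does not give formulas for either idea, so the following is only a minimal sketch under stated assumptions. It assumes the combined reward is a weighted sum of the immediate and future reward, and that a block's share of PageRank is split evenly among the outlinks in that block. The names `combined_reward`, `distribute_pagerank`, and `weight` are hypothetical, and the feedback step that adjusts block relevance is not modeled.

```python
from typing import Dict, List


def combined_reward(immediate: float, future: float, weight: float) -> float:
    """Blend a link's immediate and future reward.

    `weight` in [0, 1] is a hypothetical knob: a crawler could raise it
    when the candidate pages look on-topic and lower it otherwise.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * immediate + (1.0 - weight) * future


def distribute_pagerank(page_rank: float,
                        blocks: Dict[str, List[str]],
                        relevance: Dict[str, float]) -> Dict[str, float]:
    """Pass `page_rank` to outlinks, block by block.

    Each block receives a share proportional to its topic relevance;
    the even split among the links inside a block is an assumption.
    """
    # Only blocks that actually contain outlinks can pass rank on.
    active = {name: links for name, links in blocks.items() if links}
    total = sum(relevance.get(name, 0.0) for name in active)
    shares: Dict[str, float] = {}
    if total == 0:
        return shares  # no relevant block, so nothing is passed on
    for name, links in active.items():
        block_share = page_rank * relevance.get(name, 0.0) / total
        for url in links:
            shares[url] = shares.get(url, 0.0) + block_share / len(links)
    return shares
```

With these assumptions, a page whose main block is four times as relevant as its navigation block passes four times as much rank through the main block, which is one way to limit drift toward off-topic links.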
Experiments with a Web crawler show that the improved PageRank algorithm performs better.

In addition, this paper proposes a Web searching strategy based on a genetic algorithm, which introduces the genetic algorithm into Web crawling. It treats the various combinations of Web information (parent pages, sibling pages, link text, and URL tokens) as different gene sequences. Through genetic operations such as crossover and mutation, the way this information is combined changes dynamically with the actual Web resources, resulting in the best searching strategy. Our experiments show that the new...
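
The genetic search over information sources described above could be sketched as follows. This is an illustration only: the gene encoding (one weight per source), the single-point crossover, the mutation rate, the elitist selection, and the `fitness` callback are all assumptions rather than details taken from the thesis.

```python
import random
from typing import Callable, List

# The four information sources named in the abstract; one weight each.
SOURCES = ["parent_page", "sibling_pages", "link_text", "url_tokens"]
Genome = List[float]


def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    """Single-point crossover: head of `a` joined to tail of `b`."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]


def mutate(genome: Genome, rng: random.Random, rate: float = 0.1) -> Genome:
    """Replace each weight with a fresh random one with probability `rate`."""
    return [rng.random() if rng.random() < rate else w for w in genome]


def evolve(population: List[Genome],
           fitness: Callable[[Genome], float],
           rng: random.Random,
           generations: int = 20) -> Genome:
    """Keep the fitter half each generation and refill with offspring."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]
        children = []
        while len(parents) + len(children) < len(population):
            mother, father = rng.sample(parents, 2)
            children.append(mutate(crossover(mother, father, rng), rng))
        population = parents + children
    return max(population, key=fitness)
```

In a crawler, `fitness` would presumably measure how many on-topic pages a weighting retrieves over a recent window. Because the fitter half survives unchanged, the best weighting found so far is never lost between generations.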
Keywords/Search Tags:Web spider, Specific search engine, Searching strategy, Pagerank, Genetic algorithm