
The Research Of Topical Crawler Search Strategy In Web Page

Posted on: 2010-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: H Yuan
Full Text: PDF
GTID: 2178360278969437
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, general-purpose web crawlers are increasingly unable to extract information from web pages effectively as they crawl this vast network. Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines. The context available to such crawlers can guide the navigation of links, with the goal of efficiently locating highly relevant target pages. This thesis develops a framework to evaluate topical crawling algorithms fairly under a number of performance metrics. It finds that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduces an evolutionary crawler that surpasses the performance of the best non-adaptive crawler after sufficiently long crawls. The thesis also analyzes the computational complexity of the various crawlers and discusses how performance and complexity scale with available resources.

The thesis then proposes a new approach, based on a Layered Markov Model, that distinguishes transitions among Web sites from transitions among Web documents. Based on this model, it proposes two approaches for computing the ranking of Web sites, a centralized one and a decentralized one. Both produce a well-defined ranking for a given Web graph, and the thesis formally proves that the two approaches are equivalent. This provides a theoretical foundation for decomposing link-based rank computation and makes the computation for a Web-scale graph feasible in a decentralized fashion, as required by Web search engines with a peer-to-peer architecture. Furthermore, personalized rankings can be produced by adapting the computation at both the local layer and the global layer. The open-source components Lucene and Heritrix are used to build a topical search engine on which the algorithm is tested.
The results show that the ranking generated by the model is qualitatively comparable to, or even better than, the ranking produced by PageRank. The thesis also presents a text categorization method that determines the subject of a web page by analyzing its title. This method reduces the workload of other text categorization algorithms, and the experimental results show that it effectively improves computational efficiency.
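The abstract does not give the details of the layered ranking computation, so the following Python sketch is only one plausible reading of the decentralized variant, not the thesis's actual algorithm. It assumes that a site-level rank is computed from inter-site links, that a local rank is computed inside each site from intra-site links, and that a document's global score is the product of the two. The function names `pagerank` and `layered_rank`, the damping factor of 0.85, and the toy link data are all assumptions made for illustration.

```python
from collections import defaultdict


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over weighted edges {(src, dst): weight}."""
    nodes = list(nodes)
    n = len(nodes)
    out_weight = defaultdict(float)
    for (src, _dst), w in edges.items():
        out_weight[src] += w
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += damping * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank


def layered_rank(site_of, links, damping=0.85):
    """Combine a site-level rank with per-site local document ranks.

    site_of: {document: site}; links: iterable of (src_doc, dst_doc).
    Assumption: global score = site rank * local rank within the site.
    """
    docs_in = defaultdict(list)
    for doc, site in site_of.items():
        docs_in[site].append(doc)

    site_edges = defaultdict(float)                        # global layer
    local_edges = defaultdict(lambda: defaultdict(float))  # local layer
    for src, dst in links:
        s, t = site_of[src], site_of[dst]
        if s == t:
            local_edges[s][(src, dst)] += 1.0
        else:
            site_edges[(s, t)] += 1.0

    site_rank = pagerank(docs_in.keys(), site_edges, damping)
    scores = {}
    for site, docs in docs_in.items():
        # Each site's local rank depends only on that site's own links,
        # so this loop could run independently on separate peers.
        local = pagerank(docs, local_edges[site], damping)
        for doc in docs:
            scores[doc] = site_rank[site] * local[doc]
    return scores


# Hypothetical toy graph: two sites, two documents each.
site_of = {"a/1": "a", "a/2": "a", "b/1": "b", "b/2": "b"}
links = [("a/1", "a/2"), ("a/2", "a/1"),
         ("a/1", "b/1"), ("a/2", "b/1"), ("b/2", "b/1")]
scores = layered_rank(site_of, links)
```

Because each local rank sums to one within its site and the site ranks sum to one overall, the combined scores again form a probability distribution over all documents. Under the stated assumptions, personalization could be introduced at either layer by replacing the uniform teleport term in `pagerank` with a user-specific one.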
Keywords/Search Tags: Topical Crawler, Search Engine, Markov Model, PageRank Algorithm, Text Categorization Algorithm