The Design and Implementation of a Web Crawler Based on the PageRank Algorithm in the Project of Malicious URL Detection

Posted on: 2011-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: X M Wang
Full Text: PDF
GTID: 2178360308962397
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of the Internet, online information keeps growing at an explosive pace, and collecting and using it effectively has become a major challenge. Search engines are an effective tool for this problem, and an efficient Web crawler is one of a search engine's core technologies. A Web crawler is a system that automatically fetches web pages from the Internet, and it is an important component of a search engine. The Web crawler described in this thesis serves a malicious URL detection project by supplying the URLs to be tested.

The thesis first gives an overview of the malicious URL detection project and a brief summary of its experimental results. It then discusses in detail the state of Web crawler research, search strategies, the PageRank algorithm, and related techniques. Finally, it presents the detailed design and implementation of the Web crawler system.

The thesis implements a multi-threaded crawler in Java that uses a breadth-first search strategy. It discusses the design and implementation of each functional module of the system, including an analysis of the key technologies and the solutions adopted. It describes the multi-threaded parallel mechanism in detail and uses a thread pool to manage the worker threads. The URL scheduling strategy uses a caching mechanism. Duplicate URLs are removed using MD5 digests combined with an LRU algorithm. The system follows interface-oriented programming so that the program is easier to extend. To meet the requirements of the malicious URL detection project, the system uses an improved PageRank algorithm to compute URL priorities.

The system was tested for crawling efficiency and overall crawl rate. Analysis of the test data shows that the system meets the requirements of the project and achieves good results.
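The abstract does not say how the MD5 and LRU steps fit together in the duplicate-URL filter. One plausible reading, offered only as a minimal sketch, is to hash each URL to a fixed-length MD5 digest and keep the digests in a bounded least-recently-used cache. The class name `UrlDeduplicator`, the method `markIfNew`, and the use of `LinkedHashMap` are assumptions made for illustration and are not taken from the thesis.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: remembers MD5 digests of recently seen URLs in an LRU cache. */
final class UrlDeduplicator {
    private final Map<String, Boolean> seen;

    UrlDeduplicator(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Evict the least recently used digest once the cache is over capacity.
                return size() > capacity;
            }
        };
    }

    /** Returns true if the URL is not in the cache (and records it), false if it is a repeat. */
    synchronized boolean markIfNew(String url) {
        String key = md5Hex(url);
        if (seen.get(key) != null) { // get() also refreshes the entry's recency
            return false;
        }
        seen.put(key, Boolean.TRUE);
        return true;
    }

    /** Returns the MD5 digest of the text as 32 lowercase hexadecimal characters. */
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

In this sketch a crawler thread would call `markIfNew(url)` before enqueueing a link and skip the link when the call returns false. Because the cache is bounded, a URL whose digest has been evicted is treated as new again, so the filter trades some repeated fetches for a fixed memory footprint.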
Keywords/Search Tags: Web crawler, Malicious URL detection, Multi-threaded, PageRank algorithm