Design and Implementation of a Parallel Web Crawler

Posted on: 2011-11-14  Degree: Master  Type: Thesis
Country: China  Candidate: Q Y Gong  Full Text: PDF
GTID: 2178360305498901  Subject: Computer software and theory
Abstract/Summary:
As computer, communication, and network technologies have matured and come into wide use, the Internet has grown rapidly since 1989 and, in little more than a decade, has become an important component of modern society's information resources. A growing number of information providers worldwide choose the Internet as their main medium. How to help users quickly find the information they need on the Internet has therefore become an important research topic. Search engines emerged as information retrieval tools to meet this need, acting as a bridge between users and the information they seek. The web crawler (also called a web spider) is the core component of a search engine and lays its foundation.

This thesis designs and implements a parallel web crawler based on the Map/Reduce parallel computing model. It implements a Master module, which is responsible for distributing tasks, and a Worker module, which is responsible for crawling web pages. Within the Worker module, the thesis implements a general crawler module, designs a DNS buffer (cache) structure, and adds duplicate-URL filtering to improve crawling efficiency.

The thesis first introduces the relevant techniques, including the HTTP protocol, the capabilities of the HttpClient library, URL filtering, and DNS buffering strategies. It then introduces parallel computing techniques and analyzes and designs each major part of the parallel web crawler. Test results are presented, covering the URL filtering algorithm and the performance of the parallel crawler. Finally, the thesis summarizes the work and discusses the outlook for parallel web crawlers.
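
The abstract does not give the data structures used for duplicate-URL filtering or DNS buffering inside the Worker module. The following Python sketch is one plausible illustration only, not the thesis's implementation: the names `normalize`, `UrlFilter`, and `DnsCache`, the in-memory set, and the 300-second expiry are all assumptions made for the example.

```python
import socket
import time
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Hypothetical helper: lowercase scheme and host, drop the fragment.

    Simplified on purpose; it also lowercases any userinfo in the netloc.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class UrlFilter:
    """Assumed duplicate filter: remembers normalized URLs in a set."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url):
        key = normalize(url)
        if key in self._seen:
            return False  # already scheduled or crawled
        self._seen.add(key)
        return True


class DnsCache:
    """Assumed DNS buffer: caches host lookups for a fixed time to live."""

    def __init__(self, resolver=socket.gethostbyname, ttl_seconds=300.0, clock=time.monotonic):
        self._resolver = resolver  # injectable so it can be replaced in tests
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # host -> (address, expiry time)

    def resolve(self, host):
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit, no lookup performed
        address = self._resolver(host)
        self._entries[host] = (address, now + self._ttl)
        return address
```

A worker would call `UrlFilter.is_new` before queuing each extracted link and `DnsCache.resolve` before opening a connection, so that repeated links and repeated hosts cost no extra network round trips. A crawler at larger scale might replace the set with a Bloom filter to bound memory, at the price of occasional false positives.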
Keywords/Search Tags: parallel web crawler, Map/Reduce, HttpClient, URL filter, DNS