
A Distributed Web Crawler For P2P Networks

Posted on: 2011-10-21
Degree: Master
Type: Thesis
Country: China
Candidate: J Ma
Full Text: PDF
GTID: 2248330395457867
Subject: Computer application technology
Abstract/Summary:
Both search engine technology and P2P technology are hot research topics that research institutions and companies are currently striving to investigate. With the rapid expansion of Web information, Web-related services are also proliferating. In this context, search engines, as indispensable information retrieval tools for network users, are attracting ever more attention. Although the Web crawler is the most important component of a search engine, most currently available crawling systems are based on a central-server model. Such systems not only consume large amounts of hardware resources but also cover only a fraction of the Web.

A distributed P2P network architecture offers scalability, robustness, load balancing, and other desirable properties. Compared with traditional distributed systems, a P2P network topology is better suited to distributed information retrieval. P2P technology allows users to search documents in depth without going through a Web server and without restrictions imposed by documents or host devices. It can reach a depth unmatched by traditional search engines, which cover only 20%-30% of network resources. P2P technology thus provides new methods and ideas for Web information search on the Internet and may become a core technology of next-generation search engines.

In this thesis, the problems in the crawling systems of traditional search engines are analyzed. In a traditional Web crawler, a central scheduler is responsible for distributing tasks among the nodes and collecting their results. On this basis, an improved distributed Web crawler framework is proposed: a fully distributed, non-centralized system in which a DHT is used to detect both URL duplication and Web page content duplication (see the illustrative sketch below). A corresponding system is then designed according to this framework. The system adopts a P2P search engine topology so that the distributed Web crawler works in a P2P environment; each crawler corresponds to a node in the P2P topology and collects data from external pages. Simulation experiments verify that the design is feasible. The system not only meets users' demand for personalized information but also offers a possible solution to the single-point-of-failure and scalability problems of traditional search engines, and is therefore of great significance for continuously improving user satisfaction.

Finally, the distributed Web crawler system for P2P networks is comprehensively tested. Experimental results show that the system obtains the required information correctly, completes recommendation work according to users' specifications, and adapts well to users frequently joining and leaving the network.
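The abstract describes DHT-based duplicate detection only at a high level, and the thesis text itself is not reproduced here. The following is a minimal Python sketch of one common way such detection can work: each URL or page body is hashed to a DHT key, the key determines which peer owns it, and that peer keeps the fingerprint set. All names here (DedupNode, is_new_url, the modulo ring placement) are hypothetical stand-ins, not taken from the thesis; a real overlay such as Chord or Kademlia would route each lookup in O(log n) hops rather than indexing a local list.

    import hashlib

    class DedupNode:
        """One crawler peer in the P2P overlay (illustrative only).
        Each peer owns a slice of the hash space and keeps the URL
        and content fingerprints that fall into that slice."""

        def __init__(self, node_id: int):
            self.node_id = node_id
            self.seen_urls: set[str] = set()
            self.seen_content: set[str] = set()

    def key_of(text: str) -> str:
        # 160-bit SHA-1 digest, the key type used by Chord/Kademlia-style DHTs
        return hashlib.sha1(text.encode("utf-8")).hexdigest()

    def responsible_node(nodes: list[DedupNode], key: str) -> DedupNode:
        # Map the key onto the ring and pick the owning peer.
        # (A real DHT routes to the owner in O(log n) hops; a plain
        # modulo over a node list stands in for that here.)
        return nodes[int(key, 16) % len(nodes)]

    def is_new_url(nodes: list[DedupNode], url: str) -> bool:
        """Ask the owning peer whether this URL was crawled before;
        record its fingerprint if it is new."""
        key = key_of(url)
        node = responsible_node(nodes, key)
        if key in node.seen_urls:
            return False
        node.seen_urls.add(key)
        return True

    def is_new_content(nodes: list[DedupNode], page_html: str) -> bool:
        """Same idea for page bodies: fingerprint the content and ask
        the peer that owns that fingerprint."""
        key = key_of(page_html)
        node = responsible_node(nodes, key)
        if key in node.seen_content:
            return False
        node.seen_content.add(key)
        return True

    if __name__ == "__main__":
        ring = [DedupNode(i) for i in range(4)]
        print(is_new_url(ring, "http://example.com/"))  # True: first sighting
        print(is_new_url(ring, "http://example.com/"))  # False: duplicate

Because the fingerprints are partitioned across peers by key rather than held in one place, no central scheduler or shared database is needed, which is consistent with the fully distributed, non-centralized design the abstract describes.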
Keywords/Search Tags:P2P Networks, Search Engine, Distributed, Web Crawler, P2P Routing Algorithm