
Research Of A Distributed Web Crawler Search Engine Based On Web Information Collection

Posted on: 2010-02-24
Degree: Master
Type: Thesis
Country: China
Candidate: C S Li
GTID: 2178360302965941
Subject: Software engineering
Abstract/Summary:
With the development of network technology, both the volume of Web information and people's demand for it have grown rapidly, which poses a great challenge to web crawler technology. A stand-alone web crawler can no longer carry the load, whereas a distributed web crawler collects information faster and at a larger scale, satisfying the growing demand for web information. The central problems in distributed-system research are the design of the architecture and the solutions to the key technologies. Drawing on the techniques and experience of earlier work, this paper designs and describes the structure of a distributed web crawler, including the hardware architecture and the division into software modules. The hardware consists of one PC serving as the control node and several PCs serving as crawling nodes, all connected on a LAN. The software is correspondingly divided into the control-node software and the crawling-node software.

The paper then describes solutions to the key technologies of the distributed system. The system uses a two-level hash mapping algorithm to partition the crawling task among the nodes efficiently, coordinates the nodes through message passing, and transmits URLs between nodes over non-blocking sockets. The result is a robust, scalable, and configurable distributed web crawler system, which the paper also analyzes carefully. The main objective of this paper is to design a distributed system on the basis of the original centralized web crawler of the science and technology resources information retrieval system, so that the centralized crawler becomes a distributed one with good robustness, manageability, dynamic configurability, and performance. The main work includes:

1. Analyzing the working principle and the components of a distributed web crawler.
2. Surveying domestic and international research on distributed web crawler systems.
3. Designing the system architecture and the composition of each module.
4. Comparing various task allocation algorithms and implementing the system's task assignment module with a two-level hash mapping algorithm (a sketch follows this list).
5. Designing and implementing the communication protocol between the nodes of the distributed system.
6. Implementing URL transmission between nodes with non-blocking sockets.
7. Designing the control-node software, enabling it to monitor and control the system.
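To illustrate item 4, here is a minimal sketch of the task assignment, reading the thesis's "II scales hash mapping" as a two-level hash mapping and assuming that the first level hashes a URL's host to choose the responsible crawling node (so that one site stays on one node), while the second level hashes the full URL into a local queue on that node; the function names and the queue layout are illustrative, not the thesis's actual code.

    import hashlib
    from urllib.parse import urlparse

    def stable_hash(text: str) -> int:
        # A stable hash, so every node computes the same mapping.
        return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)

    def assign(url: str, num_nodes: int, queues_per_node: int) -> tuple[int, int]:
        # Level 1: hash the host, so all URLs of one site land on one node.
        host = urlparse(url).netloc.lower()
        node_id = stable_hash(host) % num_nodes
        # Level 2: hash the full URL into one of that node's local queues.
        queue_id = stable_hash(url) % queues_per_node
        return node_id, queue_id

    # Example: route a newly extracted URL in a four-node system.
    print(assign("http://example.com/page.html", num_nodes=4, queues_per_node=8))

Because every node can evaluate the same mapping locally, a crawling node that extracts a URL can decide on its own whether to keep it or forward it, without consulting the control node.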
The main contents and results are analyzed and described in detail. The full text is organized as follows:

Chapter I introduces the research background, the significance of the problem, and the source of the topic; by summarizing and analyzing the state of the research, it outlines the main content of this paper.
Chapter II focuses on how search engines work and on the relevant background knowledge and research situation of distributed web crawlers.
Chapter III introduces several topologies of distributed systems and the structural design of the distributed web crawler, together with the composition of the crawler and the role of each module.
Chapter IV focuses on several task allocation strategies, gives a detailed analysis of the two-level hash mapping assignment algorithm, and finally introduces a strategy for extracting the site name from a URL.
Chapter V focuses on the design of the communication module, including the detailed design of the messaging sub-module and the URL transmission sub-module; it describes the system's communication protocol and introduces the non-blocking socket used for URL transmission.
Chapter VI summarizes the research content and the results of the work, and proposes prospects for further work.

This paper addresses the national science and technology platform portal application system's need for a search engine by designing and implementing a distributed web crawler on the basis of the portal application system's original centralized web crawler. It compares the respective advantages and disadvantages of distributed and centralized systems, briefly introduces the current domestic and international research status of distributed web crawlers, and designs a distributed system architecture composed of crawling nodes and a control node. Within this distributed system, it implements the two-level hash mapping task allocation algorithm, designs a set of communication protocols for coordinating the nodes, and transmits URLs between nodes over non-blocking sockets (a sketch of such a non-blocking URL link follows).
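As a rough illustration of the URL transmission mechanism, here is a minimal sketch of a non-blocking receiver built on Python's selectors module, assuming URLs arrive newline-separated over TCP; the port and the framing are assumptions for the example, not the thesis's actual protocol.

    import selectors
    import socket

    sel = selectors.DefaultSelector()

    def accept(server):
        conn, _ = server.accept()
        conn.setblocking(False)  # peer sockets are non-blocking too
        sel.register(conn, selectors.EVENT_READ, read_urls)

    def read_urls(conn):
        data = conn.recv(4096)  # never blocks: the selector said it is readable
        if data:
            # A real implementation would buffer partial lines across reads.
            for url in data.decode().splitlines():
                print("received URL:", url)  # hand off to the frontier here
        else:
            sel.unregister(conn)
            conn.close()

    server = socket.socket()
    server.bind(("0.0.0.0", 9999))  # illustrative port
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    # One thread multiplexes all node-to-node URL links, so a slow
    # peer cannot stall the crawler.
    while True:
        for key, _ in sel.select():
            key.data(key.fileobj)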
Several improvements remain as future work:

1. Replace TCP with UDP in the message communication sub-module. Messages between nodes are currently sent over TCP. Because message passing between nodes is not very frequent, each TCP connection is closed after the message is sent in order to save system resources and re-established the next time a message must be sent, which makes message passing very inefficient. Using UDP instead, especially when sending a message to all nodes, would greatly improve transfer efficiency. However, UDP guarantees neither ordered nor reliable delivery, so suitable algorithms and data structures would be needed to synchronize messages and ensure reliable transmission.

2. Support resuming a crawl after an interruption. A web crawl can run for several weeks or even months. If the system supports resuming from a checkpoint, then after a fatal failure, a shutdown, or network congestion, the crawl state is preserved and the crawl can continue once the system recovers. The system does not need to keep a full backup of its data; incremental backups suffice, so that after a restart it can continue the previous crawl while re-downloading only a small number of pages.

3. Make the crawler crawl politely. If the distributed crawler opens many TCP connections to the same web server within a short time, or downloads large amounts of data from it, it places a heavy burden on that server: the server may crash, or it may treat the crawler as mounting a denial-of-service attack and refuse its visits. Measures are therefore needed so that the distributed crawler leaves a certain time interval between two consecutive accesses to the same web server.

4. Support automatic data migration when a new node joins; the data in question is the set of URLs each node has already visited. If, when a new node joins the distributed system, the existing nodes do not hand over the data that now belongs to it, URLs will be crawled in duplicate. To reduce the amount of data transferred, only the checksum values of the visited URLs need to be transmitted (a sketch follows this list).
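To illustrate the fourth item, here is a minimal sketch of checksum-based migration of the visited-URL set when the node count grows, reusing the host-to-node hash from the earlier sketch; the names and the choice of hex digests as the wire format are assumptions for the example, not the thesis's design.

    import hashlib
    from urllib.parse import urlparse

    def host_node(url: str, num_nodes: int) -> int:
        # First-level hash: map the URL's host to a node id.
        host = urlparse(url).netloc.lower()
        return int(hashlib.md5(host.encode()).hexdigest(), 16) % num_nodes

    def url_checksum(url: str) -> str:
        # A fixed-size digest instead of the full URL keeps migration traffic small.
        return hashlib.md5(url.encode()).hexdigest()

    def migrate(visited_urls, old_nodes: int, new_nodes: int) -> dict:
        # Each old node runs this over its own visited set and ships the
        # listed checksums to the new owner, which adds them to its
        # duplicate filter so those URLs are not crawled again.
        outgoing = {}
        for url in visited_urls:
            new_owner = host_node(url, new_nodes)
            if new_owner != host_node(url, old_nodes):
                outgoing.setdefault(new_owner, []).append(url_checksum(url))
        return outgoing

    print(migrate(["http://example.com/a", "http://example.org/b"], 3, 4))

With plain modulo hashing, adding a node reassigns many hosts at once; a consistent-hashing variant of the first level would shrink the migrated set at the cost of a more complex mapping.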
Keywords/Search Tags: Search engine, web crawler, distributed systems