
Study And Implementation Of Web Crawler On Cloud Platform

Posted on: 2017-11-02    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Liu    Full Text: PDF
GTID: 2348330485486050    Subject: Computer application technology
Abstract/Summary:
With the constant development of the Internet, the way we obtain information is increasingly mediated by the Web, while the volume of online information keeps growing at an alarming rate. Faced with so much information, crawling data accurately and rapidly has become a research hotspot. To improve crawler efficiency, many companies have adopted distributed web crawlers, which run in parallel across several machines and fetch data from the Internet. This thesis designs and implements a distributed web crawler system on a cloud platform to improve performance and scalability.

First, based on the characteristics of the cloud platform and the limitations of existing distributed web crawler systems, we propose the overall design of the crawler system. The system is divided into three modules: the control node, the crawl nodes, and the web management interface. The control node is responsible for URL management and virtual machine management: new URLs are normalized with URL standardization rules, de-duplicated with a Bloom filter, and stored in Redis, while virtual machine management dynamically adjusts the number of crawl nodes. Crawl nodes fetch web pages from the Internet, covering page download, page parsing, and data storage. To improve page download speed, the DNS cache uses a hash-chain data structure; page parsing uses templating to improve its generality; and collected data is stored on cloud disks. The web management interface administers the crawler: users can create tasks, manage tasks, and monitor crawl nodes.

Second, we implement the proposed web crawler. Each module is developed in Java, and control information is exchanged between the control node and the crawl nodes using socket programming. The page download module uses the HttpClient component to fetch pages and provides several recovery measures for failed downloads.
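The URL de-duplication step described above can be sketched as follows. This is an illustrative minimal Bloom filter over a `BitSet`, not the thesis's actual implementation, which also applies URL standardization rules and persists the URL queue in Redis (both omitted here); the class name, sizes, and hash scheme are assumptions for the sketch.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch for URL de-duplication (illustrative only).
// A false result from addIfNew means the URL was (probably) seen before;
// Bloom filters never produce false negatives, only rare false positives.
class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position via double hashing from two base hashes.
    private int index(String url, int i) {
        int h1 = url.hashCode();
        int h2 = h1 >>> 16;
        return Math.floorMod(h1 + i * (h2 | 1), size);
    }

    // Returns true if the URL was new, and marks all its bits as seen.
    boolean addIfNew(String url) {
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            int idx = index(url, i);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        return !seen;
    }
}
```

In the thesis's architecture, a check like this would run on the control node before a new URL is pushed into the Redis queue, so crawl nodes never receive duplicate work.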
The page parsing module uses regular expressions, XPath, and CSS selectors to extract data. The web management interface is built on the Spring MVC framework and the Jetty container, using JSP, JavaScript, AJAX, and other web programming technologies.

Third, we deploy the proposed crawler on a cloud platform and test it in three respects: functionality, performance, and scalability. The test results show that the system has good availability and scalability.

Finally, we summarize the work, analyze the shortcomings of the proposed crawler system, and suggest future research directions.
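Of the three extraction techniques mentioned for page parsing, the regular-expression branch can be sketched with the JDK's built-in `java.util.regex` package alone; the class and method names here are illustrative, and the thesis's XPath and CSS-selector branches (which rely on external parsing libraries) are omitted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of regex-based field extraction from a downloaded page.
// Real-world parsing of arbitrary HTML favors a proper parser; regexes
// suit narrow, well-known fields such as the <title> element shown here.
class TitleParser {
    private static final Pattern TITLE = Pattern.compile(
        "<title>(.*?)</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the page title, or null if no <title> element is present.
    static String parseTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

A templated parser, as described in the thesis, would generalize this idea: each crawl task supplies its own set of extraction patterns instead of hard-coding them.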
Keywords/Search Tags:cloud platform, distributed system, web crawler