
Study And Implementation Of Web Crawler On Cloud Platform

Posted on: 2017-11-02    Degree: Master    Type: Thesis
Country: China    Candidate: X Y Liu    Full Text: PDF
GTID: 2348330485486050    Subject: Computer application technology
Abstract/Summary:
With the constant development of the Internet, the way we obtain information is increasingly mediated by the Web, while the volume of online information keeps growing at an alarming rate. Faced with so much information, crawling data accurately and rapidly has become a research hotspot. To improve crawler efficiency, many companies have adopted distributed web crawlers, which run in parallel across several machines and fetch data from the Internet. This thesis designs and implements a distributed web crawler system on a cloud platform to improve performance and scalability.

First, based on the characteristics of the cloud platform and the limitations of existing distributed web crawler systems, we propose the overall design of the crawler system. The system is divided into three modules: the control node, the crawl nodes, and the web management interface. The control node is responsible for URL management and virtual machine management: new URLs are normalized with URL standardization rules, de-duplicated with a Bloom filter, and stored in Redis, while virtual machine management dynamically adjusts the number of crawl nodes. Crawl nodes fetch web pages from the Internet, covering page download, page parsing, and data storage. To improve page download speed, the DNS cache uses a hash-chain data structure; page parsing uses templating to improve its generality; and collected data is stored on cloud disks. The web management interface administers the crawler: users can create tasks, manage tasks, and monitor crawl nodes.

Second, we implement the proposed web crawler. Each module is developed in Java, and control information is exchanged between the control node and the crawl nodes using socket programming. The page download module uses the HttpClient component to fetch pages and provides several recovery measures for failed downloads.
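The URL de-duplication step described above can be sketched as follows. This is an illustrative minimal Bloom filter over a `BitSet`, not the thesis's actual implementation, which also applies URL standardization rules and persists the URL queue in Redis (both omitted here); the class name, sizes, and hash scheme are assumptions for the sketch.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch for URL de-duplication (illustrative only).
// A false result from addIfNew means the URL was (probably) seen before;
// Bloom filters never produce false negatives, only rare false positives.
class UrlBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    UrlBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derive the i-th bit position via double hashing from two base hashes.
    private int index(String url, int i) {
        int h1 = url.hashCode();
        int h2 = h1 >>> 16;
        return Math.floorMod(h1 + i * (h2 | 1), size);
    }

    // Returns true if the URL was new, and marks all its bits as seen.
    boolean addIfNew(String url) {
        boolean seen = true;
        for (int i = 0; i < hashCount; i++) {
            int idx = index(url, i);
            if (!bits.get(idx)) {
                seen = false;
                bits.set(idx);
            }
        }
        return !seen;
    }
}
```

In the thesis's architecture, a check like this would run on the control node before a new URL is pushed into the Redis queue, so crawl nodes never receive duplicate work.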
The page parsing module uses regular expressions, XPath, and CSS selectors to extract data. The web management interface is built on the Spring MVC framework and the Jetty container, using JSP, JavaScript, AJAX, and other web programming technologies.

Third, we deploy the proposed crawler on a cloud platform and test it in three respects: functionality, performance, and scalability. The test results show that the system has good availability and scalability.

Finally, we summarize the work, analyze the shortcomings of the proposed crawler system, and suggest future research directions.
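Of the three extraction techniques mentioned for page parsing, the regular-expression branch can be sketched with the JDK's built-in `java.util.regex` package alone; the class and method names here are illustrative, and the thesis's XPath and CSS-selector branches (which rely on external parsing libraries) are omitted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of regex-based field extraction from a downloaded page.
// Real-world parsing of arbitrary HTML favors a proper parser; regexes
// suit narrow, well-known fields such as the <title> element shown here.
class TitleParser {
    private static final Pattern TITLE = Pattern.compile(
        "<title>(.*?)</title>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the page title, or null if no <title> element is present.
    static String parseTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

A templated parser, as described in the thesis, would generalize this idea: each crawl task supplies its own set of extraction patterns instead of hard-coding them.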
Keywords/Search Tags:cloud platform, distributed system, web crawler