Design And Implementation Of A Distributed Dynamic Web Crawler System

Posted on: 2020-08-13
Degree: Master
Type: Thesis
Country: China
Candidate: W T Hu
Full Text: PDF
GTID: 2518306104495704
Subject: Software engineering
Abstract/Summary:
With the advent of the era of big data and artificial intelligence, the amount of information on the network is growing explosively. How to collect the required data quickly and accurately from this mass of information has become a key research topic. Existing web crawler tools and systems alleviate the problem to some extent, but centralized, single-node crawler systems cannot fully meet the demands of real production environments. In addition, a large number of pages on the Internet are now dynamic pages built with Ajax, which poses a great challenge to crawler design.

To address these two problems, this thesis designs and implements a distributed crawler system that supports dynamic page crawling, aiming to overcome the low crawling speed of traditional web crawlers and the low efficiency of dynamic page crawling. The system applies a distributed architecture to the structure of the traditional crawler and separates the slow dynamic page download module into an independent distributed service. This design resolves the speed mismatch between the crawler control module and the dynamic page download module. The crawler nodes are peers of equal status and communicate with one another through the RabbitMQ message queue, which gives the system good scalability.

The system provides a page parser based on the WebMagic framework, so that user crawler programs can easily extract data from crawled pages. The dynamic page downloader is implemented with Puppeteer, a Node.js-based framework whose API controls a headless Chrome browser to download pages and simulate user operations; the system uses this API to build its solution for crawling dynamic pages. The system also provides storage for crawled page data and basic URL filtering.

The distributed dynamic page crawler system designed and implemented in this thesis improves on the traditional crawler, raising crawling performance and providing a basis for further research on crawler systems. After optimization, the dynamic page download module provides good support for crawling asynchronously loaded dynamic pages, further improving the system's ability to crawl dynamic pages. System test results show that the expected functions are implemented and the performance targets are met.
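
The decoupling described above, in which a fast control module hands URLs to a pool of slow page downloaders through a queue after basic URL filtering, can be illustrated with a minimal sketch. This is not the thesis implementation: an in-process `queue.Queue` stands in for RabbitMQ, `fake_render` is a hypothetical placeholder for the Puppeteer-driven headless browser, and all class and function names are assumptions made for illustration.

```python
import queue
import threading
import time
from typing import Dict, Iterable, Optional
from urllib.parse import urldefrag


class UrlFilter:
    """Basic URL filter (hypothetical): keeps HTTP(S) URLs not seen before."""

    def __init__(self) -> None:
        self._seen = set()
        self._lock = threading.Lock()

    def admit(self, url: str) -> Optional[str]:
        """Return the normalized URL if it should be crawled, else None."""
        normalized, _fragment = urldefrag(url.strip())
        if not normalized.startswith(("http://", "https://")):
            return None
        with self._lock:
            if normalized in self._seen:
                return None
            self._seen.add(normalized)
        return normalized


def fake_render(url: str) -> str:
    """Placeholder for a slow headless-browser download (assumption)."""
    time.sleep(0.05)  # simulate page rendering latency
    return f"<html>rendered {url}</html>"


def download_worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Consume URLs until a None sentinel arrives, publishing rendered pages."""
    while True:
        url = tasks.get()
        if url is None:
            break
        results.put((url, fake_render(url)))


def crawl(seed_urls: Iterable[str], workers: int = 4) -> Dict[str, str]:
    """Control module: filter URLs, enqueue them, and collect rendered pages."""
    tasks: queue.Queue = queue.Queue()    # stands in for a RabbitMQ task queue
    results: queue.Queue = queue.Queue()  # stands in for a RabbitMQ result queue
    url_filter = UrlFilter()

    threads = [
        threading.Thread(target=download_worker, args=(tasks, results))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()

    # The control module only enqueues; it never waits on a single download.
    for url in seed_urls:
        admitted = url_filter.admit(url)
        if admitted is not None:
            tasks.put(admitted)

    for _ in threads:
        tasks.put(None)  # one shutdown sentinel per worker
    for thread in threads:
        thread.join()

    pages: Dict[str, str] = {}
    while not results.empty():
        url, html = results.get()
        pages[url] = html
    return pages
```

Calling `crawl` with a list of seed URLs returns a dictionary mapping each admitted URL to its rendered page. Because the slow downloads run in separate workers behind the queue, adding workers raises throughput without changing the control loop, which is the property the distributed design relies on.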
Keywords/Search Tags: Distributed crawler, Dynamic web, Puppeteer framework, Message queue