
The Design And Implementation Of Distributed Web Crawler System Based On Markup Template

Posted on: 2020-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: L Yang
Full Text: PDF
GTID: 2428330590450635
Subject: Software engineering
Abstract/Summary:
Web crawlers meet people's need to quickly obtain specified network information. However, traditional web crawlers must be customized for each website, and such development suffers from a long process, high cost, high difficulty, and low automation. At the same time, the gap between the speed at which crawlers can be developed and the speed at which websites appear and change places a growing burden on developers. To address this contradiction, this thesis designs a distributed web crawler system based on markup templates. The system can automatically generate crawler instances from a crawler template that contains only a small amount of web page information and then complete the collection task. The goal of the system is to supply a large volume of real-time network data to a network public opinion system while reducing both the difficulty and the length of crawler development.

The system is built by modifying and extending the Scrapy framework to support distributed crawling. The main work includes the following points. A crawler template is defined: the system generates crawler instances and performs collection tasks according to the target website, the collection elements, and the crawler configuration described by the template. A semi-automatic, markup-based page element extraction algorithm is designed: it takes XPath expressions and website information as features and uses a clustering strategy to generate extraction rules, realizing automatic extraction. A two-level deduplication scheme is implemented: compressed URLs with expiration times are cached in memory as the first-level cache, and URLs are persisted to disk as key-value pairs to form the second level, realizing fast deduplication for incremental crawling without memory overflow and increasing the stability of the system. By encapsulating the page rendering engine and the browser kernel, developers can choose the most suitable way to collect dynamic pages. Finally, according to the characteristics of the collected data, natural language processing techniques are used to process and clean the data.

The developed web crawler system can automatically collect from a large number of websites, shorten the development process, avoid the time and difficulty of developing a crawler for each individual website, and solve the problems of low efficiency and poor scalability of single-machine crawlers. Users do not need to learn a page extraction grammar. The test results show that, through simple configuration, the system achieves efficient incremental collection from a large number of websites, including dynamic pages. This effectively reduces the difficulty of and the requirements for developing crawlers, and improves collection efficiency while ensuring the accuracy of the data.
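To make the template idea concrete, the following is a minimal sketch of what a markup template and the generation of a Scrapy crawler instance from it might look like. The field names (site, start_urls, fields, settings) and the build_spider helper are illustrative assumptions, not the thesis's actual schema.

```python
# Sketch: generating a Scrapy spider from a markup template.
# All names in TEMPLATE and build_spider are hypothetical.
import scrapy

TEMPLATE = {
    "site": "example-news",
    "start_urls": ["https://news.example.com/list"],
    "fields": {                            # collection elements described by XPath
        "title": "//h1[@class='title']/text()",
        "body": "//div[@class='content']//p/text()",
    },
    "settings": {"DOWNLOAD_DELAY": 1.0},   # per-template crawler configuration
}

def build_spider(template):
    """Generate a Scrapy spider class from a markup template."""
    class TemplateSpider(scrapy.Spider):
        name = template["site"]
        start_urls = template["start_urls"]
        custom_settings = template["settings"]

        def parse(self, response):
            # Extract each configured field with its XPath rule.
            yield {
                field: response.xpath(xpath).getall()
                for field, xpath in template["fields"].items()
            }

    return TemplateSpider
```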
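The two-level deduplication scheme can be sketched as follows: URLs are compressed into fixed-size fingerprints, held in memory with an expiration time, and backed by a persistent key-value store on disk. The use of hashlib and Python's dbm module here is an assumption for illustration; the thesis only specifies the two-level structure, not the concrete components.

```python
# Sketch: two-level URL deduplication (in-memory cache with TTL + disk key-value store).
# hashlib/dbm are illustrative choices, not the thesis's stated implementation.
import dbm
import hashlib
import time

class TwoLevelDupeFilter:
    def __init__(self, db_path="seen_urls.db", ttl=3600):
        self.ttl = ttl                       # expiration time of in-memory entries
        self.memory = {}                     # fingerprint -> insertion timestamp
        self.disk = dbm.open(db_path, "c")   # persistent key-value store

    def _fingerprint(self, url):
        # Compress the URL into a fixed-size digest to bound memory use.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def seen(self, url):
        fp = self._fingerprint(url)
        now = time.time()
        # First level: in-memory cache with expiration.
        ts = self.memory.get(fp)
        if ts is not None and now - ts < self.ttl:
            return True
        # Second level: persistent store, so incremental crawls survive restarts.
        if fp.encode() in self.disk:
            self.memory[fp] = now            # refresh the hot cache
            return True
        # New URL: record it at both levels.
        self.memory[fp] = now
        self.disk[fp.encode()] = b"1"
        return False
```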
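Letting developers choose how dynamic pages are collected could look like the sketch below, where the template decides whether a request goes through a rendering engine or Scrapy's normal downloader. The "render" flag and the use of scrapy_splash are assumptions; the thesis only states that the page rendering engine and browser kernel are wrapped behind a common interface.

```python
# Sketch: choosing between a rendered and a plain request per template.
# The "render" flag and SplashRequest usage are illustrative assumptions.
import scrapy
try:
    from scrapy_splash import SplashRequest   # JavaScript rendering engine (Splash)
except ImportError:
    SplashRequest = None

def make_request(url, template, callback):
    """Return a rendered or plain request depending on the template."""
    if template.get("render") and SplashRequest is not None:
        # Render JavaScript before extraction.
        return SplashRequest(url, callback, args={"wait": 1.0})
    # Static pages go through Scrapy's normal downloader.
    return scrapy.Request(url, callback)
```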
Keywords/Search Tags: Distributed web crawlers, Template, Deduplication, Page rendering