| In the era of big data, mining data and making full use of it have become core to enterprise competitiveness. At present, web crawlers are the main method of web data acquisition. For large-scale web data acquisition, however, traditional web crawlers suffer from low crawling efficiency, difficult operation, and security problems. This paper therefore constructs a distributed web crawler data acquisition method based on the Scrapy-redis framework. An optimized Bloom filter de-duplication algorithm is proposed, Docker containers are used for project management, and an experimental platform for efficient web data acquisition is built for the agricultural field. The specific research results are as follows: (1) An efficient acquisition framework for web data is built. The Scrapy-redis distributed crawler framework is improved to support multi-node parallel crawling. The method uses YAML files to make crawlers configurable, uses GNE to extract the main body text of web pages to improve data accuracy, and uses Docker containers and the Rancher platform to manage the distributed crawlers. (2) A method for de-duplicating massive numbers of URLs is proposed. To address the high space consumption of traditional disk-based or memory-based URL de-duplication methods, a Bloom filter is used: multiple hash functions map each element of the URL set onto a bit array, which reduces memory usage and greatly improves the efficiency of duplicate checking. (3) An experimental system for efficient web data acquisition is developed. Targeting the agricultural field, Ajax, the Flask framework, HTML, CSS, JavaScript, and other technical tools are integrated to develop a verification system for efficient web data acquisition. Users enter search keywords to match the URLs of the corresponding resources, use the task management console to start or stop crawler processes, obtain the corresponding data, and then display and save it. |
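
The URL de-duplication idea in result (2) can be illustrated with a minimal sketch of a textbook Bloom filter. This is not the optimized algorithm the paper proposes, whose details the abstract does not give. The class name `BloomFilter`, the helper `seen_before`, the in-memory `bytearray` (a distributed crawler would more plausibly keep the bits in a shared store such as Redis), the SHA-256 double-hashing scheme, and the default error rate are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Textbook Bloom filter over an in-memory bit array (illustrative only)."""

    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, url: str):
        # Assumption: derive the k hash positions by double hashing
        # two 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, url: str) -> None:
        # Set every bit selected by the k hash positions.
        for pos in self._positions(url):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, url: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(url))


def seen_before(bloom: BloomFilter, url: str) -> bool:
    """Hypothetical helper: report whether url was probably seen, recording it if not."""
    if url in bloom:
        return True
    bloom.add(url)
    return False
```

A crawler scheduler could call `seen_before(bloom, url)` before enqueueing a request and skip the URL when it returns `True`. The memory cost is fixed by `capacity` and `error_rate` rather than by URL length, which is the source of the space saving; the trade-off is a small probability of wrongly skipping a URL that was never crawled, while a URL that was added is never reported as new.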