| After years of development, agricultural informatization has achieved relatively good results, and agricultural information resources are now abundant on the major websites. However, because the informatization process was carried out under the old institutional framework, its foundation is relatively weak. As a result, agricultural information resources are massive and decentralized, which creates a contradiction between increasingly rich network information resources and the lack of a personalized data acquisition platform: information resources are wasted while no platform exists to acquire them. Under these circumstances, how to innovate data collection methods has become an important application topic for modern information technology in the agricultural field, and web crawlers, as data collection tools, are considered key to solving this problem. To address these problems, this research works in a Python and Scrapy environment, takes meteorological websites and agricultural product prices as crawling targets, and exploratorily designs a topic content recognition algorithm based on the BERT model, which is used to evaluate the relevance between Web links and topic content. Finally, a distributed agricultural network data acquisition platform based on Scrapy-Redis is implemented. The work is divided into five parts: (1) To address the shortcoming that traditional search engines return results lacking domain specialization, this research designs a Python-based XPath topic content extraction algorithm and a BERT-based agricultural topic content recognition algorithm, and focuses on how the latter evaluates the relevance between Web links and topic content. The algorithm is applied in projects that collect agricultural product prices. The research shows that it achieves relatively good recognition performance in natural-language text analysis in the agricultural field. (2) To determine whether web crawler technology can be applied in agriculture, this research chooses the Scrapy framework, which is simple to operate and fully featured, designs a Scrapy-based data acquisition experiment on an agrometeorological website, verifies the applicability of Scrapy to agricultural topics, and lays a foundation for the subsequent use of web crawlers to collect agricultural network data. (3) To address the slow collection speed of general web crawlers, a distributed crawler framework based on Scrapy-Redis is designed and applied to the collection of agricultural product prices. The Scheduler and Item Pipeline components of the stand-alone Scrapy framework are redeveloped for agricultural projects so that they can perform distributed acquisition tasks. The distributed module consists of one master host and four slave hosts. The research shows that, compared with a single-machine web crawler, the distributed crawler achieves a multiple-fold improvement in data acquisition speed. (4) To counter the anti-crawler measures that some websites direct at crawler programs, a crawler protection mechanism is designed with preset strategies, such as sending a User-Agent header to pass User-Agent checks and adjusting the access frequency. This effectively avoids the risk of being blocked, strengthens the robustness of the crawler system, and consolidates the stability of the agricultural network data platform. (5) A network data acquisition platform for agriculture is designed. The interface of each acquisition module is built with Qt and other program frameworks. Based on the above work, this research implements a distributed agricultural network data acquisition platform based on Scrapy-Redis, which combines the topic content recognition algorithm with web crawler technology. |
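As a rough illustration of parts (3) and (4), a Scrapy project can be switched to Redis-backed distributed crawling and throttled access through its settings module. The sketch below is not the configuration used in this research: it relies on the stock scrapy-redis scheduler, duplicate filter, and pipeline rather than the redeveloped components described above, and the Redis address, User-Agent string, and all numeric values are assumptions chosen for illustration.

```python
# settings.py (illustrative sketch, not the authors' configuration).
# Uses the stock scrapy-redis classes; the research redeveloped its own
# Scheduler and Item Pipeline components, which are not shown here.

# Route request scheduling through Redis so all workers share one queue.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests across workers with a Redis-backed fingerprint set.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the request queue and fingerprints in Redis between runs.
SCHEDULER_PERSIST = True

# Redis instance on the master host (hypothetical address).
REDIS_URL = "redis://master-host:6379/0"

# Push scraped items into Redis so the master can collect them centrally.
ITEM_PIPELINES = {"scrapy_redis.pipelines.RedisPipeline": 300}

# Send a browser-like User-Agent (example string, assumed).
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

# Lower the access frequency: base delay in seconds, randomized by Scrapy.
DOWNLOAD_DELAY = 2.0
RANDOMIZE_DOWNLOAD_DELAY = True

# Let Scrapy adapt the delay to the observed server latency.
AUTOTHROTTLE_ENABLED = True

# Cap parallel requests per site on each worker (assumed value).
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```

With settings of this kind, each slave host would run the same spider against the shared Redis queue on the master, so adding workers increases throughput without re-crawling URLs that another worker has already fetched.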