With the rapid development of the Internet, big data has penetrated every industry and business function, and its value has grown increasingly significant. Extracting meaningful, valuable data is therefore especially important, and web crawlers for Internet information collection face enormous opportunities and challenges. At present, large search engines at home and abroad provide users only with non-customizable search services; a stand-alone crawler can hardly handle large-scale collection, while existing distributed web crawlers are powerful and efficient but difficult for ordinary users to understand and use. This paper designs and implements a customizable web crawler system based on distributed collection. It can accurately identify web page elements in batches, automatically generate extraction rules, support collection from complex sites with differing page structures, cover a variety of collection scenarios, and crawl data efficiently. Users visually edit crawler task scripts according to their own needs, and the system completes the data collection automatically. The main work of this paper includes the following points:

(1) Based on research into the embedded Qt framework, an embedded browser is developed that records the user's web page operations, obtains element positioning information, and intelligently identifies similar page elements, realizing the graphical terminal interface of the custom crawler system. It lets the user visually edit a crawler task and finally generates a user-defined crawler task script.

(2) On top of the existing scrapy-redis distributed architecture, Redis is used as the task storage queue to implement a crawler system with a master-slave distributed architecture. To address the problem that the heterogeneity of the physical slave nodes leads to differing numbers of virtual nodes, this paper proposes an algorithm for adaptively adjusting virtual nodes: each physical slave node adjusts its number of virtual nodes in real time according to its own load, keeping its own load near optimal. For the task scheduling problem at the central node, a limited load-balancing algorithm is proposed: the master node assigns parallel tasks to the virtual node set with the smallest load while ensuring that the set resides on a single physical node, which simplifies task management and keeps the crawler nodes in the system load-balanced.

(3) Custom crawlers are designed and implemented with Python + Selenium. The crawler parses the crawler task script and drives a browser to operate on the web page according to the instructions in the script, realizing customized collection. At the same time, to keep the crawler from being blocked by sites' anti-crawler strategies, a dynamic IP proxy pool is designed and implemented: it crawls proxies from multiple sites, periodically and asynchronously verifies their validity, and monitors the number and quality of IPs in the pool in real time, providing high-quality IPs for the system.
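The virtual-node mechanism of point (2) can be sketched as a consistent-hash ring in which each physical slave owns a variable number of virtual nodes. The class names below and the linear load-to-count rule in `adapt_vnode_count` are illustrative assumptions, not the thesis's exact formula; the sketch only shows how re-registering a slave with a different virtual-node count redistributes tasks.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a key to a point on the hash ring via MD5."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring where each physical slave node owns a
    variable number of virtual nodes (a sketch, not the thesis code)."""

    def __init__(self):
        self._ring = {}    # ring position -> physical node name
        self._points = []  # sorted ring positions
        self._vnodes = {}  # physical node -> current virtual-node count

    def set_virtual_nodes(self, node: str, count: int):
        """(Re)register `node` with `count` virtual nodes; called when the
        slave reports a load change (the adaptive adjustment step)."""
        # Remove this physical node's old virtual nodes from the ring.
        for pos in [p for p, n in self._ring.items() if n == node]:
            del self._ring[pos]
            self._points.remove(pos)
        for i in range(count):
            pos = _hash(f"{node}#vn{i}")
            self._ring[pos] = node
            bisect.insort(self._points, pos)
        self._vnodes[node] = count

    def locate(self, task_key: str) -> str:
        """Return the physical node responsible for a task URL/key:
        the first virtual node clockwise from the key's hash."""
        pos = _hash(task_key)
        idx = bisect.bisect(self._points, pos) % len(self._points)
        return self._ring[self._points[idx]]

def adapt_vnode_count(base: int, load: float) -> int:
    """Illustrative adaptation rule (an assumption): shrink the
    virtual-node count as the node's load ratio (0.0-1.0) grows,
    so heavily loaded slaves receive fewer new tasks."""
    return max(1, round(base * (1.0 - load)))
```

Because only the adjusted slave's virtual nodes move on the ring, re-registering one node leaves most task-to-node assignments untouched, which is the usual reason for using consistent hashing here.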
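The script-driven collection of point (3) can be sketched as a small interpreter that replays recorded steps against a browser driver. The JSON step format and field names below are assumptions (the thesis does not specify the script syntax), and the driver is injected so the replay logic runs without Selenium; in production it would wrap a Selenium WebDriver.

```python
import json

# Hypothetical task-script format: a JSON list of recorded steps.
# The "action"/"selector"/"value" field names are assumptions.
SCRIPT = json.dumps([
    {"action": "open", "value": "http://example.com/list"},
    {"action": "click", "selector": "#next-page"},
    {"action": "extract", "selector": ".item-title"},
])

def run_script(script_json: str, driver) -> list:
    """Replay a recorded crawler task script step by step. `driver` is
    any object exposing get/click/extract; with Selenium it would wrap
    a WebDriver instance."""
    results = []
    for step in json.loads(script_json):
        if step["action"] == "open":
            driver.get(step["value"])          # navigate to a page
        elif step["action"] == "click":
            driver.click(step["selector"])     # replay a recorded click
        elif step["action"] == "extract":
            results.extend(driver.extract(step["selector"]))  # pull data
    return results
```

Injecting the driver also makes the interpreter easy to unit-test with a stub before wiring it to a real browser.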
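The proxy-pool idea in point (3) — timed revalidation, quality monitoring, serving high-quality IPs — can be sketched as a scored pool. The scoring scheme and class name are assumptions; `check` is injected so the logic runs offline, whereas in production it would issue a real HTTP request through the proxy.

```python
class ProxyPool:
    """Dynamic IP proxy pool sketch: proxies scraped from multiple sites
    are scored, revalidated periodically, and the best one is served to
    crawler nodes. Scores and eviction thresholds are illustrative."""

    def __init__(self, check, min_score=0, max_score=10):
        self.check = check            # injected validity test for a proxy
        self.min_score = min_score
        self.max_score = max_score
        self.scores = {}              # proxy -> current quality score

    def add(self, proxy: str):
        """Register a newly scraped proxy at a neutral starting score."""
        self.scores.setdefault(proxy, self.max_score // 2)

    def revalidate(self):
        """Timed validation pass: raise the score of working proxies,
        lower failing ones, and evict proxies at the minimum score."""
        for proxy in list(self.scores):
            if self.check(proxy):
                self.scores[proxy] = min(self.scores[proxy] + 1,
                                         self.max_score)
            else:
                self.scores[proxy] -= 1
                if self.scores[proxy] <= self.min_score:
                    del self.scores[proxy]

    def best(self) -> str:
        """Return the highest-quality proxy for the next request."""
        return max(self.scores, key=self.scores.get)
```

Scheduling `revalidate` on a timer (and running the checks asynchronously) gives the periodic, asynchronous validity verification the abstract describes.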