With the rapid development of the Internet, information on the network is growing exponentially, and this flood of information poses great challenges to users seeking to access it. Web crawlers, as tools for acquiring data, are widely used in search engines. However, crawlers built for small and medium-sized systems often face many problems due to their own limitations: a single crawler fetches data too slowly; most mature open-source crawler frameworks do not implement distributed crawling; and, because web pages differ in structure, a single crawler cannot match all types of pages. Designing and implementing a highly customizable, simple, stable, and high-performance distributed crawler for small and medium-sized systems is therefore of vital significance. In this paper, a distributed web crawler system is designed and implemented on the basis of the Scrapy framework combined with the Redis database, through which users can rapidly capture the desired data with only simple configuration. The main work of this paper includes the following points:

(1) Focusing on task scheduling under a master-slave architecture, a task scheduling strategy based on dynamic feedback is proposed. The master node uses this strategy to schedule tasks while tracking the real-time status of each Scrapy crawler in the slave node group, and adjusts task assignments when crawler nodes change, ensuring dynamic load balancing across the crawler nodes in the system.

(2) To address the high space consumption of traditional in-memory or on-disk URL deduplication, a Bloom-filter-based deduplication strategy for massive URL sets is proposed. The strategy maps the original URL collection into a compressed bit space using multiple hash functions, greatly reducing the space it occupies; during querying, a few hash computations suffice to judge whether a URL has already been crawled, greatly improving query efficiency.

(3) A multi-node crawler speed-limit strategy is designed and implemented, so that the crawler nodes in the cluster access each site at the frequency configured by the user. IP-based limiting constrains the frequency at which crawler nodes on the same machine access a given site, while crawler-type-based limiting constrains the frequency at which crawler nodes of the same type access a given site.

(4) The scheduler, spider, and data pipeline components of the Scrapy framework are custom-developed: the scheduler is extended to support distributed crawling, the spider is extended to support rule-based data extraction, and the data pipeline is extended to support data cleaning, character-encoding conversion, text extraction, and other functions.

(5) Based on the Twisted framework, a crawler manager for asynchronous task response is designed and implemented, through which the user can conveniently control the Scrapy crawlers on each node.
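The dynamic-feedback scheduling of point (1) can be sketched as follows. This is a minimal illustration, not the thesis's actual algorithm: the names `Scheduler`, `heartbeat`, and the load metric (pending requests over capacity) are assumptions made here for clarity.

```python
class Scheduler:
    """Master-side dynamic-feedback scheduler (illustrative sketch):
    each slave node reports its status in a heartbeat, and new URL
    batches are assigned to the least-loaded node."""

    def __init__(self):
        self.load = {}  # node name -> latest load score

    def heartbeat(self, name, pending, capacity):
        # Load score derived from the slave's reported real-time status.
        self.load[name] = pending / capacity

    def remove(self, name):
        # Called when a crawler node leaves the cluster, so that
        # subsequent task batches are redistributed among the rest.
        self.load.pop(name, None)

    def pick_node(self):
        # Assign the next task batch to the least-loaded node.
        return min(self.load, key=self.load.get)


sched = Scheduler()
sched.heartbeat("crawler-1", pending=80, capacity=100)
sched.heartbeat("crawler-2", pending=20, capacity=100)
print(sched.pick_node())  # crawler-2
```

A real master would also weight the score by CPU, memory, and bandwidth reported in the heartbeat, but the least-loaded-first selection shown here is the core of dynamic load balancing.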
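The Bloom-filter deduplication of point (2) works roughly as follows. The sketch below derives its k bit positions by double hashing a single SHA-256 digest, which is a common implementation choice but not necessarily the one used in the thesis; the sizes and class name are likewise illustrative.

```python
import hashlib

class BloomFilter:
    """Space-efficient membership test for already-seen URLs.
    May return false positives, but never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, url):
        # Derive k bit positions from two 64-bit halves of one digest
        # (Kirsch-Mitzenmacher double hashing).
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


bf = BloomFilter()
bf.add("https://example.com/page/1")
print("https://example.com/page/1" in bf)  # True
```

The bit array occupies a fixed 128 KiB here regardless of how many URLs are inserted, and a lookup costs only the k hash computations, which is what makes this strategy attractive for massive URL sets.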
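The per-site speed limit of point (3) can be realized with a token bucket kept per target site; the class and method names below are assumptions for illustration, and in the thesis's cluster setting the bucket state would live in Redis rather than in process memory.

```python
import time

class SiteRateLimiter:
    """Token-bucket limiter keyed by target site: each site gets a
    user-configured requests-per-second budget."""

    def __init__(self):
        self.buckets = {}  # site -> [rate, tokens, last_refill]

    def configure(self, site, rate):
        # Bucket starts full so the first requests go through at once.
        self.buckets[site] = [rate, rate, time.monotonic()]

    def allow(self, site):
        rate, tokens, last = self.buckets[site]
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at rate.
        tokens = min(rate, tokens + (now - last) * rate)
        if tokens >= 1.0:
            self.buckets[site] = [rate, tokens - 1.0, now]
            return True
        self.buckets[site] = [rate, tokens, now]
        return False


limiter = SiteRateLimiter()
limiter.configure("example.com", rate=2.0)  # 2 requests per second
print(limiter.allow("example.com"))  # True
```

Keying the buckets by `(machine IP, site)` gives the IP-based variant described above, and keying by `(crawler type, site)` gives the type-based variant.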