Research On Distributed And Focused Web Crawler Technology And Algorithms

Posted on: 2019-01-07 | Degree: Master | Type: Thesis
Country: China | Candidate: Y L Dong | Full Text: PDF
GTID: 2438330545990750 | Subject: Computer technology
Abstract/Summary:
With the continuous development of information technology, people have gradually come to realize that quickly obtaining the latest business information from large, noisy networks is very important for enterprises seeking an advantage in business competition. Collecting this information manually is clearly unrealistic, which is why web crawler technology came into being. Crawler technology has since developed into parallel, distributed, focused web crawler clusters. At the same time, crawler architectures have become more and more complex, and various scheduling, load-balancing, and bottleneck problems have followed. To address the problems of processing efficiency, scalability, task allocation, and load balance in existing distributed web crawler methods, this paper proposes a distributed web crawler method based on active task acquisition, in which a sub-control module is added to each sub-node to evaluate the node's load and operating status and to request a task queue from the central control node. Based on this method and a dynamic dual-directional priority task allocation algorithm, a distributed web crawler model is designed that features load balancing, hierarchical task allocation, and intelligent identification and safe exit of abnormal nodes. Practical tests show that the active task acquisition method can be used to build large-scale distributed crawler clusters effectively.
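The active acquisition idea described in the abstract, where each sub-node judges its own load and then applies to the central control node for tasks, can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: the class names `CentralController` and `SubNode`, the use of spare local-queue capacity as the load measure, and the plain priority heap are hypothetical stand-ins, and the heap is not the thesis's dynamic dual-directional priority algorithm, whose details the abstract does not give.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                      # assumption: lower value is served first
    url: str = field(compare=False)    # URL is excluded from ordering


class CentralController:
    """Holds the global task queue and hands out tasks only on request."""

    def __init__(self):
        self._heap = []

    def submit(self, url, priority):
        heapq.heappush(self._heap, Task(priority, url))

    def request_tasks(self, count):
        """Return up to `count` of the highest-priority pending tasks."""
        batch = []
        while self._heap and len(batch) < count:
            batch.append(heapq.heappop(self._heap))
        return batch

    def pending(self):
        return len(self._heap)


class SubNode:
    """Crawler node whose sub-control step sizes its own task requests."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity       # max tasks the node holds at once
        self.local_queue = []

    def free_slots(self):
        # Stand-in load evaluation: spare capacity in the local queue.
        return max(0, self.capacity - len(self.local_queue))

    def pull(self, controller):
        """Actively apply for as many tasks as the node has room for."""
        slots = self.free_slots()
        if slots == 0:
            return 0                   # fully loaded: do not ask for more
        batch = controller.request_tasks(slots)
        self.local_queue.extend(batch)
        return len(batch)

    def work(self, n):
        """Pretend to crawl up to n tasks and return their URLs."""
        done, self.local_queue = self.local_queue[:n], self.local_queue[n:]
        return [task.url for task in done]
```

Because nodes pull rather than receive pushed work, a node with more spare capacity naturally takes a larger share, and a stalled node simply stops asking. A real deployment would replace the in-memory heap with a networked queue and a richer load metric.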
Keywords/Search Tags: Active obtain, Distributed System, Load balancing, Crawler framework, Multi-process, Dynamic priority