
Research and Implementation of Key Technologies for a Distributed Universal Web Crawler System

Posted on: 2020-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: R X Han
GTID: 2428330623956767
Subject: Computer technology

Abstract/Summary:
With the rapid development of the Internet, the amount of data on the network has grown exponentially. Search engines are an important means for users to obtain information, and web crawling technology is responsible for supplying data to them. Large-scale data capture requires distributed crawler technology, which improves the performance of the crawler system through an efficient division of labor. Existing distributed crawler frameworks, however, fall short in scalability and usability. The distributed general-purpose crawler system proposed in this paper adopts popular and mature distributed technologies, enabling the system to perform well in all respects. The main work of this paper is as follows:

Firstly, we propose a time-based scheduling algorithm built on historical data (sketched below). The scheduling module is the core of the crawler system, and the quality of its scheduling algorithm directly affects the cost and efficiency of the whole system. To address the inflexible scheduling of seed pages, a regression prediction algorithm over historical data is used to decide when each seed page should be recrawled. With this scheduling module in place, the system achieves good results in cost, hit rate, and latency.

Secondly, we propose a mining algorithm for URL normalization and deduplication (sketched below). Deduplication saves a great deal of storage space for the system and improves retrieval efficiency when the data is used later. Mining rules are proposed for the deduplication module that improve URL normalization and reduce the page repetition rate; mining of mirror sites and invalid URL parameters is also realized, achieving very high precision and a good recall rate.

Finally, we design and implement the distributed general-purpose crawler system itself, determining its physical architecture, logical architecture, and data formats. Kubernetes is used to manage the modules, Kafka and Thrift serve as pipelines between them (sketched below), and system availability is monitored through a log system and a time-series database. Based on performance testing, this paper analyzes the strengths and weaknesses of the system.
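The abstract does not publish the scheduling algorithm's internals; the following minimal Python sketch illustrates one plausible reading, in which the past change intervals of a seed page are fed to an ordinary least-squares regression to predict the next change. The function names (fit_line, next_fetch_time) and the concrete model are illustrative assumptions, not the author's code.

import time

def fit_line(xs, ys):
    # Least-squares fit y = slope * x + intercept; returns (slope, intercept).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs) or 1e-9
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    return slope, my - slope * mx

def next_fetch_time(change_times):
    # Predict when a seed page will change next, given the timestamps
    # (seconds since epoch) at which past changes were observed.
    if len(change_times) < 3:
        return time.time() + 3600          # not enough history: refetch soon
    intervals = [b - a for a, b in zip(change_times, change_times[1:])]
    slope, intercept = fit_line(range(len(intervals)), intervals)
    predicted = max(slope * len(intervals) + intercept, 60.0)  # clamp >= 1 min
    return change_times[-1] + predicted

Scheduling the recrawl at the predicted change time, rather than on a fixed interval, is what lets such a scheduler trade cost against hit rate and delay.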
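The concrete mining rules and invalid-parameter lists are likewise not given in the abstract; the sketch below shows what URL normalization plus fingerprint-based deduplication typically looks like, with IGNORED_PARAMS standing in as an assumed, illustrative set of invalid parameters.

import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"utm_source", "utm_medium", "sessionid", "from"}  # assumed

def normalize(url: str) -> str:
    # Canonicalize a URL: lowercase scheme and host, drop the fragment,
    # remove assumed-invalid parameters, and sort the remaining ones.
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

class Deduplicator:
    # Keeps fingerprints of normalized URLs already seen.
    def __init__(self):
        self._seen = set()

    def is_new(self, url: str) -> bool:
        fp = hashlib.sha1(normalize(url).encode()).hexdigest()
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

A production-scale deduplicator would replace the in-memory set with a Bloom filter or a shared key-value store so that fingerprints can be checked across all crawler nodes.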
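The abstract names Kafka and Thrift as the pipelines between modules but gives no message schema; the sketch below wires a hypothetical fetcher module between two assumed Kafka topics using the kafka-python client. The topic names, broker address, and JSON message format are all assumptions made for illustration.

import json
from kafka import KafkaConsumer, KafkaProducer

# One crawler module: consume URLs from an upstream topic, emit fetched pages.
consumer = KafkaConsumer("urls.to_fetch",                  # assumed topic
                         bootstrap_servers="kafka:9092",
                         value_deserializer=lambda b: json.loads(b))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda d: json.dumps(d).encode())

for msg in consumer:
    url = msg.value["url"]
    # ... fetch the page here; a real fetcher would honor robots.txt ...
    page = {"url": url, "status": 200, "body": "<html>...</html>"}
    producer.send("pages.fetched", value=page)             # assumed topic

Because each module only talks to Kafka, Kubernetes can scale the number of module replicas independently, which is one common reason for choosing this kind of pipeline architecture.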
Keywords/Search Tags: distributed, web crawler, Kubernetes, time-sensitive scheduling, web page deduplication