
Research and Implementation of Key Technologies for a Distributed Universal Web Crawler System

Posted on: 2020-05-07
Degree: Master
Type: Thesis
Country: China
Candidate: R X Han
GTID: 2428330623956767
Subject: Computer technology

Abstract/Summary:
With the rapid development of the Internet, the amount of data on the network has grown exponentially. Search engines are an important means for users to obtain information, and web crawling technology is responsible for supplying data to them. Large-scale data capture requires distributed crawler technology, which improves the performance of the crawler system through an efficient division of labor. Existing distributed crawler frameworks, however, fall short in scalability and usability. The distributed general-purpose crawler system proposed in this paper adopts popular and mature distributed technologies, enabling the system to perform well in all respects. The main work of this paper is as follows:

Firstly, we propose a time-based scheduling algorithm built on historical data (sketched below). The scheduling module is the core of the crawler system, and the quality of its scheduling algorithm directly affects the cost and efficiency of the whole system. To address the inflexible scheduling of seed pages, a regression prediction algorithm over historical data is used to decide when each seed page should be recrawled. With this scheduling module in place, the system achieves good results in cost, hit rate, and latency.

Secondly, we propose a mining algorithm for URL normalization and deduplication (sketched below). Deduplication saves a great deal of storage space for the system and improves retrieval efficiency when the data is used later. Mining rules are proposed for the deduplication module that improve URL normalization and reduce the page repetition rate; mining of mirror sites and invalid URL parameters is also realized, achieving very high precision and a good recall rate.

Finally, we design and implement the distributed general-purpose crawler system itself, determining its physical architecture, logical architecture, and data formats. Kubernetes is used to manage the modules, Kafka and Thrift serve as pipelines between them (sketched below), and system availability is monitored through a log system and a time-series database. Based on performance testing, this paper analyzes the strengths and weaknesses of the system.
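The abstract does not publish the scheduling algorithm's internals; the following minimal Python sketch illustrates one plausible reading, in which the past change intervals of a seed page are fed to an ordinary least-squares regression to predict the next change. The function names (fit_line, next_fetch_time) and the concrete model are illustrative assumptions, not the author's code.

import time

def fit_line(xs, ys):
    # Least-squares fit y = slope * x + intercept; returns (slope, intercept).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs) or 1e-9
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    return slope, my - slope * mx

def next_fetch_time(change_times):
    # Predict when a seed page will change next, given the timestamps
    # (seconds since epoch) at which past changes were observed.
    if len(change_times) < 3:
        return time.time() + 3600          # not enough history: refetch soon
    intervals = [b - a for a, b in zip(change_times, change_times[1:])]
    slope, intercept = fit_line(range(len(intervals)), intervals)
    predicted = max(slope * len(intervals) + intercept, 60.0)  # clamp >= 1 min
    return change_times[-1] + predicted

Scheduling the recrawl at the predicted change time, rather than on a fixed interval, is what lets such a scheduler trade cost against hit rate and delay.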
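The concrete mining rules and invalid-parameter lists are likewise not given in the abstract; the sketch below shows what URL normalization plus fingerprint-based deduplication typically looks like, with IGNORED_PARAMS standing in as an assumed, illustrative set of invalid parameters.

import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"utm_source", "utm_medium", "sessionid", "from"}  # assumed

def normalize(url: str) -> str:
    # Canonicalize a URL: lowercase scheme and host, drop the fragment,
    # remove assumed-invalid parameters, and sort the remaining ones.
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", urlencode(query), ""))

class Deduplicator:
    # Keeps fingerprints of normalized URLs already seen.
    def __init__(self):
        self._seen = set()

    def is_new(self, url: str) -> bool:
        fp = hashlib.sha1(normalize(url).encode()).hexdigest()
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True

A production-scale deduplicator would replace the in-memory set with a Bloom filter or a shared key-value store so that fingerprints can be checked across all crawler nodes.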
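The abstract names Kafka and Thrift as the pipelines between modules but gives no message schema; the sketch below wires a hypothetical fetcher module between two assumed Kafka topics using the kafka-python client. The topic names, broker address, and JSON message format are all assumptions made for illustration.

import json
from kafka import KafkaConsumer, KafkaProducer

# One crawler module: consume URLs from an upstream topic, emit fetched pages.
consumer = KafkaConsumer("urls.to_fetch",                  # assumed topic
                         bootstrap_servers="kafka:9092",
                         value_deserializer=lambda b: json.loads(b))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda d: json.dumps(d).encode())

for msg in consumer:
    url = msg.value["url"]
    # ... fetch the page here; a real fetcher would honor robots.txt ...
    page = {"url": url, "status": 200, "body": "<html>...</html>"}
    producer.send("pages.fetched", value=page)             # assumed topic

Because each module only talks to Kafka, Kubernetes can scale the number of module replicas independently, which is one common reason for choosing this kind of pipeline architecture.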
Keywords/Search Tags: distributed, web crawler, Kubernetes, time-sensitive scheduling, web page deduplication