
Design And Implementation Of Distributed Network Crawler System

Posted on: 2021-01-29
Degree: Master
Type: Thesis
Country: China
Candidate: W Hong
Full Text: PDF
GTID: 2428330602479273
Subject: Computer technology
Abstract/Summary:
With the rapid development of society and of big data technology, the Internet and the mobile Internet are evolving quickly. People's demand for mobile Internet information in daily life and work keeps growing, and the importance of search engine technology is becoming ever more obvious: mobile Internet information is used in every aspect of society, and modern search engine technology has taken deep root in people's lives, with a growing influence on them. One of the most important technologies in a web search engine is the web crawler. A traditional single-machine crawler, however, can no longer keep up with the amount of data to be captured. To solve this problem, a new generation of crawler was born: the distributed web crawler. Concretely, a distributed cluster system is built across many computers that cooperate efficiently; deploying the crawler on such a cluster improves its data-capturing efficiency so that it can meet the demands of capturing huge volumes of data. A distributed storage system is used as well, which greatly improves the storage capacity of the whole crawler system.

Drawing on the advantages and characteristics of distributed systems, this paper designs and implements a distributed web crawler system in a Hadoop environment. It first introduces what a distributed system is and what a web crawler is, and on that basis implements the distributed crawler system, which is intended to alleviate the slow speed and low efficiency of the traditional crawler. The main contents of this paper are:

(1) The paper first introduces web search engine technology and the key technologies and working principles of the distributed web crawler, then analyzes the overall architecture of the distributed crawler system, and then analyzes in detail the crawler's URL module, web-page capture module, web-page parsing module, and data storage module, each of which is implemented with MapReduce (a minimal sketch of the capture module follows the abstract).

(2) In a traditional web crawler, the web-page capture function is a major factor limiting system efficiency, so this module is studied in depth, and the URL link weight algorithm is carefully analyzed and optimized. Another factor affecting efficiency is URL deduplication; to avoid large amounts of repeated work, the deduplication algorithm for the URL queue is optimized as well (illustrative sketches of both ideas also follow the abstract). Through these two improvements, the paper aims to remedy the slowness and inefficiency of the crawling system and thereby raise its crawling speed and accuracy.

(3) After the system code is completed, a Hadoop distributed cluster is built on the experimental computers, and the relevant environment, nodes, and IP addresses are configured. The crawler's function modules are then tested on these computers, the URL link weight algorithm is tested and its results recorded, and finally the data are collected for comparative analysis.

The main significance of this paper is the design and implementation of a distributed web crawler system that, to a certain extent, solves the slow speed, low efficiency, and poor scalability of the traditional single-machine crawler, and improves the speed and efficiency of crawling information and data from web pages.
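As point (1) notes, each function module is implemented with MapReduce. Below is a minimal sketch of how a web-page capture module can be expressed as a Hadoop mapper; the thesis's actual code is not reproduced here, so the class name FetchMapper, the timeout values, and the plain HttpURLConnection fetch are all assumptions. Each input line carries one URL, which is fetched and emitted as a (URL, HTML) pair for the downstream parsing stage:

```java
// Hypothetical fetch mapper: illustrative only, not the thesis author's code.
// Input: one URL per line; output: (URL, raw HTML) pairs for the parse module.
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String url = line.toString().trim();
        if (url.isEmpty()) {
            return;
        }
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(5_000);  // illustrative timeouts
            conn.setReadTimeout(5_000);
            try (InputStream in = conn.getInputStream()) {
                String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                context.write(new Text(url), new Text(html));
            }
        } catch (IOException e) {
            // A failed fetch is simply skipped here; a real crawler would
            // record the failure and retry later.
        }
    }
}
```

Because fetching is I/O-bound, spreading these map tasks across the cluster's nodes is what gives a distributed crawler its throughput advantage over a single machine.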
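The URL link weight algorithm that the thesis optimizes is not specified in the abstract, so the next sketch only illustrates the general idea of a weighted URL frontier: newly discovered links are scored, and the highest-priority URL is crawled first. The scoring formula (in-link count minus a depth penalty) and all names here are hypothetical:

```java
// Hypothetical weighted URL frontier; the scoring rule is an assumption,
// not the weight algorithm analyzed in the thesis.
import java.util.PriorityQueue;

public class UrlFrontier {
    /** A URL paired with a crawl priority; higher weights are fetched first. */
    record WeightedUrl(String url, double weight) {}

    private final PriorityQueue<WeightedUrl> queue =
            new PriorityQueue<>((a, b) -> Double.compare(b.weight(), a.weight()));

    public void push(String url, int inLinks, int depth) {
        // Illustrative score: favor well-linked pages, penalize deep paths.
        double weight = inLinks - 0.5 * depth;
        queue.add(new WeightedUrl(url, weight));
    }

    /** Returns the highest-weight URL, or null when the frontier is empty. */
    public String next() {
        WeightedUrl top = queue.poll();
        return top == null ? null : top.url();
    }
}
```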
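For the URL queue deduplication mentioned in point (2), a common technique in large-scale crawlers is a Bloom filter, which tests membership in constant time and constant memory. The sketch below uses Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter; this is an assumed approach, not the thesis's optimized algorithm, and the vector size and hash count are illustrative:

```java
// Hypothetical Bloom-filter deduplicator: one assumed way to avoid
// re-queuing URLs that have already been crawled.
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class UrlDeduplicator {
    // 10M-bit vector with 7 hash functions; sizes are illustrative only.
    private final BloomFilter seen = new BloomFilter(10_000_000, 7, Hash.MURMUR_HASH);

    /** Returns true the first time a URL is offered, false for probable repeats. */
    public synchronized boolean offer(String url) {
        Key key = new Key(url.getBytes(StandardCharsets.UTF_8));
        if (seen.membershipTest(key)) {
            return false;  // probably seen before (false positives are possible)
        }
        seen.add(key);     // never yields false negatives
        return true;
    }
}
```

The trade-off is that a Bloom filter occasionally reports a new URL as already seen (a false positive), in exchange for memory that stays fixed no matter how many URLs pass through, which is what makes it practical for a queue at crawler scale.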
Keywords/Search Tags: search index, distributed, web crawler, Hadoop, MapReduce