
Research and Implementation of a Distributed Web Crawler Based on Hadoop

Posted on: 2020-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: X C Liu
Full Text: PDF
GTID: 2428330596979318
Subject: Integrated circuit engineering
Abstract/Summary:
With the rapid spread of the Internet into all aspects of human life, the amount of data on the Internet has grown dramatically. To find the information they want in such large-scale data, users must rely on search engines. The web crawler is the core of a search engine: it supports the engine's operation by crawling hundreds of millions of web pages on the Internet and indexing them. It is therefore of great significance to study efficient and stable web crawler systems.

This thesis uses the Hadoop big data platform to research and design a distributed web crawler system. The main contributions are as follows:

1) A distributed web crawler based on Hadoop is designed, with the HBase database used for data storage. It mainly includes a crawling module that can bypass websites' anti-crawling mechanisms, a parsing module that extracts the URLs linked from each page, a deduplication module that exploits the uniqueness of the HBase rowkey to filter out URLs that have already been crawled (a sketch of this idea is given after this summary), and an HBase storage module that makes it convenient for the crawler system and the PageRank computation to store and read data.

2) The PageRank algorithm is implemented on the MapReduce distributed computing framework, using the computing power of the Hadoop cluster to greatly speed up the calculation of PageRank values (see the second sketch below).

3) An eight-node Hadoop cluster is set up on the laboratory servers, with a Java development environment, Hadoop, ZooKeeper, and HBase installed on each node. The functionality, performance, stability, and scalability of the designed distributed web crawler are then tested in this environment, and finally the computation speed of the PageRank algorithm under MapReduce is measured.

The experimental results show that the distributed web crawler system designed in this thesis improves the efficiency of data collection, runs stably over long periods, and has good scalability. The computation speed of the PageRank algorithm under the MapReduce framework is also greatly improved.
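To make the rowkey-based deduplication in 1) concrete, the following is a minimal sketch of the idea, not the thesis's actual code: because HBase stores exactly one row per rowkey, using the URL itself as the rowkey means a second write of the same URL cannot create a duplicate record. The table name "webpage", column family "p", and class name are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlDeduplicator {
    private final Table table;

    public UrlDeduplicator(Connection conn) throws IOException {
        // "webpage" is an assumed table name for illustration.
        this.table = conn.getTable(TableName.valueOf("webpage"));
    }

    /** Returns true if the URL was new and has been recorded, false if already seen. */
    public boolean markIfNew(String url) throws IOException {
        byte[] rowKey = Bytes.toBytes(url); // one row per URL: rowkeys are unique
        if (table.exists(new Get(rowKey))) {
            return false; // URL already crawled, skip it
        }
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("seen"), Bytes.toBytes("1"));
        table.put(put);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            UrlDeduplicator dedup = new UrlDeduplicator(conn);
            System.out.println(dedup.markIfNew("http://example.com/")); // true first time
            System.out.println(dedup.markIfNew("http://example.com/")); // false afterwards
        }
    }
}

Note that the exists-then-put pair above is not atomic; a production crawler with many concurrent workers would likely use an atomic check-and-put on the same row instead.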
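For 2), the following is a minimal sketch of one PageRank iteration as a MapReduce job, under assumptions that are not taken from the thesis: each input line is "pageId TAB rank TAB outLink1,outLink2,...", the damping factor is 0.85, and all class names are hypothetical. The mapper distributes each page's rank across its out-links; the reducer sums the incoming contributions and applies the standard update PR(p) = (1 - d) + d * sum.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIteration {

    public static class RankMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String page = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] links = parts.length > 2 ? parts[2].split(",") : new String[0];
            // Re-emit the link structure so the reducer can rebuild the graph.
            ctx.write(new Text(page), new Text("LINKS\t" + (parts.length > 2 ? parts[2] : "")));
            // Distribute this page's rank evenly across its out-links.
            for (String link : links) {
                ctx.write(new Text(link), new Text(Double.toString(rank / links.length)));
            }
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        private static final double DAMPING = 0.85;
        @Override
        protected void reduce(Text page, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String links = "";
            double sum = 0.0;
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("LINKS\t")) {
                    links = s.substring(6); // recover the out-link list
                } else {
                    sum += Double.parseDouble(s); // rank contribution from one in-link
                }
            }
            // Standard PageRank update with damping factor d:
            // PR(p) = (1 - d) + d * sum of contributions.
            double newRank = (1.0 - DAMPING) + DAMPING * sum;
            ctx.write(page, new Text(newRank + "\t" + links));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pagerank-iteration");
        job.setJarByClass(PageRankIteration.class);
        job.setMapperClass(RankMapper.class);
        job.setReducerClass(RankReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In this formulation the job is run repeatedly, feeding each iteration's output in as the next iteration's input, until the ranks converge; the cluster parallelism the thesis reports comes from Hadoop splitting the page set across mappers and reducers.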
Keywords/Search Tags: Distributed web crawler, PageRank algorithm, Hadoop, MapReduce