
Research and Implementation of a Distributed Web Crawler Based on Hadoop

Posted on: 2020-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: X C Liu
Full Text: PDF
GTID: 2428330596979318
Subject: Integrated circuit engineering
Abstract/Summary:
With the rapid spread of the Internet into all aspects of human life, the amount of data on the Internet has grown dramatically. To find the information they want in such large-scale data, users must rely on search engines. The web crawler is the core of a search engine: it supports the engine's operation by crawling hundreds of millions of web pages on the Internet and indexing them. It is therefore of great significance to study efficient and stable web crawler systems.

This thesis uses the Hadoop big data platform to research and design a distributed web crawler system. The main contributions are as follows:

1) A distributed web crawler based on Hadoop is designed, with the HBase database used for data storage. It mainly includes a crawling module that can bypass websites' anti-crawling mechanisms, a parsing module that extracts the URLs linked from each page, a deduplication module that exploits the uniqueness of the HBase rowkey to filter out URLs that have already been crawled (a sketch of this idea is given after this summary), and an HBase storage module that makes it convenient for the crawler system and the PageRank computation to store and read data.

2) The PageRank algorithm is implemented on the MapReduce distributed computing framework, using the computing power of the Hadoop cluster to greatly speed up the calculation of PageRank values (see the second sketch below).

3) An eight-node Hadoop cluster is set up on the laboratory servers, with a Java development environment, Hadoop, ZooKeeper, and HBase installed on each node. The functionality, performance, stability, and scalability of the designed distributed web crawler are then tested in this environment, and finally the computation speed of the PageRank algorithm under MapReduce is measured.

The experimental results show that the distributed web crawler system designed in this thesis improves the efficiency of data collection, runs stably over long periods, and has good scalability. The computation speed of the PageRank algorithm under the MapReduce framework is also greatly improved.
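To make the rowkey-based deduplication in 1) concrete, the following is a minimal sketch of the idea, not the thesis's actual code: because HBase stores exactly one row per rowkey, using the URL itself as the rowkey means a second write of the same URL cannot create a duplicate record. The table name "webpage", column family "p", and class name are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UrlDeduplicator {
    private final Table table;

    public UrlDeduplicator(Connection conn) throws IOException {
        // "webpage" is an assumed table name for illustration.
        this.table = conn.getTable(TableName.valueOf("webpage"));
    }

    /** Returns true if the URL was new and has been recorded, false if already seen. */
    public boolean markIfNew(String url) throws IOException {
        byte[] rowKey = Bytes.toBytes(url); // one row per URL: rowkeys are unique
        if (table.exists(new Get(rowKey))) {
            return false; // URL already crawled, skip it
        }
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("seen"), Bytes.toBytes("1"));
        table.put(put);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            UrlDeduplicator dedup = new UrlDeduplicator(conn);
            System.out.println(dedup.markIfNew("http://example.com/")); // true first time
            System.out.println(dedup.markIfNew("http://example.com/")); // false afterwards
        }
    }
}

Note that the exists-then-put pair above is not atomic; a production crawler with many concurrent workers would likely use an atomic check-and-put on the same row instead.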
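For 2), the following is a minimal sketch of one PageRank iteration as a MapReduce job, under assumptions that are not taken from the thesis: each input line is "pageId TAB rank TAB outLink1,outLink2,...", the damping factor is 0.85, and all class names are hypothetical. The mapper distributes each page's rank across its out-links; the reducer sums the incoming contributions and applies the standard update PR(p) = (1 - d) + d * sum.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIteration {

    public static class RankMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            String page = parts[0];
            double rank = Double.parseDouble(parts[1]);
            String[] links = parts.length > 2 ? parts[2].split(",") : new String[0];
            // Re-emit the link structure so the reducer can rebuild the graph.
            ctx.write(new Text(page), new Text("LINKS\t" + (parts.length > 2 ? parts[2] : "")));
            // Distribute this page's rank evenly across its out-links.
            for (String link : links) {
                ctx.write(new Text(link), new Text(Double.toString(rank / links.length)));
            }
        }
    }

    public static class RankReducer extends Reducer<Text, Text, Text, Text> {
        private static final double DAMPING = 0.85;
        @Override
        protected void reduce(Text page, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String links = "";
            double sum = 0.0;
            for (Text v : values) {
                String s = v.toString();
                if (s.startsWith("LINKS\t")) {
                    links = s.substring(6); // recover the out-link list
                } else {
                    sum += Double.parseDouble(s); // rank contribution from one in-link
                }
            }
            // Standard PageRank update with damping factor d:
            // PR(p) = (1 - d) + d * sum of contributions.
            double newRank = (1.0 - DAMPING) + DAMPING * sum;
            ctx.write(page, new Text(newRank + "\t" + links));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pagerank-iteration");
        job.setJarByClass(PageRankIteration.class);
        job.setMapperClass(RankMapper.class);
        job.setReducerClass(RankReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

In this formulation the job is run repeatedly, feeding each iteration's output in as the next iteration's input, until the ranks converge; the cluster parallelism the thesis reports comes from Hadoop splitting the page set across mappers and reducers.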
Keywords/Search Tags: Distributed web crawler, PageRank algorithm, Hadoop, MapReduce