
Research And Design Of A Distributed Web Crawler Based On Hadoop

Posted on: 2015-12-26    Degree: Master    Type: Thesis
Country: China    Candidate: J X Qian    Full Text: PDF
GTID: 2298330467462173    Subject: Computer technology
Abstract/Summary:
The rapid development of information technology, and in particular the growth of the Internet and the mobile Internet in recent years, has profoundly changed the world we live in. The IT industry, together with the integration of IT into traditional industries, has become an important part of the world economy. As more and more things are connected by the network, the amount of information people must deal with has soared, raising a new question: how to find valuable information. For individual users, the answer is the search engine; for those who want to extract the hidden value in huge amounts of data, the answer is data analysis and data mining. In either case, the first step is to collect vast amounts of information from the Internet. The research topic of this thesis is exactly that: web crawler technology for gathering massive amounts of information from the Internet. Since a stand-alone platform can no longer handle the volume of Internet data, this project uses a distributed system as the underlying platform.

The main research work of this thesis covers the following aspects.

Web crawler technology and its background theory. The explosive growth of Internet information gave rise to search engine technology, and the web crawler is an important component of a search engine. This thesis analyzes the basic working principles and the important modules of a search engine, including the construction of the search index and the ranking of search results. On this basis, the thesis analyzes the principles of the web crawler and gives a detailed description of the key technologies involved.

The development of cloud computing and the main structure of the Hadoop distributed platform. The thesis describes the emergence and development of cloud computing and studies its key technical characteristics in detail. The Hadoop platform mainly consists of HDFS, the MapReduce programming model, and the HBase distributed database; the thesis also analyzes the technical details of Hadoop as a distributed development framework.

Design, implementation, deployment, and testing of a distributed web crawler. On the basis of the aforementioned research, we design a web crawler based on the Hadoop distributed platform. We give a detailed analysis of the main functions of the key modules and their realization in the MapReduce programming framework. We also completed the deployment and testing of the web crawler on a small cluster consisting of 20 servers, and the experiments show that the design is feasible. A design built on open-source distributed systems is a worthwhile attempt, and the study of crawlers and distributed systems has considerable reference value.
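The abstract describes the basic principle of a web crawler (fetch a page, extract its links, and feed newly discovered URLs back into the queue) without giving code. As a rough single-machine illustration of that loop only, a minimal Java sketch might look like the following; the class name, seed URL, and page limit are hypothetical and are not taken from the thesis.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical single-machine sketch of the basic crawl loop:
    // take a URL from the frontier, fetch the page, extract links,
    // and enqueue unseen links until a page limit is reached.
    public class SimpleCrawler {
        private static final Pattern LINK =
                Pattern.compile("href=[\"'](https?://[^\"'#\\s]+)[\"']");

        public static void main(String[] args) {
            Queue<String> frontier = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            frontier.add("https://example.com/");          // assumed seed URL
            HttpClient client = HttpClient.newHttpClient();

            int fetched = 0;
            while (!frontier.isEmpty() && fetched < 100) {  // arbitrary page limit
                String url = frontier.poll();
                if (!seen.add(url)) continue;               // skip already-visited URLs
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    fetched++;

                    // Extract outgoing links and add unseen ones to the frontier.
                    Matcher m = LINK.matcher(response.body());
                    while (m.find()) {
                        String link = m.group(1);
                        if (!seen.contains(link)) frontier.add(link);
                    }
                    System.out.println("Fetched " + url + " (" + response.body().length() + " bytes)");
                } catch (Exception e) {
                    // Skip URLs that fail to resolve or download.
                }
            }
        }
    }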
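The abstract states that the crawler's key modules are realized within the MapReduce programming framework, but gives no implementation details. One common way to express the fetch step on Hadoop, offered here purely as an assumption and not as the thesis's actual module design, is a map-only job whose mapper reads one URL per input line, downloads the page, and emits (URL, content) pairs for later parsing and indexing stages.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical map-only fetch job: each input line is a URL, and the
    // mapper emits (URL, page content) so that downstream jobs can parse
    // links and build the index. Illustrative sketch only.
    public class FetchJob {

        public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
            private final HttpClient client = HttpClient.newHttpClient();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String url = line.toString().trim();
                if (url.isEmpty()) return;
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    // Key: the URL; value: the downloaded page body.
                    context.write(new Text(url), new Text(response.body()));
                } catch (Exception e) {
                    // Skip URLs that fail to download; a real crawler would record them.
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "fetch");
            job.setJarByClass(FetchJob.class);
            job.setMapperClass(FetchMapper.class);
            job.setNumReduceTasks(0);                      // map-only job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // URL list
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // fetched pages
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In such a layout, the URL frontier would live on HDFS (or in HBase), each MapReduce round fetches the current frontier in parallel across the cluster, and the output of one round feeds the link-extraction and deduplication steps that produce the next round's frontier.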
Keywords/Search Tags: web crawler, cloud computing, distributed system, Hadoop