
Research And Design Of A Distributed Web Crawler Based On Hadoop

Posted on: 2015-12-26    Degree: Master    Type: Thesis
Country: China    Candidate: J X Qian    Full Text: PDF
GTID: 2298330467462173    Subject: Computer technology
Abstract/Summary:
The rapid development of information technology, and in particular the growth of the Internet and the mobile Internet in recent years, has profoundly changed the world we live in. The IT industry, together with the integration of IT into traditional industries, has become an important part of the world economy. As more and more things are connected by the network, the amount of information people must deal with has soared, raising a new question: how to find valuable information. For individual users, the answer is the search engine; for those who want to extract the hidden value in huge amounts of data, the answer is data analysis and data mining. In either case, the first step is to collect vast amounts of information from the Internet. The research topic of this thesis is exactly that: web crawler technology for gathering massive amounts of information from the Internet. Since a stand-alone platform can no longer handle the volume of Internet data, this project uses a distributed system as the underlying platform.

The main research work of this thesis covers the following aspects.

Web crawler technology and its background theory. The explosive growth of Internet information gave rise to search engine technology, and the web crawler is an important component of a search engine. This thesis analyzes the basic working principles and the important modules of a search engine, including the construction of the search index and the ranking of search results. On this basis, the thesis analyzes the principles of the web crawler and gives a detailed description of the key technologies involved.

The development of cloud computing and the main structure of the Hadoop distributed platform. The thesis describes the emergence and development of cloud computing and studies its key technical characteristics in detail. The Hadoop platform mainly consists of HDFS, the MapReduce programming model, and the HBase distributed database; the thesis also analyzes the technical details of Hadoop as a distributed development framework.

Design, implementation, deployment, and testing of a distributed web crawler. On the basis of the aforementioned research, we design a web crawler based on the Hadoop distributed platform. We give a detailed analysis of the main functions of the key modules and their realization in the MapReduce programming framework. We also completed the deployment and testing of the web crawler on a small cluster consisting of 20 servers, and the experiments show that the design is feasible. A design built on open-source distributed systems is a worthwhile attempt, and the study of crawlers and distributed systems has considerable reference value.
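The abstract describes the basic principle of a web crawler (fetch a page, extract its links, and feed newly discovered URLs back into the queue) without giving code. As a rough single-machine illustration of that loop only, a minimal Java sketch might look like the following; the class name, seed URL, and page limit are hypothetical and are not taken from the thesis.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Hypothetical single-machine sketch of the basic crawl loop:
    // take a URL from the frontier, fetch the page, extract links,
    // and enqueue unseen links until a page limit is reached.
    public class SimpleCrawler {
        private static final Pattern LINK =
                Pattern.compile("href=[\"'](https?://[^\"'#\\s]+)[\"']");

        public static void main(String[] args) {
            Queue<String> frontier = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            frontier.add("https://example.com/");          // assumed seed URL
            HttpClient client = HttpClient.newHttpClient();

            int fetched = 0;
            while (!frontier.isEmpty() && fetched < 100) {  // arbitrary page limit
                String url = frontier.poll();
                if (!seen.add(url)) continue;               // skip already-visited URLs
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    fetched++;

                    // Extract outgoing links and add unseen ones to the frontier.
                    Matcher m = LINK.matcher(response.body());
                    while (m.find()) {
                        String link = m.group(1);
                        if (!seen.contains(link)) frontier.add(link);
                    }
                    System.out.println("Fetched " + url + " (" + response.body().length() + " bytes)");
                } catch (Exception e) {
                    // Skip URLs that fail to resolve or download.
                }
            }
        }
    }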
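The abstract states that the crawler's key modules are realized within the MapReduce programming framework, but gives no implementation details. One common way to express the fetch step on Hadoop, offered here purely as an assumption and not as the thesis's actual module design, is a map-only job whose mapper reads one URL per input line, downloads the page, and emits (URL, content) pairs for later parsing and indexing stages.

    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical map-only fetch job: each input line is a URL, and the
    // mapper emits (URL, page content) so that downstream jobs can parse
    // links and build the index. Illustrative sketch only.
    public class FetchJob {

        public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
            private final HttpClient client = HttpClient.newHttpClient();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String url = line.toString().trim();
                if (url.isEmpty()) return;
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                    HttpResponse<String> response =
                            client.send(request, HttpResponse.BodyHandlers.ofString());
                    // Key: the URL; value: the downloaded page body.
                    context.write(new Text(url), new Text(response.body()));
                } catch (Exception e) {
                    // Skip URLs that fail to download; a real crawler would record them.
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "fetch");
            job.setJarByClass(FetchJob.class);
            job.setMapperClass(FetchMapper.class);
            job.setNumReduceTasks(0);                      // map-only job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // URL list
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // fetched pages
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In such a layout, the URL frontier would live on HDFS (or in HBase), each MapReduce round fetches the current frontier in parallel across the cluster, and the output of one round feeds the link-extraction and deduplication steps that produce the next round's frontier.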
Keywords/Search Tags: web crawler, cloud computing, distributed system, Hadoop