
Distributed Web Crawler Technology Based On Hadoop

Posted on: 2012-10-14
Degree: Master
Type: Thesis
Country: China
Candidate: B W Zheng
Full Text: PDF
GTID: 2218330362950478
Subject: Computer Science and Technology
Abstract/Summary:
Today we live in an era of information explosion. With the rapid development of the Internet industry, the amount of information grows exponentially every year, and the demand for access to it keeps rising; these demands drive the development of cloud computing. Against this background, Google, IBM, Apache, Amazon and other large companies have invested substantial resources in cloud development. Apache Hadoop is a user-friendly, open-source cloud computing development platform, and the distributed crawler system presented in this paper is designed and implemented on that framework.

The purpose of this paper is to design and implement a distributed crawler system based on Hadoop that can carry out large-scale data collection. The crawler collects information from mainstream news sites in 27 languages, crawling each site in its entirety, and stores the content of the 27 languages separately to support cross-language processing.

The research work in this paper covers the relevant background of cloud computing, the Hadoop distributed platform, the principles of web crawlers, and a survey of distributed crawler development. It first studies the definition, principles, and architecture of cloud computing; then examines the Hadoop Distributed File System (HDFS) and the Map/Reduce distributed computing model in depth; then describes the principles of crawler systems to clarify the development process; and finally reviews the current state of distributed crawler development.

Building on this technical foundation, the paper presents a design for a Hadoop-based distributed web crawler, covering the basic crawl flow, the overall framework, the functional modules, and each module's Map/Reduce design. Based on this outline design, the paper carries out the detailed design and implements the entire system, including the data storage structures, the overall data structures of the crawler, and the implementation of each functional module. The paper closes with a detailed summary.

The implemented distributed crawler system uses the Map/Reduce computing framework and is consistent with the overall distributed framework of the project. It addresses the low efficiency and poor scalability of single-machine crawlers, increases the speed of information gathering, and expands the scale of information collection. The system also supplies data to the index module and the information processing module of the "distributed cross-language information access and retrieval platform".
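The abstract describes a crawler whose fetch stage is expressed as a Map/Reduce job on Hadoop. The following is a minimal illustrative sketch of that idea using the standard Hadoop MapReduce API; the class names FetchJob and FetchMapper, the timeout values, and the map-only job layout are assumptions for illustration, not the thesis's actual code. Each map task reads its share of a URL list from HDFS, downloads the pages, and writes the fetched content back to HDFS.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Hypothetical fetch step: each map task downloads the pages in its split of the URL list. */
    public class FetchJob {

        public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws java.io.IOException, InterruptedException {
                String url = line.toString().trim();
                if (url.isEmpty()) {
                    return;
                }
                try {
                    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
                    conn.setConnectTimeout(5000);
                    conn.setReadTimeout(10000);
                    StringBuilder page = new StringBuilder();
                    try (BufferedReader in = new BufferedReader(
                            new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                        String s;
                        while ((s = in.readLine()) != null) {
                            page.append(s).append('\n');
                        }
                    }
                    // Emit (URL, page content); a later job would parse links and build the next URL list.
                    context.write(new Text(url), new Text(page.toString()));
                } catch (java.io.IOException e) {
                    // Failed fetches are skipped here; a real system would record them for retry.
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "fetch");
            job.setJarByClass(FetchJob.class);
            job.setMapperClass(FetchMapper.class);
            job.setNumReduceTasks(0);                 // map-only: fetched pages go straight to HDFS
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // URL list on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // fetched pages on HDFS
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

In this sketch the fetch job needs no reduce phase; link extraction and generation of the next round's URL list would be handled by a subsequent Map/Reduce job, matching the round-by-round flow a Hadoop-based crawler typically follows.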
Keywords/Search Tags: distributed crawler system, Hadoop, HDFS, Map/Reduce