Design And Implementation Of A Distributed Web Crawler System Based On Hadoop

Posted on: 2017-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: C Xue
Full Text: PDF
GTID: 2348330485487870
Subject: Software engineering
Abstract/Summary:
With the rapid development of the Internet, the fast growth of big data and cloud computing, and the continual increase in Internet resources, search engines play an important role in information retrieval. People have come to depend on search engines in daily life, because a search engine can quickly and accurately provide the information they need. As a key component of the search engine, the web crawler has a direct impact on search engine performance. This thesis designs and implements a distributed web crawler system based on the Hadoop platform.

This thesis introduces the Hadoop distributed platform, the HBase database, the Storm real-time processing platform, and the basic principles of web crawlers. Taking into account users' actual needs and the overall goal of a web crawler, it designs a distributed web crawler system based on the Hadoop platform. The function of each module is implemented on the MapReduce computing framework. The fetching and parsing modules record their fetching and parsing state in a Kafka message queue, and KPIs are counted in real time on the Storm platform. Finally, a Hadoop distributed platform and a Storm real-time processing platform are built to test the system.

The web crawler system developed in this thesis has the following features. It is a parallel system, since it is based on the MapReduce computing framework. It stores the fetched data in the HBase distributed database, so the data is evenly distributed across the nodes and read-write speed is greatly increased. It uses the Storm platform to count KPIs in real time. Its performance is significantly improved relative to a single-node web crawler, and system scalability is also enhanced.
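The data flow described in the abstract (a MapReduce-style fetch step, status records for each fetch, and KPI counts over those records) can be illustrated with a small in-memory sketch. This is not the thesis's implementation. It assumes a partitioning of URLs by host, which the abstract does not state, and it stands in a plain Python list for the Kafka queue and a `Counter` for the Storm KPI count. All function names, including `fake_fetch`, are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def map_by_host(urls):
    """Map phase: emit (host, url) pairs. Keying by host is an assumption."""
    for url in urls:
        yield urlparse(url).netloc, url


def shuffle(pairs):
    """Group mapped pairs by key, as a MapReduce shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fetch(host, urls, fetch):
    """Reduce phase: fetch each URL and emit one status event per URL."""
    for url in urls:
        try:
            body = fetch(url)
            yield {"host": host, "url": url, "status": "fetched", "bytes": len(body)}
        except Exception:
            yield {"host": host, "url": url, "status": "failed", "bytes": 0}


def count_kpi(events):
    """Count events per status; a stand-in for the real-time KPI count."""
    return Counter(event["status"] for event in events)


def fake_fetch(url):
    """Hypothetical fetcher that avoids network access in this sketch."""
    if "missing" in url:
        raise IOError("not found")
    return "<html>" + url + "</html>"


def run_round(urls, fetch=fake_fetch):
    """Run one crawl round and return (status events, KPI counts)."""
    events = []  # stand-in for the message queue
    for host, group in shuffle(map_by_host(urls)).items():
        events.extend(reduce_fetch(host, group, fetch))
    return events, count_kpi(events)
```

Calling `run_round` with a list of seed URLs returns the per-URL status events and the aggregated counts. In a real deployment the events would be published to a queue and counted by a streaming job instead of being held in a list.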
Keywords/Search Tags: Search Engine, Web Crawler, Hadoop, HBase, Storm