
Research Of Distributed Network Crawler Based On MapReduce Framework

Posted on: 2012-08-07    Degree: Master    Type: Thesis
Country: China    Candidate: H B Li    Full Text: PDF
GTID: 2218330368982089    Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the network has spread to every corner of society, and people's daily behavior and life depend more and more on the information it provides. Search engines offer the public convenient search services and are the most effective tools for making network information resources accessible. One of a search engine's core components is the network crawler, which downloads vast numbers of Internet pages and delivers them to the search engine for processing so that information can be presented to users; crawling has therefore become a core research task. Traditional distributed system frameworks suffer from many problems, so the new distributed MapReduce framework has emerged and gradually begun to attract attention. This paper designs a new distributed network crawler based on the MapReduce framework.

Building on research at home and abroad into distributed network crawler technology, we focus on the following two key techniques. First, in existing distributed network crawlers, link scheduling has many problems in meeting the consistency requirements of task assignment. In this paper we study a dynamic hash tree algorithm that fundamentally solves this problem, and we carry out experiments to demonstrate its efficiency, showing that the new link scheduling algorithm better serves the search engine system. Second, a distributed system must schedule, migrate, and store vast numbers of URLs. Existing distributed systems generally use a multi-level cache model and require compact data structures designed for storage operations. This paper presents a secondary cache model based on an improved trie tree and a file pool, augmented with asynchronous merging and a batch mode, which saves memory while improving the processing speed and efficiency of scheduling information.

Finally, we implement a distributed network crawler system based on the MapReduce framework. Theory and practice both show that applying these key techniques improves system performance so that the crawler can meet the needs of crawling websites on today's Internet.
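As a rough illustration of the consistency requirement in URL scheduling described above, the following Java sketch assigns each URL to a fixed crawler node by hashing its host, so that links to the same site are always handled by the same node. This is only a minimal sketch, not the thesis's dynamic hash tree algorithm; the class name, the host-based hashing policy, and the fixed node count are all assumptions made purely for illustration.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: deterministic URL-to-node assignment for a
// distributed crawler. NOT the thesis's dynamic hash tree; names and
// policy are illustrative assumptions only.
public class UrlPartitioner {
    private final int nodeCount;

    public UrlPartitioner(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    // Assign a URL to a crawler node by hashing its host, so that all
    // links from the same site are always scheduled to the same node.
    public int nodeFor(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            host = null;
        }
        if (host == null) {
            host = url; // fall back to hashing the raw string
        }
        CRC32 crc = new CRC32();
        crc.update(host.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % nodeCount);
    }

    public static void main(String[] args) {
        UrlPartitioner p = new UrlPartitioner(4);
        System.out.println(p.nodeFor("http://example.com/page1")); // same node...
        System.out.println(p.nodeFor("http://example.com/page2")); // ...as this one
        System.out.println(p.nodeFor("http://another-site.org/"));
    }
}

Note that a fixed modulo assignment like this reassigns almost every URL whenever the number of nodes changes, which is precisely the kind of consistency problem that a dynamic hashing structure, such as the dynamic hash tree studied in the thesis, is meant to avoid.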
Keywords/Search Tags: Search Engine, MapReduce Framework, URL Scheduling, Secondary Cache