
Research Of Distributed Network Crawler Based On MapReduce Framework

Posted on: 2012-08-07    Degree: Master    Type: Thesis
Country: China    Candidate: H B Li    Full Text: PDF
GTID: 2218330368982089    Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the network has spread to every corner of society, and people's daily behavior and life depend more and more on the information it provides. Search engines offer the public convenient search services and are the most effective tools for making network information resources accessible. One of a search engine's core components is the network crawler, which downloads vast numbers of Internet pages and delivers them to the search engine for processing so that information can be presented to users; crawling has therefore become a core research task. Traditional distributed system frameworks suffer from many problems, so the new distributed MapReduce framework has emerged and gradually begun to attract attention. This paper designs a new distributed network crawler based on the MapReduce framework.

Building on research at home and abroad into distributed network crawler technology, we focus on the following two key techniques. First, in existing distributed network crawlers, link scheduling has many problems in meeting the consistency requirements of task assignment. In this paper we study a dynamic hash tree algorithm that fundamentally solves this problem, and we carry out experiments to demonstrate its efficiency, showing that the new link scheduling algorithm better serves the search engine system. Second, a distributed system must schedule, migrate, and store vast numbers of URLs. Existing distributed systems generally use a multi-level cache model and require compact data structures designed for storage operations. This paper presents a secondary cache model based on an improved trie tree and a file pool, augmented with asynchronous merging and a batch mode, which saves memory while improving the processing speed and efficiency of scheduling information.

Finally, we implement a distributed network crawler system based on the MapReduce framework. Theory and practice both show that applying these key techniques improves system performance so that the crawler can meet the needs of crawling websites on today's Internet.
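As a rough illustration of the consistency requirement in URL scheduling described above, the following Java sketch assigns each URL to a fixed crawler node by hashing its host, so that links to the same site are always handled by the same node. This is only a minimal sketch, not the thesis's dynamic hash tree algorithm; the class name, the host-based hashing policy, and the fixed node count are all assumptions made purely for illustration.

import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: deterministic URL-to-node assignment for a
// distributed crawler. NOT the thesis's dynamic hash tree; names and
// policy are illustrative assumptions only.
public class UrlPartitioner {
    private final int nodeCount;

    public UrlPartitioner(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    // Assign a URL to a crawler node by hashing its host, so that all
    // links from the same site are always scheduled to the same node.
    public int nodeFor(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            host = null;
        }
        if (host == null) {
            host = url; // fall back to hashing the raw string
        }
        CRC32 crc = new CRC32();
        crc.update(host.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % nodeCount);
    }

    public static void main(String[] args) {
        UrlPartitioner p = new UrlPartitioner(4);
        System.out.println(p.nodeFor("http://example.com/page1")); // same node...
        System.out.println(p.nodeFor("http://example.com/page2")); // ...as this one
        System.out.println(p.nodeFor("http://another-site.org/"));
    }
}

Note that a fixed modulo assignment like this reassigns almost every URL whenever the number of nodes changes, which is precisely the kind of consistency problem that a dynamic hashing structure, such as the dynamic hash tree studied in the thesis, is meant to avoid.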
Keywords/Search Tags: Search Engine, MapReduce Framework, URL Scheduling, Secondary Cache