
Research And Implementation Of A Subject-Oriented Distributed Crawler System

Posted on: 2012-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: J S Gao
Full Text: PDF
GTID: 2298330467978356
Subject: Computer application technology

Abstract/Summary:
The rapid growth of the Internet has generated massive information resources and attracted all kinds of network user groups, and as a result applications in the search engine field have been flourishing. However, a single PC, limited by its processing power and storage capacity, cannot carry out information retrieval in a fast, efficient, and accurate way. Mainframes can solve this problem, but only at a relatively high cost. The demand for researching and developing new, low-cost search engine technology is therefore becoming increasingly urgent. The cloud computing technology that has sprung up in recent years provides a new way to meet this demand. In view of the impact of cloud computing, both academia and industry have already been carrying out related technological research and applications, including work on search engine technology.

Against this background, this thesis studies a subject-oriented crawler system based on the open Hadoop platform, builds the Hadoop environment on a server cluster, and finally implements the crawler system. The thesis starts from the cloud computing technology architecture and examines the two most important distributed file systems. Secondly, Berkeley DB, a key/value database frequently used in the cloud computing field, is studied in depth. Thirdly, the crawler prototype, Heritrix, is analyzed at the code level in preparation for extending it into a subject-oriented crawler. On the basis of this preparatory work, the thesis proposes three subject models: a subject model based on a dictionary, a subject model based on text analysis, and a subject model based on page-structure analysis. Building on these models, the thesis presents the architecture of a subject-oriented crawler system and then designs the architecture of both the master node and the crawler nodes. The key technologies are subsequently studied and applied in depth, and the crawler system is implemented. A large number of laboratory tests indicate that the system achieves its design goals, with good availability and scalability.
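To make the dictionary-based subject model more concrete, the following Java sketch shows one way such a model could be expressed: a weighted keyword dictionary scores the extracted text of a fetched page, and pages scoring below a threshold are treated as off-topic. The class and method names here are illustrative assumptions only, not code from the thesis or from Heritrix itself.

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Illustrative sketch of a dictionary-based subject model: each subject
 * keyword carries a weight, and a page is considered on-topic when the
 * summed weight of matched keywords reaches a threshold.
 * Names are hypothetical, not taken from the thesis or from Heritrix.
 */
public class DictionarySubjectModel {

    private final Map<String, Double> keywordWeights = new HashMap<>();
    private final double threshold;

    public DictionarySubjectModel(double threshold) {
        this.threshold = threshold;
    }

    public void addKeyword(String keyword, double weight) {
        keywordWeights.put(keyword.toLowerCase(Locale.ROOT), weight);
    }

    /** Sum of the weights of dictionary terms found in the page text. */
    public double score(String pageText) {
        double total = 0.0;
        // Crude whitespace/punctuation tokenization; a real crawler would
        // score the HTML-stripped text of the fetched page.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            Double w = keywordWeights.get(token);
            if (w != null) {
                total += w;
            }
        }
        return total;
    }

    /** Decision a subject-oriented crawler could use to keep or drop a page. */
    public boolean isOnTopic(String pageText) {
        return score(pageText) >= threshold;
    }

    public static void main(String[] args) {
        DictionarySubjectModel model = new DictionarySubjectModel(2.0);
        model.addKeyword("hadoop", 1.5);
        model.addKeyword("crawler", 1.0);
        model.addKeyword("distributed", 0.5);

        String sample = "A distributed crawler built on Hadoop fetches pages in parallel.";
        System.out.println("score = " + model.score(sample));       // 3.0
        System.out.println("on topic: " + model.isOnTopic(sample)); // true
    }
}

In a focused crawler of this kind, such a scorer would typically sit in the filtering stage, so that only links extracted from on-topic pages are queued for further crawling.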
Keywords/Search Tags: Cloud computing, key/value database, Hadoop, distributed crawler system, subject model