
Research And Implementation Of A Subject-Oriented Distributed Crawler System

Posted on: 2012-07-21
Degree: Master
Type: Thesis
Country: China
Candidate: J S Gao
Full Text: PDF
GTID: 2298330467978356
Subject: Computer application technology

Abstract/Summary:
The rapid growth of the Internet has generated massive information resources and attracted all kinds of network user groups, and as a result applications in the search engine field have been flourishing. However, a single PC, limited by its processing power and storage capacity, cannot carry out information retrieval in a fast, efficient, and accurate way. Mainframes can solve this problem, but only at a relatively high cost. The demand for researching and developing new, low-cost search engine technology is therefore becoming increasingly urgent. The cloud computing technology that has sprung up in recent years provides a new way to meet this demand. In view of the impact of cloud computing, both academia and industry have already been carrying out related technological research and applications, including work on search engine technology.

Against this background, this thesis studies a subject-oriented crawler system based on the open Hadoop platform, builds the Hadoop environment on a server cluster, and finally implements the crawler system. The thesis starts from the cloud computing technology architecture and examines the two most important distributed file systems. Secondly, Berkeley DB, a key/value database frequently used in the cloud computing field, is studied in depth. Thirdly, the crawler prototype, Heritrix, is analyzed at the code level in preparation for extending it into a subject-oriented crawler. On the basis of this preparatory work, the thesis proposes three subject models: a subject model based on a dictionary, a subject model based on text analysis, and a subject model based on page-structure analysis. Building on these models, the thesis presents the architecture of a subject-oriented crawler system and then designs the architecture of both the master node and the crawler nodes. The key technologies are subsequently studied and applied in depth, and the crawler system is implemented. A large number of laboratory tests indicate that the system achieves its design goals, with good availability and scalability.
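To make the dictionary-based subject model more concrete, the following Java sketch shows one way such a model could be expressed: a weighted keyword dictionary scores the extracted text of a fetched page, and pages scoring below a threshold are treated as off-topic. The class and method names here are illustrative assumptions only, not code from the thesis or from Heritrix itself.

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Illustrative sketch of a dictionary-based subject model: each subject
 * keyword carries a weight, and a page is considered on-topic when the
 * summed weight of matched keywords reaches a threshold.
 * Names are hypothetical, not taken from the thesis or from Heritrix.
 */
public class DictionarySubjectModel {

    private final Map<String, Double> keywordWeights = new HashMap<>();
    private final double threshold;

    public DictionarySubjectModel(double threshold) {
        this.threshold = threshold;
    }

    public void addKeyword(String keyword, double weight) {
        keywordWeights.put(keyword.toLowerCase(Locale.ROOT), weight);
    }

    /** Sum of the weights of dictionary terms found in the page text. */
    public double score(String pageText) {
        double total = 0.0;
        // Crude whitespace/punctuation tokenization; a real crawler would
        // score the HTML-stripped text of the fetched page.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("\\W+")) {
            Double w = keywordWeights.get(token);
            if (w != null) {
                total += w;
            }
        }
        return total;
    }

    /** Decision a subject-oriented crawler could use to keep or drop a page. */
    public boolean isOnTopic(String pageText) {
        return score(pageText) >= threshold;
    }

    public static void main(String[] args) {
        DictionarySubjectModel model = new DictionarySubjectModel(2.0);
        model.addKeyword("hadoop", 1.5);
        model.addKeyword("crawler", 1.0);
        model.addKeyword("distributed", 0.5);

        String sample = "A distributed crawler built on Hadoop fetches pages in parallel.";
        System.out.println("score = " + model.score(sample));       // 3.0
        System.out.println("on topic: " + model.isOnTopic(sample)); // true
    }
}

In a focused crawler of this kind, such a scorer would typically sit in the filtering stage, so that only links extracted from on-topic pages are queued for further crawling.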
Keywords/Search Tags: Cloud computing, key/value database, Hadoop, distributed crawler system, subject model