The Research And Implementation Of Distributed Search Engine Based On Mapreduce

Posted on: 2013-03-29  Degree: Master  Type: Thesis
Country: China  Candidate: C Zhang  Full Text: PDF
GTID: 2248330371490212  Subject: Computer system architecture
Abstract/Summary:
The search engine is an indispensable tool in today's era of rapid information growth. More and more scholars regard the search engine as an integrated platform that combines information retrieval, internet services, user behavior analysis, and high-performance distributed computing. The core of search engine technology is how to obtain data from the internet, which is a huge data repository, process that data effectively, and return accurate information to the user. To protect trade secrets, existing commercial search engines keep their core technology strictly confidential, which makes search engines harder to study.
On the basis of a detailed analysis of the processes and principles of search engines and of the MapReduce programming framework, we build a distributed search engine system that combines Lucene full-text indexing with other open source toolkits. With this system, we can not only extend search technology but also improve the stability of existing search engines. The main research work includes the following aspects:
Firstly, we introduce the principles of general search engines and information processing, and describe the architecture of distributed computing systems. Then, through a detailed analysis of the HDFS file system and the Hadoop platform, which includes the MapReduce programming model, we propose the architecture of a distributed search engine.
Secondly, we analyze the principle of the Web crawler system and its distributed implementation. We study the construction of the full-text index structure, the Chinese word segmentation algorithm, multi-format document handling, and the page scoring algorithm. Based on the distributed transformation of the page scoring algorithm, we determine the module division of the distributed search engine system and the function of each module, and design its structure in detail.
Thirdly, on the basis of the analysis and design above, we implement each sub-module in distributed form. By testing the functions of the system on the laboratory cluster, we verify the feasibility of our system.
Finally, we summarize our work and discuss possible directions for future research.
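The abstract does not give the details of how the page scoring algorithm was adapted to MapReduce. As a rough illustration only, the following sketch shows how one PageRank iteration is commonly expressed as a map step and a reduce step, simulated in memory rather than on Hadoop. The function names, the damping factor of 0.85, and the three-page graph are assumptions made for this example and are not taken from the thesis. The sketch also assumes that every page appears as a key in the graph and has at least one out-link.

```python
from collections import defaultdict

DAMPING = 0.85  # conventional damping factor; an assumption, not from the thesis


def map_page(page, rank, out_links):
    """Map step: re-emit the link list and send an equal rank share to each out-link."""
    yield page, ("links", out_links)
    for target in out_links:
        yield target, ("rank", rank / len(out_links))


def reduce_page(page, values, num_pages):
    """Reduce step: sum the incoming rank shares for one page and apply damping."""
    out_links = []
    incoming = 0.0
    for kind, value in values:
        if kind == "links":
            out_links = value
        else:
            incoming += value
    new_rank = (1.0 - DAMPING) / num_pages + DAMPING * incoming
    return page, new_rank, out_links


def pagerank_iteration(graph, ranks):
    """Simulate one MapReduce round in memory: map, group by key, reduce."""
    shuffled = defaultdict(list)
    for page, out_links in graph.items():
        for key, value in map_page(page, ranks[page], out_links):
            shuffled[key].append(value)
    new_ranks = {}
    for page, values in shuffled.items():
        _, rank, _ = reduce_page(page, values, len(graph))
        new_ranks[page] = rank
    return new_ranks


# Hypothetical three-page link graph used only for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 / len(graph) for page in graph}
for _ in range(20):
    ranks = pagerank_iteration(graph, ranks)
```

In a real Hadoop job the grouping done here by `shuffled` would be performed by the framework's shuffle phase, and each iteration would read the previous iteration's output from HDFS. Pages without out-links would need extra handling that this sketch omits.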
Keywords/Search Tags: Search Engine, Distributed, MapReduce, PageRank, Lucene