The Research And Implementation Of Distributed Search Engine Based On Hadoop

Posted on:2011-06-18

Degree:Master

Type:Thesis

Country:China

Candidate:J Feng

Full Text:PDF

GTID:2178360305471636

Subject:Software engineering

Abstract/Summary:

A distributed search engine is a new kind of information retrieval system that combines distributed computing with full-text retrieval. Search engines have changed how people obtain information and made it more efficient; they now reach into every aspect of the Internet and are known as the first step of navigation.

At present, most search engine systems share a similar, centralized structure: all of the system's modules are deployed on one server. That server must therefore be high-performance, yet the system still has poor stability and poor scalability. To compensate, operators have to purchase very large and expensive servers, a cost that not everyone can afford. In addition, many traditional information retrieval systems obtain results by primitive string matching. The method is simple, but search efficiency becomes very low when the data volume is huge, and users cannot retrieve useful information in time. These two disadvantages are a major obstacle to the wider adoption of search engines. To address them, distributed computing and full-text retrieval based on an inverted index are introduced into the search engine system.

This paper analyzes several distributed search engine systems and summarizes their advantages and disadvantages. To address the existing drawbacks, it proposes a distributed search engine based on Hadoop. The main tasks are to improve the function modules of a traditional search engine, to analyze the steps of crawling, indexing, and searching, and to decompose those processes that can be executed in any order into two parts: data computing and data combining. Following the Map/Reduce programming model, the data-computing algorithm is packaged into the Map function and the data-combining algorithm into the Reduce function. The improved search engine can then be deployed on a Hadoop distributed environment built from low-cost PCs, giving the system high response speed, reliability, and scalability. Hadoop is used as the distributed computing platform because its technology closely matches the needs of a distributed search engine. The paper also builds a keyword-based inverted indexing module using inverted-index full-text retrieval, and combines the TF-IDF and PageRank algorithms to improve the page scoring strategy and optimize the search results.

Finally, the paper gives a detailed analysis of how the Map/Reduce programming model is used to implement the system modules, together with the difficulties met during implementation. A small distributed search engine with four nodes was built; experimental data was obtained by crawling, indexing, and retrieving over the Internet, and the system's reliability and scalability were tested. Analysis of the experimental data validates the rationality of the Hadoop-based distributed search engine.
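
The split into data computing (Map) and data combining (Reduce) can be illustrated for the indexing step with a minimal in-memory sketch. It is not the thesis's implementation and does not use the Hadoop API: the function names, the sample documents, the PageRank values, and the weighted-sum score with parameter `alpha` are all assumptions made for illustration, since the abstract does not say how TF-IDF and PageRank are combined.

```python
import math
from collections import defaultdict


def map_document(doc_id, text):
    """Map step (data computing): emit (term, (doc_id, 1)) per word occurrence."""
    for word in text.lower().split():
        yield word, (doc_id, 1)


def reduce_postings(term, values):
    """Reduce step (data combining): merge all pairs for one term into postings."""
    postings = defaultdict(int)
    for doc_id, count in values:
        postings[doc_id] += count
    return term, dict(postings)


def build_inverted_index(docs):
    """Stand-in for the shuffle phase: group map output by term, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in map_document(doc_id, text):
            grouped[term].append(value)
    return dict(reduce_postings(term, values) for term, values in grouped.items())


def score(term, doc_id, index, doc_count, page_rank, alpha=0.5):
    """Hypothetical blend: alpha * TF-IDF + (1 - alpha) * PageRank."""
    postings = index.get(term, {})
    tf = postings.get(doc_id, 0)
    if tf == 0:
        return 0.0  # the document does not contain the term
    idf = math.log(doc_count / len(postings))
    return alpha * tf * idf + (1 - alpha) * page_rank.get(doc_id, 0.0)


# Made-up sample data, for illustration only.
docs = {
    "d1": "hadoop stores data on many nodes",
    "d2": "hadoop runs map and reduce tasks",
    "d3": "search engines rank pages",
}
page_rank = {"d1": 0.2, "d2": 0.5, "d3": 0.3}  # assumed, not computed here

index = build_inverted_index(docs)
ranked = sorted(
    index["hadoop"],
    key=lambda d: score("hadoop", d, index, len(docs), page_rank),
    reverse=True,
)
```

Because each call to `map_document` depends only on its own document, the map calls can run on different nodes in any order; only the grouping by term requires bringing the results together, which is the part a framework such as Hadoop would handle.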

Keywords/Search Tags:

Map/Reduce, Hadoop, Distributed computing, Search Engine

Related items

1	The Research And Implementation Of Distributed Search Engine Based On Hadoop
2	Design And Implementation On Distributed Product Search Engine Based On Hadoop
3	Research And Implementation, Based On A Distributed Search Engine Framework
4	The Study Of The Framework Of Distributed Intelligent Search Engine Based On Map/Reduce
5	Cloud Computing Model Research Based On The Search Box/Resource Pool
6	Research And Implementation Of Distributed Search Engine Based On Hadoop
7	The Research And Application Of Search Engine Based On Hadoop
8	The Technology Of Distributed Intelligent Search Engine
9	Research On Key Technologies Of Search Engine Based On Hadoop
10	Research And Implementation Of A Distributed Web Services Search Engine Based On Map/Reduce