The Research And Implementation Of Distributed Search Engine Based On Hadoop

Posted on: 2011-06-18  Degree: Master  Type: Thesis
Country: China  Candidate: J Feng  Full Text: PDF
GTID: 2178360305471636  Subject: Software engineering
Abstract/Summary:
A distributed search engine is an information retrieval system that combines distributed computing with full-text retrieval. Search engines have changed how people obtain information and made it more efficient. They now reach into every aspect of the Internet and are known as the first step of navigation.

At present, most search engine systems share a similar centralized structure: all of the system's modules are deployed on one server. That server must therefore be high-performance, and even so the system has poor stability and poor scalability. To cope with these disadvantages, operators have to purchase very large and expensive servers, a cost that not everyone can afford. In addition, many traditional information retrieval systems obtain their results with primitive string matching. Although this method is simple, search efficiency becomes very low when the data volume is huge, and users cannot retrieve useful information in time. These two disadvantages are a major obstacle to the wider adoption of search engines. To address them, distributed computing and full-text retrieval based on inverted indexes were introduced into search engine systems.

This thesis analyzes several distributed search engine systems and summarizes their advantages and disadvantages. To address the existing drawbacks, it proposes a distributed search engine based on Hadoop. The main tasks are to improve the function modules of the traditional search engine, to analyze the steps of crawling, indexing, and searching, and to decompose the parts of these processes that can be executed in any order into two parts: data computing and data combining.
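The split into data computing and data combining can be illustrated with inverted-index construction. The following is a minimal sketch in plain Python, not the thesis's Hadoop code: the function names `map_document`, `reduce_postings`, and `build_inverted_index` are hypothetical, whitespace tokenization is an assumption, and the grouping step stands in for the shuffle that Hadoop performs between the Map and Reduce phases.

```python
from collections import defaultdict
from itertools import chain


def map_document(doc_id, text):
    """Data computing: emit one (term, doc_id) pair per word occurrence."""
    # Assumption: lowercase, whitespace-separated tokens.
    return [(word, doc_id) for word in text.lower().split()]


def reduce_postings(term, doc_ids):
    """Data combining: merge all doc_ids for one term into per-document counts."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    return term, dict(sorted(counts.items()))


def build_inverted_index(documents):
    """Run map, group-by-key, and reduce over a dict of {doc_id: text}."""
    # Map phase: each document is processed independently, in any order.
    pairs = chain.from_iterable(
        map_document(doc_id, text) for doc_id, text in documents.items()
    )
    # Group intermediate pairs by term (the role of the shuffle step).
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    # Reduce phase: each term is combined independently.
    return dict(reduce_postings(term, ids) for term, ids in grouped.items())


if __name__ == "__main__":
    docs = {"d1": "hadoop search engine", "d2": "search search index"}
    for term, postings in sorted(build_inverted_index(docs).items()):
        print(term, postings)
```

Because neither function depends on shared state, the map calls can run on different nodes and the reduce calls can be partitioned by term, which is the property that makes the decomposition suitable for a cluster.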
The data-computing algorithms are then packaged into Map functions and the data-combining algorithms into Reduce functions, following the Map/Reduce programming model. With these changes, the improved search engine system can be deployed on a Hadoop distributed environment built from low-cost PCs, giving the system high response speed, reliability, and scalability. Hadoop was chosen as the system's distributed computing platform because its technology closely matches the needs of a distributed search engine. In addition, the thesis builds a keyword-based inverted indexing module using full-text retrieval based on inverted indexes, and combines the TF-IDF and PageRank algorithms to improve the page scoring strategy and optimize the search results.

Finally, the thesis gives a detailed analysis of how the system modules are implemented with the Map/Reduce programming model, together with the difficulties met during implementation. A small distributed search engine system with four nodes was built. Experimental data were obtained by crawling, indexing, and retrieving over the Internet, and the system's reliability and scalability were tested. Analysis of the experimental data validates the soundness of the Hadoop-based distributed search engine.
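The abstract does not say how TF-IDF and PageRank are combined into a page score. As one possible reading, the sketch below assumes a weighted linear combination. The function names, the weight `alpha`, and the choice of the natural logarithm for IDF are all assumptions made for illustration, not the thesis's actual formula.

```python
import math


def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """Standard TF-IDF: relative term frequency times log inverse document frequency.

    Assumes doc_length >= 1 and 1 <= doc_freq <= num_docs.
    """
    tf = term_count / doc_length
    idf = math.log(num_docs / doc_freq)
    return tf * idf


def page_score(tfidf, pagerank, alpha=0.7):
    """Hypothetical combined score: alpha weights relevance against link authority."""
    return alpha * tfidf + (1 - alpha) * pagerank


def rank_pages(candidates, alpha=0.7):
    """Sort (page_id, tfidf, pagerank) tuples by combined score, best first."""
    return sorted(
        candidates,
        key=lambda c: page_score(c[1], c[2], alpha),
        reverse=True,
    )
```

With `alpha=1.0` the ranking reduces to pure TF-IDF relevance, and with `alpha=0.0` to pure PageRank, so the weight controls how much link structure influences the final order.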
Keywords/Search Tags: Map/Reduce, Hadoop, Distributed computing, Search Engine