The Research And Application Of Search Engine Based On Hadoop

Posted on: 2014-10-31
Degree: Master
Type: Thesis
Country: China
Candidate: C X Fan
Full Text: PDF
GTID: 2268330401988399
Subject: Electronic and communication engineering
Abstract/Summary:
With the large-scale popularization of network information technology, users' demands on information search have become increasingly stringent. A fast search service with high accuracy and comprehensive coverage can bring an institution relatively high customer satisfaction and commercial benefit. Because of limited technical and financial resources, most small and medium-sized institutions cannot build a proprietary, effective search engine system tailored to their users' demands as large institutions do. It is also difficult for them to customize a search system further to fit their own requirements. Therefore, making effective use of the technologies already available in large search engines to serve more institutions, especially those that hold sizeable data sets but have limited budgets and weak core development capabilities, is both a research focus and a difficulty in the search field. Among these institutions, small and medium-sized enterprises, colleges and universities, and scientific research institutions are the key targets.

Combining practical application requirements with a study of the principles, techniques and algorithms of Hadoop-based distributed search engines, this thesis integrated the probabilistic model BM25 into Lucene for ranking content and used the Paoding tokenizer for Chinese word segmentation. Based on a thorough analysis of the distributed computing framework MapReduce and the distributed file system HDFS, it presented a specific design scheme built on the MapReduce programming model, completed the Hadoop-based architecture design and defined the system's functional partitioning. It also presented the detailed analysis and design of the Crawling Subsystem, the Indexing Subsystem and the Searching Subsystem. Finally, the thesis completed the improvement and implementation of the search engine system.

The thesis first analyzed, evaluated and summarized the demands of small and medium-sized institutions that aim to build an effective information retrieval system, together with the shortcomings of their existing solutions. It then integrated the implementations of the three relatively independent subsystems, set up the Hadoop framework and completed the relevant configuration. Last but not least, it tested and evaluated performance against users' search requests. The website www.zstu.edu.cn was used as the test object, and the efficiency of crawling and indexing was measured on a three-node distributed platform and on a single machine. The timing results for crawling and indexing indicated that execution time on the cluster was about 15.64% lower than on a single machine, and that this saving will grow considerably as the number of web pages increases. A comparison of the relevance of search results over different numbers of web pages showed that the average accuracy of the Hadoop-based search engine was about 20% higher than under single-machine conditions. The performance tests indicated that, once the number of web pages reaches a certain level, the distributed search engine designed for small and medium-sized institutions was more effective at returning accurate results than a traditional centralized search engine. In addition, it is much safer and more reliable and has better scalability. These advantages will help small and medium-sized institutions improve the performance of their search systems and thereby speed up their information processing.
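The abstract states that BM25 was integrated into Lucene for ranking but gives neither the formula variant nor the parameters used. As an illustration only, the following is a minimal Python sketch of BM25 scoring over documents that are assumed to be already tokenized (for Chinese text, for example, by a segmenter such as Paoding). The function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75`, and the non-negative IDF form are assumptions chosen for the sketch, not details taken from the thesis.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25.

    docs is a list of token lists. k1 and b are common default values
    and are assumptions here; the thesis does not state its settings.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization; fall back to 1.0 if every document is empty.
        length_ratio = len(d) / avg_len if avg_len > 0 else 1.0
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Non-negative IDF variant (assumed; other forms exist).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            saturation = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * length_ratio))
            score += idf * saturation
        scores.append(score)
    return scores


if __name__ == "__main__":
    corpus = [
        ["hadoop", "search", "engine"],
        ["hadoop", "cluster"],
        ["campus", "news"],
    ]
    print(bm25_scores(["hadoop"], corpus))
```

In this sketch a shorter document that contains the query term as often as a longer one receives a higher score, which is the length normalization controlled by `b`.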
Keywords/Search Tags: Search engine, Hadoop, MapReduce, Distributed computing, medium-small institutions