Study And Implementation Of Inverted Index On Hadoop

Posted on: 2014-11-28 Degree: Master Type: Thesis
Country: China Candidate: W N Dai Full Text: PDF
GTID: 2268330401465830 Subject: Computer application technology
Abstract/Summary:
With the growth of the Internet, the amount of data that search engines must process keeps increasing, so search engine performance must keep improving. The inverted index is a key component of a search engine system; its structure and organization, together with the query and dynamic update algorithms, have a great impact on the efficiency of information retrieval. To increase retrieval efficiency, distributed platforms have been adopted in search engine systems. Hadoop is a popular open-source distributed platform that has been applied to many systems with very good results. Hadoop makes distributed programming more convenient and parallel computing easier, which increases system efficiency. Therefore, constructing an inverted index structure based on Hadoop is important for increasing the efficiency of search engines.

This thesis designs and constructs an inverted index structure on the Hadoop distributed system, using the HDFS file system and Map-Reduce. To some extent, it can save disk storage space and increase the efficiency of information retrieval.

Firstly, this thesis studies and analyzes the architecture and major components of Hadoop and its two key technologies, the Map-Reduce programming model and the HDFS file system. It then studies the submission of a Map-Reduce job and the running of its tasks, and analyzes the data flow of the whole process as well as the principles and methods of application design based on Hadoop. Based on an analysis of the implementations and related algorithms of the traditional inverted index, it verifies the feasibility of realizing such an index on Hadoop.

On that basis, this thesis designs an inverted index structure consisting of the main index, the segment index, the deleted index, and the dictionary library. We then describe each part of the structure in detail and design an inverted file storage strategy based on word frequency and word frequency ranking, as well as a compression scheme for inverted items called blended coding. We also design an inverted index construction algorithm in Map-Reduce, an inverted index update algorithm based on the segment index, an inverted index delete algorithm based on the segment index, and an inverted index query algorithm based on the dictionary library. Finally, we implement the above inverted index structure and operation algorithms in a Hadoop distributed cluster environment and then test and verify them.
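
To illustrate the idea of building an inverted index in Map-Reduce, the following is a minimal in-memory sketch in Python. It is not the thesis's algorithm: the function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`), the tokenizer, and the sample documents are all hypothetical, and the map, shuffle, and reduce steps are simulated in a single process rather than run on Hadoop. Ordering each posting list by descending term frequency is only one possible reading of the "word frequency ranking" storage strategy mentioned above.

```python
from collections import Counter, defaultdict
import re


def map_phase(doc_id, text):
    """Map: emit (term, (doc_id, term_frequency)) for each distinct term in one document."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    for term, tf in counts.items():
        yield term, (doc_id, tf)


def shuffle(pairs):
    """Shuffle: group mapper output by term, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for term, posting in pairs:
        grouped[term].append(posting)
    return grouped


def reduce_phase(term, postings):
    """Reduce: build one posting list per term, highest term frequency first (ties by doc_id)."""
    return term, sorted(postings, key=lambda p: (-p[1], p[0]))


def build_index(docs):
    """Run the three simulated phases over a {doc_id: text} mapping."""
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ps) for term, ps in shuffle(pairs).items())


# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "hadoop stores data in hdfs",
    "d2": "hadoop runs map reduce jobs on hadoop clusters",
}
index = build_index(docs)
print(index["hadoop"])  # [('d2', 2), ('d1', 1)]
```

On a real cluster the grouping step is performed by the framework itself, so only the map and reduce functions would need to be written by the application.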
Keywords/Search Tags: Hadoop, Inverted index, information retrieval, Map-Reduce