Study And Implementation Of Inverted Index On Hadoop

Posted on:2014-11-28

Degree:Master

Type:Thesis

Country:China

Candidate:W N Dai

Full Text:PDF

GTID:2268330401465830

Subject:Computer application technology

Abstract/Summary:

With the development of the Internet, the amount of data that search engines must process keeps growing, so search engine performance must keep improving. The inverted index is a key component of a search engine system; its structure and organization, together with the query and dynamic update algorithms, have a great impact on the efficiency of information retrieval. To improve retrieval efficiency, distributed platforms have been adopted in search engine systems. Hadoop is a popular open-source distributed platform that has been applied in many systems with good results. Hadoop makes distributed programming more convenient and parallel computing easier, which improves system efficiency. Constructing an inverted index structure based on Hadoop is therefore important for improving the efficiency of search engines.

This thesis designs and constructs an inverted index structure on the distributed system Hadoop, using the HDFS file system and Map-Reduce. To some extent, it saves disk storage space and improves the efficiency of information retrieval.

First, the thesis studies and analyzes the architecture and major components of Hadoop and its two key technologies, the Map-Reduce programming model and the HDFS file system. It then studies how a Map-Reduce job is submitted and how its tasks run, analyzes the data flow of the whole process, and examines the principles and methods of application design based on Hadoop. Based on an analysis of the implementations and related algorithms of the traditional inverted index, it verifies the feasibility of realizing an inverted index on Hadoop.

On this basis, the thesis designs an inverted index structure consisting of the main index, the segment index, the deleted index, and the dictionary library. It describes each part of the structure in detail and designs an inverted file storage strategy based on word frequency and word frequency ranking, as well as a compression scheme for inverted items called blended coding. It also designs an inverted index construction algorithm in Map-Reduce, an update algorithm and a delete algorithm based on the segment index, and a query algorithm based on the dictionary library. Finally, the inverted index structure and its operation algorithms are implemented in a Hadoop distributed cluster environment, then tested and verified.
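
To make the construction and query steps more concrete, the following is a minimal single-process sketch in Python of how an inverted index could be built in map/reduce style and then queried across a main index, a segment index, and a deleted set. It is not the thesis's implementation: the function names, the tokenizer, the ordering of postings by descending term frequency, and the rule that segment entries override main entries are all assumptions made for illustration, and no Hadoop API is used.

```python
import re
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step (hypothetical): emit one (term, doc_id) pair per token."""
    for term in re.findall(r"[a-z0-9]+", text.lower()):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted pairs by term, as the framework's shuffle would."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reduce step (hypothetical): count occurrences per document and
    return a posting list of (doc_id, term_frequency) pairs.
    Sorting by descending frequency is an assumption, not the thesis's
    documented storage strategy."""
    counts = defaultdict(int)
    for doc_id in doc_ids:
        counts[doc_id] += 1
    postings = sorted(counts.items(), key=lambda p: (-p[1], p[0]))
    return term, postings


def build_index(docs):
    """Run map, shuffle, and reduce over a {doc_id: text} dictionary."""
    pairs = (pair
             for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(term, ids)
                for term, ids in shuffle(pairs).items())


def query(term, main_index, segment_index, deleted):
    """Merge postings from the main and segment indexes, skipping
    documents recorded as deleted. Segment entries override main
    entries for the same document (an assumed update rule)."""
    merged = {}
    for index in (main_index, segment_index):
        for doc_id, tf in index.get(term, []):
            if doc_id not in deleted:
                merged[doc_id] = tf
    return sorted(merged.items(), key=lambda p: (-p[1], p[0]))
```

In this sketch, newly added documents are indexed into a small segment index with the same `build_index` function, and deletions are recorded in a set instead of rewriting the main index, which mirrors the general idea of separate main, segment, and deleted structures described above.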

Keywords/Search Tags:

Hadoop, Inverted index, information retrieval, Map-Reduce

Related items

1	A Research Of Full-Text Retrieval Based On Inverted Index
2	Research On Key Technology Of Massive Image Retrieval Based On Hadoop
3	Research On SSD-based Inverted Index Construction And Maintenance Strategies
4	Research On Fast Text Retrieval Methods And Optimization Of Engineering Realization
5	Design And Implementation Of Multi-Keyword Parallel Ciphertext Retrieval System Based On Inverted Index
6	Study On Hadoop-based Inverted Index
7	Research On Key Technologies Of Full-text Index Compression In Cloud Environment
8	Research And Implementation Of Image Retrieval Based On BoVW Model In The Hadoop Platform
9	Theoretical Research And Application Of Computing Ad Retrieval System
10	The Design And Implementation Of Information Retrieval And Retrieval Analysis Subsystem Of Scientific Research Literature