Font Size: a A A

The Research On Distributed Large Data Full-Text Retrieval Based On Block Linked List Index Algorithm

Posted on:2019-03-05Degree:MasterType:Thesis
Country:ChinaCandidate:Y YangFull Text:PDF
GTID:2348330566464288Subject:Engineering
Abstract/Summary:PDF Full Text Request
This paper focuses on the research of distributed large data full-text retrieval.Firstly,it analyzes the limitations of traditional full-text retrieval algorithms when dealing with large data sets.Secondly,it improves the problem in the traditional full-text retrieval algorithm about the difficulty of extending and updating of incremental and low retrieval efficiency.This paper proposes a new BLIS algorithm which is large data oriented and based on block linked index structure,and further improves the BLIS algorithm under the Storm distributed framework,and proposes a distributed large data oriented full-text retrieval algorithm(D_BLIS),The specific research contents are as follows:(1)Analysis of the limitations of the traditional full-text retrieval algorithm in dealing with large data document collection.Focusing on the traditional full-text retrieval algorithm in the process of index creation,the update process and the retrieval process when dealing with the collection of large data documents,especially analyze the problem on the difficulty in data structure expanding,incremental retrieving and the efficiency of retrieval.In the traditional full-text retrieval algorithm,it is mainly uses the forward index and index method,while the forward index and inverted index are both sequential storage structure.When indexing a large data set,the sequential storage structure is limited by the size of the preset storage of the data file,which makes it difficult for the storage medium to support a single and huge index file.When index data updated incrementally,the index data will move to a large number of data locations due to the increase or deletion of lexical items,which is not feasible under the condition of large document content and frequent updating of documents.When retrieving the index,the length of inverted information of lexical item in each index file is very long,and the information of each main index table of each index file is different.A retrieval operation requires traversal of each index file and all its elements,resulting in a longer time and higher cost.In addition,in the distributed computing framework,the index files generated by the traditional full text retrieval algorithm are generally above the GB order of magnitude,and the size of the files is so huge that the single storage medium is difficult to store it.In order to adapt to distributed storage,we need to split the index files.When the index files are sequential storage structure,the split file structure will be destroyed,and it is difficult for subsequent retrieval operation.While the index file is stored in multiple nodes.The length of the inverted information of the items contained increases in the index file in each node gradually,and the information of the primary index table of each index file is different.A retrieval operation needs to retrieve all the indexes in each node file and all elements are traversal,resulting in high retrieval time-consuming and cost.(2)To improve the inverted index structure of the traditional full-text retrieval algorithm,a new full-text retrieval algorithm(BLIS)based on the block-linked list index structure is proposed.Aiming at the problem that the traditional full-text retrieval algorithm cannot expand the data structure easily when dealing with the large data document set,this paper proposes a method which combining block and linked-list,splitting the master index and the slave index of the inverted index,and adds the list structure after each term of the main index structure.The index information of each document in the index is processed according to the document number,so that the structure between master index and the slave index is scalable.Aiming at the problem that traditional full-text retrieval algorithm is difficult to update increments when processing large data document collection,it uses a method of presetting common term table Term List and fixing length x.It presets common terms in main index and block index to construct index terms,and sets the fixed length x for generation of non-included terms in main index and update of index information in block units,to reduces the amount of data movement due to the insertion or deletion of terms.In order to solve the problem of low retrieval efficiency in traditional full-text retrieval algorithms,a unique master index structure is constructed based on block index and linked list index.Each term in the master index points to the block index list(Block Index),each node in a block index linked list refers to a block id and a document number(Doc Id),each node of block index points to a block unit structure containing the term.When retrieve,it only queries the block unit index that related to the retrieval term through the master index and the block index without traversing all.(3)Faced with the distributed computing framework,the BLIS algorithm is further improved,and the D_BLIS algorithm is proposed for full-text retrieval algorithm,which is suitable for Storm distributed computing framework.The improvement of BLIS algorithm under Storm distributed framework is actually the application of BLIS algorithm in Storm framework.Using Topology stream data sequence computing model,the BLIS algorithm is refined to adapt to the working mechanism of Storm distributed framework,we will call the improved BLIS algorithm as D_BLIS algorithm.First,based on the Storm flow data sequence calculation model,the index creation module,index update module and retrieval module of BLIS algorithm are fine-grained according to the stream data sequence,and these modules are further split into sets of computing unit Bolt with low granularity operations.These Bolt computing units are responsible for word processing,building block unit indexing information,creating master index information and block indexing information,integrating Index information and other logics.Then according to the data structure of the input data,the intermediate data and the result data of BLIS algorithm,it creates a set of data sources Spout.Spout is responsible for pulling data from the data source and concurrently sending a message to the next computing unit Bolt,and finally constructing the total Topology calculation model of D_BLIS algorithm.It shows that the D_BLIS algorithm improved can make the fine-grained parallel computing unit run automatically at each node defined in the Storm cluster,since the master index and slave index adopt the method of presetting the common term table Term List and the fixed length x,so that the merge operation of the master index and slave index after the parallel calculation are turned into merge operation of a simple fixed-length sequence table,and it reduces the situation of a large number of data position movement when insert or delete the terms.Because of the use of block-based index structure of the linked list,the structure between master index and slave index is scalable,so that each node in the Storm cluster does not need to traverse all the index files in parallel processing.In the process of retrieval,only need to retrieve the master index file and the block index file related to the retrieval term,thus reduce the retrieval time and improve retrieval efficiency.In addition,it also reduces the complexity of developing parallel real-time processing tasks for the implementation of a distributed large data full-text retrieval system.
Keywords/Search Tags:Large Data, Full-text Retrieval, Block Linked List Index Structure, Storm Distributed Computing
PDF Full Text Request
Related items