
Research On Key Technologies Of Full-text Index Compression In Cloud Environment

Posted on: 2019-10-03
Degree: Master
Type: Thesis
Country: China
Candidate: F J Bai
Full Text: PDF
GTID: 2428330566973374
Subject: Information and Communication Engineering
Abstract/Summary:
With the rapid development of network and information technology, social networking, e-commerce, information feeds, online games, and multimedia audio and video content have flourished as never before, and text-based information has grown explosively. People are gradually being overwhelmed by this ocean of data. How to efficiently locate the required information in a massive collection is an urgent problem, which has made information retrieval one of the most popular technologies today and has raised the performance demands placed on retrieval engines.

Full-text indexing is a key technology in information retrieval applications such as search engines and information filtering, and the full-text index is a key data structure for fast retrieval. However, the disk space required to store the index is several times the size of the original corpus. This not only wastes a large amount of disk space but is also one of the key factors limiting retrieval performance. Studying full-text index compression is therefore of great significance: compressing the index reduces its disk space overhead and also reduces disk I/O during retrieval, which improves retrieval performance.

This thesis studies in depth the compression of the inverted index, the most widely used form of full-text index. The main work is as follows. It analyzes theoretically the disk space occupied by typical inverted index compression algorithms. It proposes a document identifier assignment algorithm. Because the segmentation method in the adaptive segmentation compression algorithm does not produce an optimal segmentation, an artificial bee colony algorithm is used to optimize it, with an improved formula for computing the fitness of a food source (nectar source); DGap sequences, which have better compression performance, are compressed, and Golomb-Rice coding is used to compress the segmentation parameters. Introducing the artificial bee colony algorithm adds an iterative optimization process, and in the context of big data the algorithm is implemented on the Hadoop distributed cloud framework.

The improved algorithm is implemented in Java, and its effectiveness is verified on nine different integer sequences. The implementation is parallelized with the Hadoop distributed cloud framework, and the effectiveness of the parallelization is verified on two standard TREC corpora, GOV2 and ClueWeb09.
Keywords/Search Tags: Big Data, inverted index, inverted index compression, artificial bee colony algorithm, Hadoop
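
The abstract names two building blocks, DGap (document gap) sequences and Golomb-Rice coding, without giving details. The following is a minimal Java sketch of both ideas, not the thesis's algorithm: it converts a sorted list of document IDs into gaps and Rice-encodes the gaps with a fixed parameter `k`. The class name `DGapRice`, the method names, the fixed `k`, and the use of a `String` of '0'/'1' characters as the bit stream are all assumptions made for readability. The thesis applies Golomb-Rice coding to segmentation parameters; applying it directly to the gaps here is purely illustrative, and the adaptive segmentation and the artificial bee colony optimization are not reproduced.

```java
import java.util.Arrays;

// Illustrative sketch only; names and parameters are hypothetical, not from the thesis.
final class DGapRice {

    /** Converts strictly increasing document IDs into d-gaps (the first ID is kept as-is). */
    static int[] toDGaps(int[] docIds) {
        int[] gaps = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            gaps[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return gaps;
    }

    /** Inverse of toDGaps: running sums restore the original document IDs. */
    static int[] fromDGaps(int[] gaps) {
        int[] docIds = new int[gaps.length];
        int sum = 0;
        for (int i = 0; i < gaps.length; i++) {
            sum += gaps[i];
            docIds[i] = sum;
        }
        return docIds;
    }

    /** Rice code with parameter k: quotient (v >> k) in unary, then the k-bit remainder. */
    static String riceEncode(int[] values, int k) {
        StringBuilder bits = new StringBuilder();
        for (int v : values) {
            if (v < 0) {
                throw new IllegalArgumentException("values must be non-negative");
            }
            int q = v >>> k;
            for (int i = 0; i < q; i++) {
                bits.append('1');          // unary part: q ones
            }
            bits.append('0');              // unary terminator
            for (int b = k - 1; b >= 0; b--) {
                bits.append(((v >>> b) & 1) == 1 ? '1' : '0');
            }
        }
        return bits.toString();
    }

    /** Decodes `count` Rice-coded values from a bit string produced by riceEncode. */
    static int[] riceDecode(String bits, int count, int k) {
        int[] out = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int q = 0;
            while (bits.charAt(pos++) == '1') {
                q++;
            }
            int r = 0;
            for (int b = 0; b < k; b++) {
                r = (r << 1) | (bits.charAt(pos++) - '0');
            }
            out[i] = (q << k) | r;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 15, 40};
        int[] gaps = toDGaps(docIds);                  // gaps: 3, 4, 1, 7, 25
        String bits = riceEncode(gaps, 2);
        int[] restored = fromDGaps(riceDecode(bits, gaps.length, 2));
        System.out.println(Arrays.toString(gaps));
        System.out.println(bits.length() + " bits");
        System.out.println(Arrays.equals(docIds, restored));
    }
}
```

Small gaps yield short codes, which is why gap sequences tend to compress better than raw document IDs. The choice of `k` trades the unary part against the fixed-width remainder; a segmentation scheme such as the one the abstract describes would presumably pick parameters per segment rather than one global value.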