Font Size: a A A

Research On Inverted Index Compression Method For Dataspace Based On Interpolative Code

Posted on:2016-11-12Degree:MasterType:Thesis
Country:ChinaCandidate:F Y FanFull Text:PDF
GTID:2348330542475883Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Today,the science and technology develop rapidly.Data from the Internet is increasing at an astonishing speed.More and more people realize the importance of dealing with such large-scale data.Search engine has become a significant way to obtain information.It should response efficiently and accurately to the users when confronted with the surged web data.One of the most important technologies of search engines is the index.The inverted index is an effective and commonly used structure in the search engine.Compressed indexes can effectively reduce the volume of the data.On the one hand,the compressed indexes can occupy smaller space in the disk and on the other hand,it can reduce the volume of reading or writing data.At the same time,the users get responses more quickly.So to compress the inverted index have a great significance.The index compression algorithm provided in this paper studies the inverted index.In this paper,we have carefully analyzed the existing compression algorithms,such as bit compression,? coding,Golomb coding,interpolative coding and so on.These are commonly used in inverted index studying.Through exploring the applications of the various of compression algorithms,we can see their merits and demerits.On this basis,an improved interpolative coding algorithm has been raised in this paper to improve the compression and search query efficiency in dataspace.Firstly,we raised a new method to calculate the similarities of documents sets in preprocessing.It just considered the adjacency of document addresses in the inverted list.Compared with traditional similarity function,its pertinence is stronger.It can make the arrangement of document sets more fit the mixed compression algorithm,which is raised in this paper.Secondly,for the process of interpolative coding,we do not compress its head and tail pointers.In this way we can improve the interpolative coding's compression performance.Compared with bit coding,we find interpolative coding is more suitable for the compression,whose addresses are neighbouring in the inverted list but has high requirement for the auxiliary stack.The bit compression is more suitable used in high frequency words.In this paper,it presents a hybrid compression algorithm,which contains the advantages of bit compression and interpolative coding.By using bit compression' fusion,it can overcome the shortcoming of interpolative coding,who require high demands for auxiliary stack.This paper selects the structured and unstructured data toverify the performance of the hybrid compression algorithm.Experimental results show that hybrid compression algorithm make great improvement in the compression performance and search query efficiency.
Keywords/Search Tags:Inverted Index, Interpolative Code, Bit Compression, Dataspace
PDF Full Text Request
Related items