Font Size: a A A

Space- And Time-efficient Compression And Intersection Algorithms For Inverted Index

Posted on:2019-11-29Degree:DoctorType:Dissertation
Country:ChinaCandidate:S S SongFull Text:PDF
GTID:1368330623450408Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the rapid growth of the Internet,various types of information are scaling up at a far faster pace.The ever-growing data volume and number of users have also brought severe challenges to various types of information systems,especially search engines.Improving the efficiency of web crawling,indexing,and query processing is an effective means to mitigate such challenges.Served as the most commonly used data structure for information retrieval,the inverted index is responsible for information management and query processing.Index compression and query optimization can significantly reduce operational costs while improving efficiency,these two techniques are relatively old and well-established topics in the field of information retrieval.In this thesis,we focus on improving the efficiency of the inverted index,specially the design of spaceand time-efficient compression algorithm and parallel intersection algorithm.The main contribution of this thesis are as follows:1.In order to improve the compression speed of the compression algorithms,we first summarize the partitioning strategies used in List-Oriented Compression into the shortest path problem on a single-source directed acyclic graph,then we the partitioning strategies of AFOR and VSEncoding,including the addition of block folding for AFOR and substituting dynamic programming with approximation algorithms.These optimizations are able to improve the compression speed while maintaining the same compression ratio.We also propose a heuristic partitioned Elias-Fano index compression algorithm,in contrast with multiple sliding windows used in partitioned Elias-Fano index,we only need one sliding window to automatically detect exceptions and build a partition according to the change in block length and compression cost.At the expense of a slight decline in compression ratio and query speed,our algorithm is able to achieve an obvious advance in compression speed.Experimental results show that the compression algorithms proposed in this paper can achieve a better position in the space-time curve than the algorithm before optimization.2.Recent research comes out with an idea of integrating different encoders within the same index,namely,exploiting access biases by compressing frequently accessed regions with faster encoders and rarely accessed regions with succinct encoders. We introduce the notion of Bicriteria Compression and formalize it into the resource constrained shortest path problem on a single-source directed acyclic graph with dual weights,namely its compressed size and decoding time.We adopt a Lagrangian relaxation algorithm to solve this problem,which works in O(n log n)time and O(n)space,with a negligible additive approximation.Furthermore,this algorithm can be extended via dynamic programming pursuing better query efficiency.We perform an extensive experiment to show that,given bounded time/space budget,our method can optimally trade one for another with more efficient indexing and query performance.3.We first revise the multiple set intersection algorithms from previous studies and recently proposed SIMD based intersection algorithms.Then we summarize two factors that affect the efficiency of intersection algorithms,namely,the selection of eliminator and the search algorithms.Current SIMD based algorithms only focus on intersecting two sets at a time and ignore the scenario of multiple sets intersection.Based on these observations,we propose a SIMD based multiple set intersection algorithm,however,it barely satisfies the requirement of intersecting long posting lists as it adopts linear search.To push its efficiency one step further,we first use of the gather instruction provided by AVX2 to simultaneously collect the elements from the discrete indexes and compare them with eliminator at the same time to accelerate the skipping process;then a dual-scale search algorithm is proposed whose search window is able to adjust its size for different posting list lengths.Compared with prior algorithms which simply use empirical parameters for search windows,our search algorithm automatically matches these parameters for optimal efficiency,and greatly improves the performance of the intersection.
Keywords/Search Tags:Inverted Index, Index Compression Algorithm, Index Intersection Algorithm, Shortest Path Problem, Parallel Optimization
PDF Full Text Request
Related items