Font Size: a A A

Research On Entropy-Compressed Structure Of Sequence And Text

Posted on:2019-02-06Degree:MasterType:Thesis
Country:ChinaCandidate:C J HongFull Text:PDF
GTID:2370330572451755Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
The explosion of data volume has brought us development opportunities and challenge of information storage and retrieval in the information time.String matching can be regarded as the most basic operation of information retrieval and the usual ways to solve this problem are index matching and sequential matching.Index matching such as SA(Suffix Array)is used instead of traditional sequential matching due to its high efficiency.However,the problem of excessive space in SA limits its practicality.How to efficiently store the SA and enable it to support fast query operations has become one of the important research topics in the field of compressed indexes.Related researches in CSA(Compressed Suffix Array)aim at how to efficiently encode?array in recent years.We follow this research route and want to design new CSA structures and algorithm based on the characteristics exhibited by the difference sequences(gap sequences)for different texts,in an effort to improve the compression ratio and query efficiency of CSA in the face of different types of text input.In this paper,we conduct experiments on various types of standard text input based on a double-layer compressed index called GamCSA and find the different gap sequence statistical characteristics which specifically reflect in its 1-gap proportion and average length of 1-gap-Runs where 1-gap represents 1 value in the gap sequence and 1-gap-Runs represents the average times of 1 value consecutive occurrences in the gap sequence.The higher the 1-gap proportion and the longer the length of 1-gap-Runs indicates that the text is more compressible.For this situation,we use a hybrid coding strategy conducting the encoding proportion experiment for various types of texts with appropriate coding.And based on the gap sequence statistical characteristics of the text and encoding proportion as classification,we propose two new entropy compression structures,HiCSA which has an excellent performance on highly-repetitive text sets and NorCSA which can improve index performance on normal text sets.In the HiCSA,an improved Run-length code based on practical problems is applied to good handling of longer 1-gap-Runs text.For a text T with length n,HiCSA has a nH_k+2n log(H_k+1)+n+o(n)space upper bound,where H_k represents the k-order empirical entropy of T.The code of BV+?is applied in the NorCSA structure,which improves the performance of the index when using a single?code,and maintains the space upper bound of 2nH_k+n+o(n).Afterwards,we revolve around these two new structures,HiCSA and NorCSA,to design efficient access and query algorithms to quickly solve count,locate and extract problems.In the design of the query algorithm,we have improved the lookup-table to adapt to the fast decoding operation in the new RL-?code method;and we have also proposed a new structure called phrase-table to pre-store the intervals of suffixes in order to improve the performance of backward search algorithm.Finally,we also pay attention to the influence of the word frequency on query problem.Focus on the distribution of the sampling determine points,we propose a new variable-length sampling strategy with the SA position sampling method,and design the corresponding structure and algorithm to speed up locate problem.Experiments show that these two new entropy compressed indexes proposed by us have better compression rate and query time and they have significant advantages over other mainstream indexes compared especially in locate time;At the same time,the variable-length sampling strategy we proposed has also been validated.It can achieve better locate time than fixed-length sampling on most types of text.
Keywords/Search Tags:SA, compressed index, entropy compression, hybrid coding, variable sampling
PDF Full Text Request
Related items