Font Size: a A A

Research On Compressed Self-index And Its Application Based On Neighbor Function

Posted on:2022-01-07Degree:MasterType:Thesis
Country:ChinaCandidate:P LongFull Text:PDF
GTID:2518306602493044Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
With the development of the information age,the rapidly increasing amount of data has brought challenges in both storage and retrieval.Although databases and indexing methods based on inverted indexes solve some of the problems,they cannot be solved in the fields of gene comparison and pattern string exact matching.Although traditional indexes such as suffix array(SA)and suffix tree(ST)have efficient pattern matching performance,they cannot meet actual needs due to the large space occupation.On the other hand,gene sequencing technology has developed rapidly in recent years,and the decline in sequencing costs and sequencing cycles has produced massive amounts of sequencing data,bringing challenges to sequencing data storage and analysis.As the mainstream format for sequencing data storage and analysis,SAM file contains complete information required for downstream data analysis,but it requires more storage space.Most of the current research on SAM files stays at the compression level,and does not solve the fast query problem required for downstream data analysis.In order to solve the large space usage required by the suffix array,compressed full-text selfindex came into being.Compressed suffix array(CSA)is an important branch of compressed self-index,and its core is the efficient representation of the neighbor function ?.This paper analyzes the distribution characteristics of the difference sequence of ? in different texts,establishes a hybrid coding method,and proposes a high-order entropy compressed text index structure,called GeCSA.On this basis,a compressed index structure for SAM files is proposed using the idea of compressed index,called SIC SAM.The main innovations of this thesis are as follows:(1)A high-order entropy-compressed text self-indexing,call GeCS A,based upon the neighbor function ? is proposed,and a method for fast retrieval in compressed space is established.GeCS A proposes the hybrid coding method by an analysis of the distribution of the gap sequence of the neighbor function ? for different texts,and designed a compression structure for the neighbor function ?.For a text T of length n,the theoretical space occupation of GeCS A is nHk+2n log(Hk+2)+n+o(n),and Hk represents the higherorder empirical entropy of T.Given any pattern P of m symbols,the Count query of GeCSA to return the suffix range of pattern P runs in 0(mlogn)time;the Locate query of GeCSA to locate all the occ occurrences of P in T runs in O(mlogn+occ(log n(log log n)2))time;and the Extract query of GeCSA to extract a text substring T[st,st+ len-1]runs in O((len+log n log log n)log ?)time,given position st where a suffix of T begins and its length len,where ? is the size of the alphabet of T.In addition,the use of suffix array value sampling reduces the actual query time.(2)A compression index method SIC SAM for SAM files is proposed.SIC SAM divides the twelve columns in the SAM file into six types:POS_Index,SEQ,QUAL,INT_Type,STRING_Type,and OPTIONAL_Type.After preprocessing these six types,the index structure is constructed using the idea of compressed index,which greatly reduces the storage space of the SAM file.In response to the query requirements of SAM files,the readCounts method is designed to query the number of reads matching any reference sequence interval within O(log n)time.The getInterval method is designed on the basis of readCounts to efficiently extract the SAM file information matched to any reference sequence range.The GeCSA proposed in this paper is compared with other latest compression index methods in terms of space occupation,locate and extract query.Experiments show that GeCSA usually has better compression efficiency and query efficiency than other compression index methods.Especially in highly repetitive texts,GeCSA's data has great advantages.The SIC SAM method proposed in this paper is compared with the same type of SAM lossless compression query method in terms of space occupation,short read count and SAM file extraction.Experiments show that SIC SAM also has excellent performance.Compared with the same type of method,it shows a good trade-off between query time and compression rate.Especially in terms of compression performance,SIC SAM uses a smaller space than the same type of method.
Keywords/Search Tags:Neighbor Function, Compressed Suffix Array, Hybrid Encoding, SAM Format, Compressed Index
PDF Full Text Request
Related items