Font Size: a A A

Performance Optimization Of A New External Memory Algorithm For Suffix Array Construction

Posted on:2015-07-07Degree:MasterType:Thesis
Country:ChinaCandidate:Y T ChenFull Text:PDF
GTID:2298330422477153Subject:Software engineering
Abstract/Summary:PDF Full Text Request
Suffix array is an array of indices pointing to all suffixes of a character string,sorted according to their ascending (or descending) lexicographical order. It is usedwidely in large-scale applications, e.g., genome analysis, text compression and textsearching. However, most of algorithms for construction of suffix array are designedto run totally in the random memory. They are not adapted to the practical applicationof large scale, because the random memory of limited size is unable to accommodateall of the large-scale arrays in running process and lots of random I/O jobs will needto do with these arrays being put in virtual memory by operation system. In thissituation, running time will become too long to endure. To deal with this problem, weneed to improve those algorithms for construction of suffix array working only withrandom memory, so that it will work efficiently with external memory for a quitelarge string, to a great degree, regardless of the limited size of random memory.In this paper, several classical algorithms working only with random memorywill be analyzed, especially the IS algorithm basing on the idea ofdivide-and-conquer and induced-sorting, which has great time efficiency and muchgreater space efficiency than any other algorithms. However, because of theoperation of random accessing arrays, this algorithm will be unavailable when thememory is unable to afford enough space to store the entire arrays in running process.Then some improvements are introduced, which is made to adapt IS to working withexternal memory, including the following points:1) packet a bunch of continuousand same characters into a struct to reduce space used to store the string,2) appendinformation of the preceding character to each struct created from the string, then sortthose structs by bucket and by index in each bucket, and this result will be regarded as a reference to be matched with those suffixes also sorted by their indexes in theprocess of induce-sorting so as to get the information of preceding suffixes and avoidthe operation of random accessing,3) construct part of suffix arrays in the process ofinduce-sorting to reduce the time of construction,4) name suffixes in the process ofinduce-sorting, so that the preceding result of comparison can be used to avoidredundant comparisons brought by naming suffixes alone after the process ofinduce-sorting. With these changes, IS algorithm can work with the random memoryand disk in great efficiency, and the improved algorithm is called DIS algorithm.
Keywords/Search Tags:IS, Memory, Disk, Construction of Suffix Arrays, Sorting
PDF Full Text Request
Related items