Font Size: a A A

A Space And Time Efficient Algorithm For Constructing FM-index Of DNA Data

Posted on:2016-12-14Degree:MasterType:Thesis
Country:ChinaCandidate:Z WangFull Text:PDF
GTID:2310330488974185Subject:Engineering
Abstract/Summary:PDF Full Text Request
With the first human DNA being decoded into a sequence of about 2.8 billion characters, much biological research, such as sequence alignment, motif finding, has been centered on analyzing this sequence. Analysis and researches on sequences of DNA has been an important subject recently. In order to solve the problem that the FM-index of DNA data cannot be constructed on personal computer whose memory is 4-8 G, this paper focus on the space and time efficient algorithm for constructing FM-index of DNA data. This paper propose an framework of constructing FM-index so that it becomes the theoretical basis of constructing FM-index. This algorithm extends the application of FM-index in practical, it also saves the cost of purchase super computers.FM-index is a kind of full-text compressed self-index. Generally speaking, FM-index can be constructed according to the following steps. Firstly, construct the BWT of text T as text L, then use wavelet-tree to encode and store text L. Secondly, construct the query structure of rank for the internal nodes of wavelet-tree. Finally, take sample of SA and SA-1. With the support of these basic data structures, FM-index can realize efficient operations that called count, locate and extract. FM-index needs 5n Hk(T) + o(n) bits of memory where Hk(T) is the k-order experience of entropy and n is the length of T. FM-index can be stored into memory of personal computer, but it cannot be constructed on personal computer because it needs too much memory or too long time. Through numberous reading and analysis of previous researches, we have a deep understanding of BWT, the relationship between SA, SA-1 and LF mapping, and the characteristics of the rank query structure. Through lots of experiments and analysis, we finally obtain the algorithm in this paper.This paper is divided into three parts to introduce the constructing of FM-index: First, in order to solve the problem that the BWT of DNA data cannot be done on a personal computer, this paper proposes LF-BWT algorithm. In theory, LF-BWT can construct the BWT of DNA data in linear time and use linear space. As experiments show, LF-BWT takes only less than 1 times the original text to construct BWT of DNA data quickly, and the parameters can be adjusted to make a trade-off between time and space so that it further extends the application of LF-BWT. In comparison with the latest mainstream algorithms, LF-BWT also shows an advantage. Second, in order to solve the problem that LF-BWT cannot get the samples of SA and SA-1, we obtain samples with the help of LF mapping in linear time. This algorithm does not store SA completely, every time encounters an sample during every LF mapping, it stores the value. In this way, it greatly reduces the use of memory. Third, in order to further improve the compression ratio and query efficiency of FM-index, this paper presents an improved version of the RRR which called CF. It combines the word-length of CPU and data locality principle so that it improves the hit rates of cache. In theory, the time and space complexity of CF is in keeping with RRR, but it provides a more efficient and practical performance. As experiments show, regardless of random DNA data or real data, CF exhibited a great advantage.
Keywords/Search Tags:FM-index, BWT, Backward Search, Wavelet tree, rank
PDF Full Text Request
Related items