Font Size: a A A

Study On LZ Complexity Algorithm And Its Application In Analysis Of Biological Sequences

Posted on:2009-07-28Degree:DoctorType:Dissertation
Country:ChinaCandidate:B LiFull Text:PDF
GTID:1118360278454261Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Lempel-Ziv(LZ) complexity is one of the intrinsical attributes of symbolic sequences. Through appropriate coarse graining and symbolization, biological macromolecules (DNA, RNA, protein) can be represented as one dimensional symbol sequences. It provides an alternative computational approach to solve important problems arisen from Bioinformatics recent years by analyzing and comparing LZ complexity and other LZ complexity related characteristics of biological sequences.Firstly, we defined the concept of Longest Suffix Prefix (LSP) partition of symbol sequence based on properties of LZ complexity, and proofed the equivalence of LSP partition and exhaustive production partition that is specified in LZ complexity definition. LSP-LZC, a new LZ complexity algorithm of non-null sequence was proposed by dividing sequence into their LSP partition form. Making use of the construction algorithm of suffix tree (with suffix links) as the major step, LSP-LZC algorithm takes linear time and space complexity in computation, which shows better efficiency than existing algorithms in computation time.LZ complexity is a numerical attribute of single sequence. To characterize LZ complexity relationship between two sequences, the concept of conditional LZ complexity was proposed. Further, LZ complexity similarity was proposed to measure similarity between sequences based on conditional LZ complexity. Then, LZ complexity similarity was applied to the study of reconstruction of phylogenetic trees. Using DNA sequences of whole mitochondrial and virus genome as input data, the phylogenetic trees of 29 Eutherian animals and SARS-corona-viruses were reconstructed respectively according to LZ complexity similarity. The work results showed significant biological interpretation. The concept of eign-transformation of LZ complexity similarity matrix, and the method of how to construct an LZ complexity kernel or kernel matrix were proposed based on LZ complexity similarity. Then, it was proofed that LZ complexity kernel or kernel matrix holds the property of positive definiteness and the invariance of LZ similarity relationship. We also presented the learning and prediction methods of LZ complexity kernel based support vector machine (SVM) in pattern analysis of symbol sequence. Protein subcellular locations can be predicted by adopting the methods proposed above. Experiments were carried out upon Eukaryotes and Prokaryotes datasets respectively, and high prediction precision was obtained.Furthermore, the methods of comparing protein three-dimensional structures were proposed based on the analysis of LZ complexity similarity between protein contact maps in this paper. Since protein 3D structure cannot be represented in form of symbol sequence, contact map of protein 3D structure was build firstly. Protein 3D structures can be compared indirectly by LZ complexity similarity calculated from their contact maps. The proposed methods were then tested by two experiments performed upon Chew-Kedem dataset and datasets build from SCOP database. All experiments achieved reasonable results.
Keywords/Search Tags:Lempel-Ziv(LZ) complexity algorithm, similarity, kernel and kernel matrix, DNA, protein
PDF Full Text Request
Related items