Visualization Of DNA Sequences In 2D Space

Posted on: 2012-07-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z J Zhang
Full Text: PDF
GTID: 1118330335455074
Subject: Control Science and Engineering
Abstract/Summary:
With the completion of the Human Genome Project (HGP) and the revolution in DNA sequencing technology, the volume of DNA sequence data in public biological databases has grown enormously. Third-generation sequencing technology, characterized by single-molecule sequencing, has now emerged, and massive amounts of DNA data can be obtained easily and at low cost. Mining knowledge from this data, and using that knowledge to benefit human beings, has become an urgent task for scientists.

DNA sequences are represented as long strings over four letters: A (Adenine), T (Thymine), G (Guanine) and C (Cytosine). Such strings are easy for a computer to store and read, but difficult for a person to review, observe and study. Can we develop a tool to observe and analyze this massive DNA data and mine knowledge from it? Visualization of DNA sequences was developed to meet this demand. It not only lets people inspect DNA sequences by eye, but can also be converted into mathematical models and algorithms, so that DNA sequences can be analyzed with mathematical tools and computers. Since Hamori and Ruskin proposed the first visualization model of DNA sequences in 1983, visualization technology has flourished. In this thesis we study the visualization of DNA sequences in 2D space, considering the problems of degeneracy, loss of information, visualization space and visualization effect, as well as applications of visualization tools to similarity analysis, mutation analysis, phylogenetic tree construction and so on. The main contributions are as follows.

To solve the basic problems of degeneracy and information loss, many researchers have used high-dimensional visualization. In this thesis we point out two insurmountable problems in multi-dimensional space that make the visualization effect of multi-dimensional visualization worse than that of two-dimensional visualization. First, in multi-dimensional space a point can hide behind another point. That is to say, degeneracy is solved only in theory: in practice, observers still see overlapping points and loops, so the degeneracy problem remains. Second, it is hard to read the exact coordinate of a point on each axis of a multi-dimensional graph. Based on this result, we divide the history of DNA sequence visualization into three phases: Phase I, two-dimensional visualization; Phase II, high-dimensional visualization; Phase III, the return of two-dimensional visualization.

Randic proposed a 2D graphical representation of DNA sequences, Spectral, and claimed that Spectral avoids loss of information, but he did not present a rigorous proof. Here we build two mathematical models for Spectral and prove that the claim is correct and that Spectral also avoids degeneracy. In addition, two applications of Spectral are given: similarity analysis of DNA sequences based on Spectral, and an extension of the Spectral model to protein sequences, which avoids degeneracy and information loss in the visualization of protein sequences and reflects the lengths of protein sequences and their amino acid content.

Many scholars have worked hard on the visualization of DNA sequences. However, it is difficult to handle all of the following problems in one graph: degeneracy, loss of information, the difficulty of observing a multi-dimensional graph, the difficulty of visualizing long DNA sequences, and the need to reflect useful information. Here we present DV-Curve (Dual-Vector Curve), which uses two vectors to represent each letter of a DNA sequence. It not only avoids degeneracy and loss of information, but also gives a good visualization whether sequences are long or short, and reflects the lengths of DNA sequences. Applications of DV-Curve to mutation analysis and to two types of similarity analysis are presented. Corresponding software for DV-Curve is also provided.

To make the visualization space compact, we study two 2D visualization models based on the worm curve: WormBin and WormStep. They not only avoid degeneracy and loss of information, but are also very compact. That is to say, they need only limited 2D space to visualize long DNA sequences. WormStep is developed from WormBin, but it overcomes a major drawback of WormBin: observers cannot grasp the base composition of the DNA sequence except by decoding the binary representation from the start point. In addition, we present applications of WormBin and WormStep to similarity analysis and phylogenetic tree construction. Compared with MSA (multiple sequence alignment) technology such as ClustalW, our methods have an important advantage: they are deterministic and run in polynomial time. MSA cannot achieve this unless P = NP.

Finally, we present a colorful 2D visualization model of DNA sequences, Color5. Color5 not only avoids degeneracy and loss of information, but is also very compact and colorful, which makes it easier to observe. It is also square, so sequences can easily be converted to matrices. Taking advantage of the fact that humans are more sensitive to color than to shape, we use Color5 to analyze mutations. Using the square shape of Color5, we convert it to a numerical matrix, extract two numerical characterizations (a 24-component eigenvalue vector and a 96k-component checksum vector), and perform similarity analysis on these two characterizations.
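As a small illustration of what degeneracy and loss of information mean in a 2D graph, the following Python sketch compares two toy encodings. Both are assumptions made for illustration and are not the models defined in the thesis: `STEP` is a hypothetical four-direction walk, and `LEVEL` is a simplified position-indexed encoding in which x is the base position and y is one of four levels.

```python
# Hypothetical direction assignment for a simple 2D walk
# (an assumption for illustration, not a model from the thesis).
STEP = {"A": (-1, 0), "T": (1, 0), "G": (0, 1), "C": (0, -1)}

# Assumed four-level encoding: x = position in the sequence, y = level of the base.
LEVEL = {"A": 1, "T": 2, "G": 3, "C": 4}


def walk_points(seq):
    """Return the points visited by the four-direction walk, starting at the origin."""
    x, y = 0, 0
    points = [(x, y)]
    for base in seq:
        dx, dy = STEP[base]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def level_points(seq):
    """Return one point per base; x strictly increases, so no point repeats."""
    return [(i + 1, LEVEL[base]) for i, base in enumerate(seq)]


def has_degeneracy(points):
    """True if the curve visits some point more than once (overlap in the graph)."""
    return len(set(points)) < len(points)


def decode_levels(points):
    """Recover the sequence from a level plot, showing that no information is lost."""
    inverse = {level: base for base, level in LEVEL.items()}
    return "".join(inverse[y] for _, y in sorted(points))


if __name__ == "__main__":
    seq = "ATAT"
    print(has_degeneracy(walk_points(seq)))   # the walk retraces its own path
    print(has_degeneracy(level_points(seq)))  # the level plot never overlaps
    print(decode_levels(level_points(seq)) == seq)
```

In the walk, "AT" and "ATAT" cover exactly the same set of plotted points, so the picture alone cannot distinguish them. In the position-indexed encoding every base gets its own x coordinate, so the original sequence can be read back from the graph.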
Keywords/Search Tags: DNA sequence, Sequence visualization, 2-dimensional visualization, Multi-dimensional visualization, Degeneracy, Loss of information, Similarity analysis, Mutation analysis, Phylogenetic tree construction