Indel Evolution In Model Organism And A Detection Approach In Non-model Organisms

Posted on: 2014-07-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z C Chong
Full Text: PDF
GTID: 1220330467980030
Subject: Genomics
Abstract/Summary:
Indel evolution is important to genome architecture and to the adaptation of species. However, the evolution of indels has been insufficiently studied. On the one hand, it can be studied in model organisms, which have complete reference genomes. As a demonstration, we studied indel evolution using the Drosophila genome. We surveyed 7500 genes between Drosophila melanogaster (D. mel) and Drosophila simulans (D. sim), using Drosophila yakuba (D. yak) as an outgroup. The evolutionary rate of coding indels is very low, at only 3% of that of nonsynonymous substitutions. As coding indels follow a geometric distribution in size and tend to fall in low-complexity regions of proteins, it is unclear whether selection or mutation underlies this low rate. To resolve this issue, we collected genomic sequences from an African isogenic line of D. mel (ZS30) at a high coverage of 70X and analyzed indel polymorphism between ZS30 and the reference genome. By comparing polymorphism and divergence, we found that the divergence-to-polymorphism ratio (i.e., the fixation index) for smaller indels (size ≤ 10) is very similar to that for synonymous changes, suggesting that most of the within-population polymorphism and between-species divergence for indels is selectively neutral. Interestingly, deletions of larger sizes (size ≥ 11 and ≤ 30) have a much higher fixation index than synonymous mutations, and 44.4% of these fixed deletions are estimated to be adaptive. To our surprise, this pattern is not found for insertions. Protein indel evolution appears to be in a dynamic flux of neutrally driven expansion (via insertions) together with adaptively driven contraction (via deletions). These observations provide important insights for understanding the fitness of new mutations, as well as the evolutionary forces driving genomic evolution in Drosophila species.
On the other hand, indel evolution can be studied in non-model organisms.
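For the Drosophila comparison, the step from a fixation index to an estimated adaptive fraction can be illustrated with a short calculation. This is a minimal sketch, assuming the commonly used MK-style estimator α = 1 − FI_neutral / FI_test, where FI is the divergence-to-polymorphism ratio of a mutation class. The abstract does not state which estimator was used, and the function names and counts below are hypothetical, not data from the dissertation.

```python
def fixation_index(divergence: int, polymorphism: int) -> float:
    """Divergence-to-polymorphism ratio for one class of mutations."""
    if polymorphism <= 0:
        raise ValueError("polymorphism count must be positive")
    return divergence / polymorphism


def adaptive_fraction(d_test: int, p_test: int,
                      d_neutral: int, p_neutral: int) -> float:
    """Assumed MK-style estimate of the fraction of adaptive fixations.

    The neutral class (e.g. synonymous changes) sets the fixation index
    expected without selection; any excess in the test class is
    attributed to adaptive fixation. The estimate can be negative when
    the test class has a lower fixation index than the neutral class.
    """
    fi_test = fixation_index(d_test, p_test)
    fi_neutral = fixation_index(d_neutral, p_neutral)
    if fi_test <= 0:
        raise ValueError("test class needs at least one fixed difference")
    return 1.0 - fi_neutral / fi_test


# Hypothetical counts chosen only to make the arithmetic easy to follow.
alpha = adaptive_fraction(d_test=60, p_test=20, d_neutral=3000, p_neutral=2000)
print(f"{alpha:.3f}")  # 0.500
```

With these made-up counts the test class has a fixation index of 3.0 against 1.5 for the neutral class, so half of the fixed differences in the test class would be attributed to adaptation under this estimator.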
The development of restriction-site-associated DNA sequencing (RAD-Seq) facilitates the acquisition of genetic markers, especially when the reference genome is unknown or incomplete. It takes full advantage of the high throughput, low cost and automation of next-generation sequencing technology and can obtain genome-wide markers efficiently. Paired-end RAD-Seq has the property of a sharp RAD-tagged end and a staggered second end. By clustering paired-end short reads into groups with their own unique tags and locally assembling them into contigs, we can generate a reduced representation of the whole genome, which can be used as a reference to identify markers and conduct population genetics studies. However, it is common for millions of short RAD-Seq reads to require analysis. In addition, the reads often contain sequencing errors, and the levels of heterozygosity and repetitive sequence can be high. Clustering and assembling millions of RAD-Seq reads quickly and accurately is therefore a challenging problem.
To group these reads quickly while tolerating sequencing errors, we use a spaced seed method for the initial clustering of the RAD reads. This step can generate over-represented clusters due to repeats. One goal of RAD-Seq analysis is to distinguish repetitive sequences. We therefore implement a strategy similar to heterozygote calling to divide potential groups into haplotypes in a top-down manner. Another goal of RAD-Seq analysis is to collapse heterozygous sequences. To achieve this, along a guided tree, we iteratively merge sibling leaves in a bottom-up manner if they are similar enough. Here, similarity is defined by comparing the second reads of a RAD segment. This approach tries to collapse heterozygous sequences while discriminating repetitive sequences. Finally, we use a greedy algorithm to locally assemble the merged reads into contigs. We can output not only the optimal but also suboptimal assembly results.
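The initial spaced-seed clustering step can be illustrated with a small sketch. It is not Rainbow's implementation: the masks, the use of two complementary masks, and the union-find grouping are assumptions made here to show how reads that differ by a single error can still land in one cluster. Reads are assumed to be at least as long as each mask and to start at the RAD-tagged end, so that positions are comparable.

```python
from collections import defaultdict


def seed_key(read: str, mask: str) -> str:
    """Keep only the bases at positions where the mask has '1'."""
    return "".join(base for base, m in zip(read, mask) if m == "1")


def cluster_reads(reads, masks):
    """Group reads that share a spaced-seed key under at least one mask."""
    parent = list(range(len(reads)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for mask in masks:
        first_seen = {}  # key -> index of the first read with that key
        for idx, read in enumerate(reads):
            key = seed_key(read[:len(mask)], mask)
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(reads)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Hypothetical reads: 1 and 3 each differ from read 0 at one position.
reads = ["ACGTACGT", "ACGTACCT", "TTGGCCAA", "ACGAACGT"]
masks = ["10101010", "01010101"]  # hypothetical complementary masks
clusters = cluster_reads(reads, masks)
print(clusters)  # [[0, 1, 3], [2]]
```

A mismatch at a position that one mask ignores does not change the key under that mask, which is why the two single-error reads join read 0 while the unrelated read stays apart. Each read is hashed once per mask, so the cost grows linearly with the number of reads.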
Thus, we provide an ultra-fast and memory-efficient solution for clustering and assembling short reads produced by RAD-Seq.
Based on this strategy, we developed Rainbow for efficient clustering and local assembly of RAD-Seq short reads. Using simulated data and real guppy RAD-Seq data, we show that Rainbow is more competent than other tools in dealing with RAD-Seq data. Rainbow is written in C, and its source code is freely available under the GNU General Public License at http://sourceforge.net/projects/bio-rainbow/files/.
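The greedy local assembly step can be pictured with a generic overlap-merge sketch. The abstract does not describe the details of the greedy algorithm, so the suffix-prefix overlap rule, the `min_len` threshold and the function names below are assumptions for illustration only. This sketch returns a single result and does not reproduce the suboptimal assemblies mentioned in the abstract.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b.

    Returns 0 when no overlap of at least min_len bases exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Hypothetical reads that tile one short region.
contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"])
print(contigs)  # ['ATGGCGTACCTGA']
```

The pairwise search is quadratic in the number of sequences, which is tolerable only because assembly is applied locally, within one cluster at a time.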
Keywords/Search Tags: indel evolution, MK test, Drosophila, RAD-Seq, clustering, de novo assembly