BARNACLE: An assembly algorithm for clone-based sequences of whole genomes

Posted on: 2003-02-26
Degree: Ph.D
Type: Thesis
University: Rutgers The State University of New Jersey - New Brunswick
Candidate: Choi, Vicky
GTID: 2461390011489215
Subject: Computer Science
Abstract/Summary:
The clone-based sequencing strategy is the one adopted by the International Human Genome Sequencing Consortium (IHGSC) to sequence the human genome. The data set generated by this approach consists of a set of Bacterial Artificial Chromosome (BAC) clones, typically 100–200 kb long. Each BAC clone is individually sequenced and assembled. In general, a clone consists of a set of more or less nonoverlapping fragments. The sequence assembly problem is to reconstruct the genome sequence by assembling the fragments. The primary difficulty is caused by repeats, which are sequences that appear two or more times in the genome. The problem is further complicated by laboratory errors, such as chimeras and contamination; by the draft quality of the sequences, which are sometimes inaccurate; and by polymorphism.

In this thesis, we propose an assembler, BARNACLE, that, unlike previous work, is based on a mathematically justifiable approach to assembling the sequences. First, we assemble the consistently overlapping fragments; consistency of overlaps is a necessary condition for the true assembly. Then we use the clone overlaps to resolve fragment-level inconsistencies. Making use of the fact that the clone graph must be an interval graph, we detect and remove the repeat-induced overlaps. This also allows us to detect chimeric clones. To resolve the non-interval graphs, we design an efficient algorithm based on a divide-and-conquer method. We then use the interval representation of the BAC clones to order and orient the subassemblies. Finally, additional information derived from plasmid-read, EST, and mRNA data is used for orientation purposes. Moreover, unlike the other two existing algorithms, we are able to detect several types of errors in the input data.

We illustrate our approach by assembling the public working draft of the human genome, which consists of a set of finished and draft BAC clones produced by the IHGSC. BARNACLE takes 3 minutes on a Pentium III (933 MHz) computer to assemble the April freeze of the public working draft of the human genome. We present our results on the April freeze and compare them with the two public assemblies: NCBI's assembly and GigAssembler's assembly. We also present our findings of suspected input errors in this data set.
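The interval-graph constraint on clone overlaps can be illustrated with a small sketch. This is not BARNACLE's algorithm, which the abstract does not spell out; it only demonstrates one necessary condition. Every interval graph is chordal, so a chordless 4-cycle in the clone-overlap graph proves that at least one overlap is inconsistent with any linear placement of the clones. The clone names, coordinates, and function names below are hypothetical.

```python
from itertools import combinations


def overlap_graph(intervals):
    """Build a clone-overlap graph from hypothetical clone placements.

    intervals maps a clone name to a (start, end) pair on the genome.
    """
    graph = {name: set() for name in intervals}
    for a, b in combinations(intervals, 2):
        (s1, e1), (s2, e2) = intervals[a], intervals[b]
        if s1 < e2 and s2 < e1:  # the two intervals share some region
            graph[a].add(b)
            graph[b].add(a)
    return graph


def find_induced_4_cycle(graph):
    """Return a chordless 4-cycle (a, b, c, d), or None if there is none.

    An interval graph never contains such a cycle, so finding one shows
    that the graph is not interval. Not finding one proves nothing: this
    is a necessary condition only, not a full interval-graph test.
    """
    for a, c in combinations(sorted(graph), 2):
        if c in graph[a]:
            continue  # a and c must be opposite, non-adjacent corners
        common = sorted(graph[a] & graph[c])
        for b, d in combinations(common, 2):
            if d not in graph[b]:  # b and d must also be non-adjacent
                return (a, b, c, d)
    return None


# Four hypothetical clones tiled along a region (coordinates in kb).
clones = {"A": (0, 150), "B": (100, 250), "C": (200, 350), "D": (300, 450)}
graph = overlap_graph(clones)
clean_result = find_induced_4_cycle(graph)

# Simulate a repeat-induced overlap between the two distant clones A and D.
graph["A"].add("D")
graph["D"].add("A")
spurious_result = find_induced_4_cycle(graph)
```

With the true placements the overlap graph is a simple path and no violation is reported. Adding the false A-D overlap closes the path into a cycle with no chords, which no set of intervals on a line can produce.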
Keywords/Search Tags: Assembly, Genome, Clone, Sequence, Data, BAC