
Research On Genome Assembly Method Based On High Throughput Sequencing Data

Posted on: 2016-07-16
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Zhu
GTID: 1108330479478649
Subject: Computer application technology
Abstract/Summary:
Genome assembly from high-throughput sequencing (HTS) data is a fundamental application of bioinformatics. Unlike traditional Sanger sequencing data, HTS data are characterized by high throughput, short read length and a high error rate, which has led to many excellent assembly approaches, mainly based on the overlap graph and the De Bruijn graph. These approaches use a fixed read overlap or k-mer size, but a fixed length is not suitable for resolving branches and gaps during assembly, and they do not fully utilize paired-end and single-end reads for resolving branches. To address these shortcomings, we propose the de novo assembler PERGA (Paired-End Reads Guided Assembler), which utilizes multiple heuristics.

Because of the short read length of high-throughput sequencing data, errors are introduced during genome assembly. The reference-based approach to identifying them does not consider the impact of structural variations, and the de novo approach tends to introduce mis-identifications for data with uneven coverage. Error calling is therefore biased. To address these shortcomings, we propose the unbiased mis-assembly identification tool misFinder.

The main contents include:

(1) An approach for branches based on SVM

Genome assembly is mainly based on the overlap graph and the De Bruijn graph, which usually contain branches. Each branch corresponds to a path, and the assembler needs to distinguish the correct path from multiple candidates. Sequencing errors in the reads and repeats in the genome are the main causes of branches. After analyzing the branches, features are extracted to distinguish the correct path from the incorrect ones, and an SVM prediction model is then established to deal with the branches caused by sequencing errors.

(2) The look-ahead approach for branches

There are some nonexact repeats with high similarity, and some short tandem repeats (e.g. <100 bp) whose occurrence positions are close to each other (e.g. <100 bp).
These repeats introduce branches. The SVM prediction model considers only the information at the branch site and locally before it, and ignores the information after the branch. We design the look-ahead approach to deal with the bubbles caused by nonexact repeats and the branches caused by short tandem repeats, and to separate the different copies of the short tandem repeats, so that branches are resolved more accurately and assembly quality improves.

(3) Assembly method PERGA using multiple heuristics

Existing assembly methods usually use a fixed overlap length or k-mer size, which may fail to deal with repeats in the genome and gaps with low coverage, and they do not fully utilize paired-end and single-end reads during assembly. To address these limitations, we propose the de novo assembler PERGA (Paired-End Reads Guided Assembler) to resolve branches in a better way. More specifically, it employs four heuristics, from the most conservative to the most relaxed: i) for each branch, use compatible paired-end reads to extend the path; ii) if no paired-end reads are available, extend with single-end reads, starting from those with the maximum overlap; iii) for multiple feasible extensions, use a machine learning method (SVM) to select one path; iv) if the paths are indistinguishable, employ the look-ahead approach to search for possible short stretches of nonexact repeats that can be bridged and possible short tandem repeats whose different copies can be separated, before terminating the extension at the branch.

(4) An unbiased approach for mis-assembly identification

Several tools have been developed to eliminate assembly errors by either i) comparing the assembled sequences with a similar reference genome (reference-based approach), or ii) analyzing paired-end reads aligned to the assembled sequences and determining inconsistent features along mis-assembled sequences (de novo approach).
However, the former approach cannot distinguish real structural variations between the target genome and the reference genome from assembly errors, while the latter can produce many false error calls. We present misFinder, a tool that aims to identify assembly errors with high accuracy in an unbiased way by combining the information from a reference (or closely related reference) genome with paired-end reads aligned to the assembled sequences. Different types of assembly errors can then be distinguished in the mis-assembled sequences by analyzing the aligned paired-end reads using multiple features derived from coverage and the consistency of insert distance, yielding high-confidence error calls.
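
The idea of combining paired-end coverage with insert-distance consistency can be illustrated with a small Python sketch. It is not misFinder's implementation. The function name `flag_suspicious_windows`, the fixed-window scan, and all thresholds (`window`, `z_cutoff`, `min_pairs`, `max_bad_frac`) are assumptions chosen for illustration. Each read pair is reduced to a hypothetical `(left_start, right_end)` interval on one assembled sequence.

```python
def flag_suspicious_windows(pairs, contig_len, lib_mean, lib_sd,
                            window=100, z_cutoff=3.0,
                            min_pairs=2, max_bad_frac=0.5):
    """Scan a contig in fixed windows and flag those that look mis-assembled.

    pairs      -- list of (left_start, right_end) for aligned read pairs
    lib_mean   -- expected insert distance of the library
    lib_sd     -- standard deviation of the insert distance
    All thresholds are illustrative assumptions, not published values.
    """
    flagged = []
    for start in range(0, contig_len, window):
        end = min(start + window, contig_len)
        mid = (start + end) // 2
        # Pairs whose interval covers the window midpoint.
        spanning = [(l, r) for l, r in pairs if l <= mid < r]
        if len(spanning) < min_pairs:
            # Too few pairs bridge this position: coverage feature.
            flagged.append((start, end, "low_coverage"))
            continue
        # Pairs whose insert distance deviates strongly from the library.
        bad = sum(1 for l, r in spanning
                  if abs((r - l) - lib_mean) > z_cutoff * lib_sd)
        if bad / len(spanning) > max_bad_frac:
            flagged.append((start, end, "insert_distance"))
    return flagged


# Toy data: three consistent pairs, four pairs with a much shorter insert.
example_pairs = [(0, 200), (10, 205), (30, 230),
                 (100, 160), (120, 180), (130, 170), (140, 175)]
result = flag_suspicious_windows(example_pairs, contig_len=300,
                                 lib_mean=200, lib_sd=10)
print(result)
# [(100, 200, 'insert_distance'), (200, 300, 'low_coverage')]
```

In this toy run, the middle window is flagged because most pairs covering it have an inconsistent insert distance, and the last window is flagged because no pair bridges it. A real tool would use alignments from a read mapper and, as the abstract describes, cross-check such regions against a reference genome before calling an error.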
Keywords/Search Tags:genome assembly, high throughput sequencing data, branch, support vector machine, look ahead approach, mis-assembly identification