Analysis Of Error Model For High-Throughput Sequencing And Decoding Solution Design

Posted on: 2016-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: C Xia
Full Text: PDF
GTID: 2310330503476778
Subject: Biomedical engineering
Abstract/Summary:
High-throughput DNA sequencing plays an increasingly essential role in life science research. After years of development, high-throughput sequencing platforms have improved significantly in run time and throughput, and sequencing costs have fallen as well. Nevertheless, high-throughput sequencing technology still suffers from a high error rate. In addition, the market for commercial sequencing instruments and reagents has been monopolized by foreign manufacturers. To change this situation, it is necessary to develop a domestic sequencer with independent intellectual property. Building on the AG series sequencers developed by the State Key Laboratory of Bioelectronics at Southeast University, this study investigates the sources of systematic error in the AG-100 sequencer and its error correction model, in order to raise the accuracy of its sequencing data and to develop base-calling algorithms and software. It then designs decoding schemes for the AG-200 sequencing platform, which implements two-base coding technology.
In this thesis, the sequencing errors prevalent among high-throughput sequencing platforms are first described, and a number of corresponding correction tools are introduced. A base-calling pipeline is then built for the AG-100 high-throughput sequencing platform, which uses a ligation strategy in the sequencing process. The two main tasks of this pipeline are correcting fluorescence spectrum crosstalk and calibrating phasing. Crosstalk is treated as a linear transformation problem, and a corresponding mathematical model is constructed. Estimating the crosstalk matrix is a crucial step in the correction workflow, so an iterative algorithm is adopted to estimate it, and the intensity data are corrected during the iterative process.
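The linear treatment of crosstalk described above can be illustrated with a minimal sketch. It assumes the crosstalk matrix is already known, whereas the thesis estimates it iteratively; the matrix values, the channel-to-base order, and the names `CROSSTALK`, `solve`, `mix` and `call_base` are all hypothetical. Under the model observed = M × true, correction amounts to solving that linear system for the true dye signal and calling the strongest channel.

```python
# Hypothetical 4x4 crosstalk matrix M: column j holds the response of the
# four detection channels to a unit signal from dye j. Values are invented
# for illustration and are not taken from the AG-100 platform.
CROSSTALK = [
    [1.00, 0.30, 0.00, 0.00],
    [0.20, 1.00, 0.00, 0.00],
    [0.00, 0.00, 1.00, 0.40],
    [0.00, 0.00, 0.25, 1.00],
]
BASES = "ACGT"  # assumed channel-to-base order


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("crosstalk matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                for k in range(col, n + 1):
                    aug[row][k] -= factor * aug[col][k]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mix(matrix, true_signal):
    """Forward model: what the detector observes for a given true dye signal."""
    return [sum(m * s for m, s in zip(row, true_signal)) for row in matrix]


def call_base(observed):
    """Undo the crosstalk, then call the base with the strongest corrected signal."""
    corrected = solve(CROSSTALK, observed)
    best = max(range(len(BASES)), key=lambda i: corrected[i])
    return BASES[best], corrected


if __name__ == "__main__":
    observed = mix(CROSSTALK, [0.0, 100.0, 0.0, 0.0])
    print(observed, call_base(observed))
```

In a real pipeline the matrix would itself be estimated from the intensity data, for instance by the kind of iterative procedure the abstract mentions; this sketch covers only the correction step.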
To calibrate phasing errors, each sequencing read is segmented into several sub-reads according to the order of the ligation sequencing run. Subsets of the segmented reads are then selected to construct the respective phasing matrices, and these matrices are applied to the separated reads. After that, the fragments of each read are recombined. Finally, software implementing this base-calling pipeline is developed for the AG-100 platform. It takes fluorescence intensity data as input and outputs a file of sequences and quality scores in a format similar to FASTQ.
To make sequencing more efficient, our laboratory proposed a new strategy, based on sequencing by ligation, in which two nucleotides are synthesized simultaneously. Compared with sequencing in which a single nucleotide is synthesized, the new strategy yields more accurate data, but its results are less intuitive and therefore require an additional decoding step. A corresponding decoding scheme is proposed; tested on a simulated dataset, it produced entirely correct decoding results. The scheme is then extended to data containing sequencing errors: all coding patterns of the three coding sequences are analyzed in depth, and schemes are put forward for handling the sequencing errors that may arise.
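The abstract does not spell out the coding rules, so the following is only an illustrative sketch under an assumed scheme, not the thesis's method. It supposes that each of the three coded sequences records, for every position, which of two complementary two-nucleotide groups the base belongs to (A/C versus G/T, A/G versus C/T, A/T versus C/G). These groupings, the symbols, and the names `CODES`, `encode` and `decode` are all hypothetical. Under that assumption, intersecting the groups at a position recovers the base, and an inconsistent combination exposes a sequencing error.

```python
# Assumed coding tables (hypothetical, IUPAC-style symbols): each table splits
# the four bases into two groups of two. Not taken from the AG-200 platform.
CODES = [
    {"M": set("AC"), "K": set("GT")},
    {"R": set("AG"), "Y": set("CT")},
    {"W": set("AT"), "S": set("CG")},
]


def encode(seq):
    """Produce the three coded sequences for a plain base sequence."""
    coded = []
    for table in CODES:
        symbols = []
        for base in seq:
            symbols.append(next(sym for sym, group in table.items() if base in group))
        coded.append("".join(symbols))
    return coded


def decode(coded):
    """Intersect the groups named at each position; emit 'N' when they conflict."""
    decoded = []
    for symbols in zip(*coded):
        candidates = set("ACGT")
        for table, sym in zip(CODES, symbols):
            candidates &= table[sym]
        # Exactly one survivor means the three symbols agree on a base.
        decoded.append(candidates.pop() if len(candidates) == 1 else "N")
    return "".join(decoded)


if __name__ == "__main__":
    coded = encode("GATTACA")
    print(coded, decode(coded))
```

With these assumed tables, a single wrong symbol at a position always leaves an empty intersection, so the error is detected and reported as `N` rather than silently miscalled. Correcting it would need additional information such as quality scores.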
Keywords/Search Tags: high-throughput sequencing, fluorescence spectrum crosstalk, phasing, sequencing with two nucleotides synthesized simultaneously, decoding