Statistical models of sequencing error and algorithms of polymorphism detection

Posted on:2007-05-26

Degree:Ph.D

Type:Dissertation

University:University of Southern California

Candidate:Li, Ming

GTID:1458390005486418

Subject:Biology

Abstract/Summary:

Estimating sequencing error patterns is key to improving the accuracy of DNA sequencing. In chapter 2, we give the quality value a probabilistic interpretation and propose a conditional sequencing error model. In chapter 3, we model sequencing errors with a mixture of multinomials and a mixture of logistic regressions. In these models, the underlying DNA target is unknown and treated as missing data. The redundancy in the assembly allows us to impute the missing data with an EM algorithm. In the mixture of logistic regressions, we use piecewise linear functions of the quality value to handle nonlinear effects. We evaluate models with different knots by AIC and single-base discrepancy, and use a backward elimination algorithm for model selection. We apply these methods to a whole-genome assembly of C. jejuni and substantially improve the accuracy of the consensus sequence. Based on the predicted sequencing error patterns, we correct the bias in the quality values and assign base-calls new quality values with a probabilistic interpretation. We also attempt to extend this model by including more covariates, such as GC content. These statistical models provide a framework for the analysis of sequence assembly and can be applied directly to the daily practice of DNA sequencing.

Polymorphism detection in resequencing data is especially important for linkage analysis and association studies. We propose a novel method of trace preprocessing and spike alignment to identify polymorphisms accurately. In chapter 4, we introduce a procedure that preprocesses DNA sequencing traces and recovers signal spike trains in five steps: color correction, normalization, baseline subtraction, width estimation, and deconvolution. In chapter 5, we describe a dynamic programming algorithm for spike alignment and demonstrate polymorphism detection from the alignment. Based on this method, we develop software with a graphical user interface for resequencing data and make it available to the public. Our software offers a new perspective on polymorphism detection, especially for insertion-deletion polymorphisms in mononucleotide runs.
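As an illustration of two ideas named in the abstract, the sketch below shows (a) a probabilistic reading of a quality value and (b) a piecewise linear function of the quality value inside a logistic model. It is not the dissertation's implementation. It assumes the standard Phred convention Q = -10 log10(p), which the abstract does not state explicitly, and the function names, knots, and coefficients are hypothetical placeholders rather than fitted values.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a quality value q.

    Assumption: the standard Phred scale, Q = -10 * log10(p).
    """
    return 10.0 ** (-q / 10.0)


def error_prob_to_phred(p):
    """Inverse mapping: quality value implied by an error probability p."""
    return -10.0 * math.log10(p)


def hinge_basis(q, knots):
    """Piecewise linear basis in q: q itself plus max(0, q - k) per knot k.

    The knots are hypothetical; the abstract says only that models with
    different knots were compared.
    """
    return [float(q)] + [max(0.0, q - k) for k in knots]


def logistic_error_prob(q, intercept, coefs, knots):
    """Error probability from a logistic model that is piecewise linear in q.

    coefs must contain one coefficient per basis term (len(knots) + 1).
    """
    basis = hinge_basis(q, knots)
    if len(coefs) != len(basis):
        raise ValueError("need one coefficient per basis term")
    eta = intercept + sum(b * x for b, x in zip(coefs, basis))
    return 1.0 / (1.0 + math.exp(-eta))
```

Under the Phred assumption, a quality value of 20 corresponds to an error probability of 0.01. A fitted model of the second kind could be converted back to a recalibrated quality value with `error_prob_to_phred`, which is one way to read "new quality values with a probabilistic interpretation".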

Keywords/Search Tags:

Sequencing error, Polymorphism detection, Model, Algorithm
