Method Study On Sequencing Data Analysis Of Semiconductor Sequencer

Posted on: 2018-07-12
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhao
Full Text: PDF
GTID: 2310330542991339
Subject: Control Science and Engineering
Abstract/Summary:
With the emergence of precision medicine, gene sequencing is receiving growing attention. Gene sequencing can help predict disease risk and allow treatment to be proposed earlier. Semiconductor sequencers have become one of the mainstream sequencing platforms because they are fast, low-cost, and convenient. They use semiconductor chips to convert chemical signals into voltage signals, which replaces traditional optical sequencing technology. Despite these advantages, their sequencing accuracy is low on runs of repeated nucleotides (homopolymers), and the measured homopolymer length is often inaccurate. These limitations hinder the accurate identification of genetic variants. Semiconductor sequencers determine homopolymer length from the sequencing voltage, but homopolymers of the same length can produce different sequencing voltages, which leads to low sequencing accuracy in homopolymer regions.

To address this problem, this work studies the sequencing voltages in detail. First, the raw sequencing data from the semiconductor sequencer are preprocessed and aligned to the reference genome. Next, the relevant sequencing data are extracted and grouped according to the factors that influence the sequencing voltage. The distribution of the voltage signals in each group is then analyzed and found to fit a normal distribution. Finally, based on these voltage distributions, a model based on Bayesian theory is proposed to predict homopolymer lengths, and a dynamic programming algorithm is used to calculate alignment scores between the sequences and the reference genome. Building on both, an integrated model based on the Naive Bayes classifier and the dynamic programming algorithm is proposed to predict homopolymer lengths and to correct sequencing biases based on the predicted lengths.

The experimental results show that the proposed model has an error rate of 0.054% in predicting homopolymer lengths, whereas the algorithm built into the semiconductor sequencer has an error rate of 2.111%. Of all the homopolymer sequencing data incorrectly determined by the semiconductor sequencer, 97.453% of the errors could be corrected by the method described here. The integrated model significantly improves the alignment accuracy of the semiconductor sequencer in homopolymer regions.
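The Bayesian length prediction described in the abstract can be illustrated with a minimal sketch. It assumes that each homopolymer length has its own normal voltage distribution and that the length with the highest posterior probability is chosen. The function names (`fit_gaussians`, `predict_length`) and the toy voltage values are hypothetical and are not taken from the thesis. The thesis also groups data by additional factors that are not listed in the abstract, so this sketch groups by length only.

```python
import math
from collections import defaultdict

def fit_gaussians(samples):
    """Estimate (mean, std, prior) of the voltage for each homopolymer length.

    samples: iterable of (true_length, voltage) pairs, e.g. obtained from
    reads aligned to a reference genome (hypothetical input format).
    """
    by_length = defaultdict(list)
    for length, voltage in samples:
        by_length[length].append(voltage)
    total = sum(len(vs) for vs in by_length.values())
    params = {}
    for length, vs in by_length.items():
        mean = sum(vs) / len(vs)
        var = sum((v - mean) ** 2 for v in vs) / len(vs)
        std = max(math.sqrt(var), 1e-6)  # guard against zero variance
        params[length] = (mean, std, len(vs) / total)
    return params

def log_normal_pdf(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def predict_length(voltage, params):
    """Return the length maximizing log prior + log likelihood (MAP estimate)."""
    def log_posterior(length):
        mean, std, prior = params[length]
        return math.log(prior) + log_normal_pdf(voltage, mean, std)
    return max(params, key=log_posterior)

# Toy, made-up training data: (homopolymer length, normalized voltage).
toy_samples = [
    (1, 0.9), (1, 1.0), (1, 1.1),
    (2, 1.9), (2, 2.0), (2, 2.1),
    (3, 2.8), (3, 3.0), (3, 3.2),
]
params = fit_gaussians(toy_samples)
predicted = [predict_length(v, params) for v in (1.05, 1.95, 3.1)]
```

Working in log space avoids numerical underflow when a voltage lies far from a group's mean. The abstract's integrated model additionally uses dynamic programming alignment scores, which this sketch does not cover.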
Keywords/Search Tags:Gene sequencing, semiconductor sequencer, homopolymer, sequencing voltage, integrated model