
A Long Read Hybrid Error Correction Algorithm Based On Segmented PHMM

Posted on: 2022-07-16 | Degree: Master | Type: Thesis
Country: China | Candidate: L Y Hu | Full Text: PDF
GTID: 2480306335496724 | Subject: Organic Chemistry
Abstract/Summary:
Although next-generation sequencing technology has advantages in throughput and accuracy, its short read length prevents it from spanning repeat regions when the data volume is large, which makes downstream analysis difficult. Third-generation DNA sequencing technology produces longer reads. This makes up for some of the weaknesses of second-generation sequencing, but the number of erroneous bases also increases, so the accuracy is only about 85%. Researchers therefore usually combine the strengths of short reads and long reads, using short reads to correct long reads, so as to improve sequence accuracy while losing as little sequence length as possible.

Although the error rate of a long read (LR) is about 15%, correct bases still make up the large majority of the read. Running the forward-backward probability calculation and Viterbi decoding over the entire long read therefore increases running time and introduces redundant computation. On the basis of the Hercules algorithm, we propose a new hybrid error correction scheme. The well-matched parts of a long read are left unprocessed, and only the uncovered or poorly matched parts are corrected by the pHMM-based algorithm, which reduces running time while maintaining high accuracy. Error correction is divided into two main parts: preprocessing based on short-read alignment, and error correction based on the pHMM. The advantage of this scheme is that it combines the pHMM with the aligner, building on Hercules, which not only reduces the dependence on aligner performance to a certain extent but also reduces the running time compared with using the pHMM for global error correction.

To evaluate this method, we applied it to an E. coli data set and a Saccharomyces cerevisiae data set. We compared the experimental results with the uncompressed Hercules error-correction stage and found that, while accuracy was maintained, the running time for E. coli was reduced by 65% and the running time for Saccharomyces cerevisiae was reduced by a factor of 4.7.
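
The abstract does not state how the well-matched and poorly matched parts of a long read are told apart. As an illustration only, the idea of restricting pHMM correction to selected segments might be sketched as follows. The function name `segments_needing_correction`, the per-base inputs `coverage` and `mismatches`, and the `min_coverage` and `flank` parameters are all assumptions made for this sketch and are not taken from the thesis or from Hercules.

```python
from typing import List, Tuple


def segments_needing_correction(
    coverage: List[int],
    mismatches: List[bool],
    min_coverage: int = 1,
    flank: int = 0,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of a long read to correct.

    Hypothetical criterion: a base is flagged when its short-read
    coverage is below `min_coverage` (uncovered) or when the aligned
    short reads disagree with it (poorly matched). Everything outside
    the returned intervals would be left untouched.
    """
    n = len(coverage)
    flags = [c < min_coverage or m for c, m in zip(coverage, mismatches)]

    # Collect maximal runs of flagged positions.
    runs = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n))

    # Pad each run by `flank` bases and merge runs that now overlap or touch.
    merged: List[Tuple[int, int]] = []
    for s, e in runs:
        s, e = max(0, s - flank), min(n, e + flank)
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


if __name__ == "__main__":
    cov = [3, 3, 0, 0, 2, 2, 2, 1, 4, 4]
    mis = [False] * 5 + [True] + [False] * 4
    print(segments_needing_correction(cov, mis))
    print(segments_needing_correction(cov, mis, flank=1))
```

In a scheme of this shape, only the returned intervals would be handed to the forward-backward and Viterbi steps, so the cost of the pHMM stage scales with the flagged fraction of the read rather than with its full length.
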
Keywords/Search Tags:Bioinformatics, DNA sequence analysis, pHMM, Error correction