Hybrid Long Read Error Correction Algorithms Based on de Bruijn Graph and K-mers Read Alignment

Posted on: 2022-07-16
Degree: Master
Type: Thesis
Country: China
Candidate: G Liu
Full Text: PDF
GTID: 2518306536454814
Subject: Software engineering
Abstract/Summary:
The rise of long read sequencing on the Pacific Biosciences and Oxford Nanopore platforms has promoted the development of genomic data analysis. Compared with short read sequencing, long read sequencing can solve larger and more complex genome assembly problems. However, the error rate of long reads is very high: about 10%-15% for reads generated by Pacific Biosciences sequencing and as high as 30% for reads generated by Oxford Nanopore sequencing. Read correction is therefore very important for genetic engineering, yet existing read correction algorithms are not ideal for long reads with high error rates. To address this problem, this thesis designs effective serial and parallel algorithms for hybrid long read correction.

Based on a de Bruijn graph and k-mers read alignment, this thesis proposes a serial algorithm for hybrid long read correction called HdGEC. The algorithm first generates seeds by aligning long reads with short reads of the same species, and uses the PgSA index of the k-mers in the short reads to construct a de Bruijn graph with a variable value of k. It then anchors the seeds on the de Bruijn graph and traverses the graph, varying k, to connect the seeds into a seed sequence, such that the path connecting two adjacent seeds covers the region of the long read that is not aligned with short reads. Finally, the algorithm extends the seed sequence at its ends by continuing to traverse the variable-k de Bruijn graph until the seed sequence reaches the ends of the original long read, which completes the correction. HdGEC thus combines the advantage of the alignment-based strategy, which corrects the regions of a long read that are covered by short reads, with the advantage of the de Bruijn graph-based method, which corrects errors in the uncovered regions of the long read. Experimental results show that, compared with existing long read correction algorithms, the proposed algorithm obtains corrected sequences of high overall quality and is better suited to long read correction for medium and large-scale species in real biological data sets.

On the basis of the above work, this thesis designs a parallel algorithm, Par-HdGEC, for hybrid long read correction using a de Bruijn graph on a cluster system. The parallel algorithm implements distributed parallel computing based on the Hadoop and Hazelcast frameworks, using MapReduce and a distributed NoSQL store. It improves the shortest path algorithm and uses it to maximize the k-mer coverage along the path between two vertices of the de Bruijn graph, so that the coverage information of k-mers in the short reads is used effectively to correct errors in long reads and to reduce the loss of bases in long reads. Experimental results on real data sets of large species show that, compared with existing parallel algorithms for hybrid long read correction, Par-HdGEC obtains a higher long read correction rate, a higher base correction rate, and a larger Gain overall. Par-HdGEC also requires less running time as more computing nodes participate in processing, so it can effectively utilize the computing power of additional nodes in the cluster system. The research results of this thesis provide an algorithmic basis for biological big data analysis using corrected long reads.
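The seed-connection idea described in the abstract can be illustrated with a small Python sketch. It is not the thesis's implementation. It assumes a single fixed k instead of a variable k, a plain in-memory dictionary instead of the PgSA index, and an edge cost of 1/count as a simple stand-in for the improved shortest path algorithm that favors well-covered k-mers. The function names `count_kmers`, `build_graph`, and `connect_seeds` are hypothetical.

```python
from collections import Counter, defaultdict
import heapq
import math


def count_kmers(reads, k):
    """Count every k-mer that occurs in the short reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_graph(kmer_counts, min_count=1):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and each k-mer seen
    at least min_count times is an edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for kmer, count in kmer_counts.items():
        if count >= min_count:
            graph[kmer[:-1]].append((kmer[1:], count))
    return graph


def connect_seeds(graph, source, target, max_len):
    """Search for a path from the (k-1)-mer `source` to the (k-1)-mer
    `target` and return the spelled sequence, or None if none is found.

    Assumption: the edge cost is 1/count, so paths through highly covered
    k-mers are preferred. Paths longer than max_len bases are abandoned.
    """
    heap = [(0.0, source, source)]
    best = {source: 0.0}
    while heap:
        cost, node, seq = heapq.heappop(heap)
        if node == target:
            return seq
        if cost > best.get(node, math.inf) or len(seq) >= max_len:
            continue
        for nxt, count in graph.get(node, ()):
            new_cost = cost + 1.0 / count
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, seq + nxt[-1]))
    return None


if __name__ == "__main__":
    short_reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC"]
    graph = build_graph(count_kmers(short_reads, k=4))
    # Bridge the gap between two seeds anchored at "ATG" and "TCC".
    print(connect_seeds(graph, "ATG", "TCC", max_len=50))  # ATGGCGTGCAATCC
```

In this toy example the bridging path stands in for the corrected bases of the long read region between the two seeds. The `max_len` bound is an assumed safeguard that keeps the search from wandering through repeats far beyond the expected gap length.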
Keywords/Search Tags: Long read correction, Hybrid correction algorithm, de Bruijn graph, Cluster computing, Parallel correction algorithm