Research On Contigs Local Mis-assembly Of High Throughput Sequencing Data

Posted on: 2021-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: J W Fan
GTID: 2370330611956087
Subject: Computer technology
Abstract/Summary:
High-throughput sequencing can sequence hundreds of thousands to millions of DNA molecules in parallel. Because it generates millions of short reads of whole genomes in a short time at low cost, it is widely adopted. However, DNA may be spliced incorrectly during replication in the sequencing process, which affects subsequent gene analysis. In response, various biological companies are committed to optimizing algorithms for detecting incorrect splicing. Current recognition algorithms fall into two main types: those that identify incorrect splicing using a reference genome, and those that identify it without a reference genome. For some eukaryotes that lack a related genome, sequencing takes longer and accuracy is lower. To address these two problems, this paper proposes and implements Lo Mo, a high-throughput error detection algorithm that requires no reference genome.

The sequences read by a high-throughput sequencing platform are called reads, and the longer sequences assembled from reads are called contigs. Contigs obtained by splicing often contain many errors, the main one being mis-assembly. The algorithm uses two new methods: 2k read-length prediction correction and short-read region re-identification. The 2k read-length prediction correction first detects and compares the long reads to predict positions where splicing errors may occur and, once the predicted positions are obtained, extracts and marks the two ends of each position. Short-read region re-identification performs short-read region feature recognition on the contigs obtained after extraction, and the final region is then obtained by a border trimming algorithm.

In the 2k read-length prediction correction, the algorithm makes full use of the mapping of MP (mate-pair) data onto contigs: it analyzes read pairs with excessive pairing distances and inconsistent directions to preselect assembly errors, and it reduces the number of false positives among the preselected errors based on the reads at the ends of the pairs. It can therefore break incorrect contigs at the erroneous breakpoints in time during assembly. The algorithm also compensates for the shortcomings caused by the short read length of PE (paired-end) data alone. It combines the long insert size and long span of MP data, and it recognizes low-complexity DNA sequences with higher accuracy. Finally, experiments on simulated E. coli data show that the algorithm performs well in both accuracy and sensitivity.
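
The preselection idea described above (flagging mate pairs whose pairing distance is too large or whose directions are inconsistent, then breaking the contig where such pairs cluster) can be illustrated with a minimal Python sketch. This is not the Lo Mo implementation, which the abstract does not specify. The `MatePair` record, the function names, the "mean plus three standard deviations" distance threshold, the opposite-strand orientation rule, and the window-based support count are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MatePair:
    """Hypothetical record of one mate pair mapped onto a single contig."""
    pos1: int       # mapped position of the first read on the contig
    pos2: int       # mapped position of the second read on the contig
    forward1: bool  # True if the first read maps to the forward strand
    forward2: bool  # True if the second read maps to the forward strand


def is_discordant(pair, mean_insert, sd_insert, n_sd=3.0):
    """Flag a pair whose distance is too large or whose directions disagree.

    Assumption: a concordant pair maps to opposite strands and spans at most
    mean_insert + n_sd * sd_insert bases.
    """
    distance = abs(pair.pos2 - pair.pos1)
    too_far = distance > mean_insert + n_sd * sd_insert
    bad_direction = pair.forward1 == pair.forward2
    return too_far or bad_direction


def candidate_breakpoints(pairs, mean_insert, sd_insert, window=500, min_support=3):
    """Return start positions of windows touched by enough discordant pairs.

    Requiring min_support pairs per window is a crude stand-in for
    false-positive reduction; it is an assumption, not the thesis's method.
    """
    support = Counter()
    for pair in pairs:
        if is_discordant(pair, mean_insert, sd_insert):
            # Count each pair at most once per window.
            for idx in {pair.pos1 // window, pair.pos2 // window}:
                support[idx] += 1
    return sorted(idx * window for idx, n in support.items() if n >= min_support)


def split_contig(sequence, breakpoints):
    """Cut a contig sequence at each breakpoint that lies strictly inside it."""
    inner = [b for b in sorted(set(breakpoints)) if 0 < b < len(sequence)]
    cuts = [0] + inner + [len(sequence)]
    return [sequence[a:b] for a, b in zip(cuts, cuts[1:])]
```

In use, one would call `candidate_breakpoints` with the library's estimated insert-size mean and standard deviation, then pass the returned positions to `split_contig`. A real pipeline would take the mappings from an aligner's output and would need the orientation convention of the specific mate-pair library.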
Keywords/Search Tags: high-throughput sequencing, assembly error recognition, data alignment, contig