Research On Contigs Local Mis-assembly Of High Throughput Sequencing Data

Posted on:2021-03-21

Degree:Master

Type:Thesis

Country:China

Candidate:J W Fan

GTID:2370330611956087

Subject:Computer technology

Abstract/Summary:

High-throughput sequencing can sequence hundreds of thousands to millions of DNA molecules in parallel. Because it generates millions of short reads of a whole genome in a short time and at low cost, it has been widely adopted. However, DNA may be spliced incorrectly during replication in the sequencing process, which affects subsequent gene analysis. In response, various biotechnology companies are working to optimize algorithms that detect incorrect splicing. Current detection algorithms fall into two main types: those that identify incorrect splicing with a reference genome, and those that identify it without one. For some eukaryotes, because related reference genomes are lacking, sequencing takes longer and accuracy is lower. To address these two problems, this thesis proposes and implements the Lo Mo algorithm, a reference-free error detection algorithm for high-throughput sequencing data.

The sequences read by a high-throughput sequencing platform are called reads, and the longer sequences obtained by splicing reads together are called contigs. Contigs obtained by splicing often contain many errors, the main one being mis-assembly. The algorithm uses two new methods: 2k read-length prediction correction and short read-length region re-identification. The 2k read-length prediction correction method detects and compares the long read-length data in advance to predict the positions where splicing errors may occur; once the predicted positions are obtained, the two ends of each position are extracted and marked. Short read-length region re-identification performs feature recognition on the short read-length regions of the contigs obtained after extraction, and the final region is then obtained by a border trimming algorithm.

In the 2k read-length prediction correction, the algorithm makes full use of the mapping of mate-pair (MP) data onto contigs: it analyzes pairs with excessive pairing distances or inconsistent directions to preselect assembly errors, and it reduces the number of false positives among the preselected assembly errors based on the reads at the ends of the pairs. It can therefore break incorrect contigs at the erroneous breakpoints during assembly. The algorithm also makes up for the shortcomings caused by the short read length of paired-end (PE) data alone. By combining the advantages of the long insert size and long span of MP data, it achieves higher accuracy in recognizing low-complexity DNA sequences. Finally, experiments on simulated E. coli data show that the algorithm performs well in both accuracy and sensitivity.
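
The preselection idea described in the abstract (flagging mate pairs whose mapped distance or orientation is abnormal, then keeping only positions supported by several such pairs) can be illustrated with a minimal Python sketch. This is not the Lo Mo implementation. The `MatePair` record, the function names, the 3-standard-deviation distance cutoff, the expected `("+", "-")` orientation, and the `min_support` and `max_gap` thresholds are all assumptions made for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MatePair:
    """Hypothetical record for one mate pair mapped onto a single contig."""
    contig: str
    pos1: int      # leftmost mapped position of mate 1
    pos2: int      # leftmost mapped position of mate 2
    strand1: str   # "+" or "-"
    strand2: str   # "+" or "-"


def is_discordant(mp: MatePair, mean_insert: float, sd_insert: float,
                  n_sd: float = 3.0,
                  expected: Tuple[str, str] = ("+", "-")) -> bool:
    """Flag a pair whose distance or orientation deviates from the library.

    The cutoff (n_sd standard deviations) and the expected orientation are
    assumptions; real mate-pair libraries may use a different orientation.
    """
    # Order the two mates by mapped position before comparing strands.
    if mp.pos1 <= mp.pos2:
        strands = (mp.strand1, mp.strand2)
    else:
        strands = (mp.strand2, mp.strand1)
    distance = abs(mp.pos2 - mp.pos1)
    bad_distance = abs(distance - mean_insert) > n_sd * sd_insert
    bad_orientation = strands != expected
    return bad_distance or bad_orientation


def candidate_regions(pairs: List[MatePair], mean_insert: float,
                      sd_insert: float, min_support: int = 3,
                      max_gap: int = 500) -> List[Tuple[str, int, int, int]]:
    """Cluster discordant pairs into (contig, start, end, support) candidates.

    Requiring min_support pairs per cluster is one simple way to reduce
    false positives; it stands in for the thesis's own filtering step.
    """
    by_contig: Dict[str, List[int]] = {}
    for mp in pairs:
        if is_discordant(mp, mean_insert, sd_insert):
            by_contig.setdefault(mp.contig, []).append(min(mp.pos1, mp.pos2))

    regions: List[Tuple[str, int, int, int]] = []
    for contig, positions in sorted(by_contig.items()):
        positions.sort()
        start = prev = positions[0]
        support = 1
        for p in positions[1:]:
            if p - prev <= max_gap:
                support += 1          # extend the current cluster
            else:
                if support >= min_support:
                    regions.append((contig, start, prev, support))
                start, support = p, 1  # begin a new cluster
            prev = p
        if support >= min_support:
            regions.append((contig, start, prev, support))
    return regions
```

With a hypothetical library of mean insert size 2000 and standard deviation 200, three pairs spanning about 10 kb near position 5000 would form one candidate region, while a single wrongly oriented pair elsewhere would be discarded for lack of support.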

Keywords/Search Tags:

high-throughput sequencing, assembly error recognition, data alignment, contig

Related items

1	Assembly And Analysis Of High Throughput Sequencing Data
2	Research On Indel Recognition Method Of High Throughput Sequencing Data
3	Algorithmic Study On Long Read Assembly Error Correction Based On Linked Reads Sequencing Data
4	Research On Genomic Sequence Alignment Methods Based On High-throughput Sequencing Data
5	Research On Genome Misassembly Identification Method Based On High-throughput Sequencing Data
6	Optimizing High-throughput Biological Gene Sequencing Data Processing Algorithms Based On Hash
7	Analysis Of Error Model For High-Throughput Sequencing And Decoding Solution Design
8	Whole Microbial Genome Assembly And Analysis Based On Ion Torrent Sequencing Data
9	Assembling Of Klebsiella Pneumoniae Genome Based On High-throughput Sequencing Technology
10	Research Of Cross-kingdom SRNA Data Analysis Method Based On High-throughput Sequencing Data