
A Study Of Reference Assisted Misassembly Detection Algorithm Using Short And Long Reads

Posted on: 2020-11-22
Degree: Master
Type: Thesis
Country: China
Candidate: C J Song
Full Text: PDF
GTID: 2370330578457275
Subject: Software engineering
Abstract/Summary:
Contigs assembled from second-generation sequencing short reads may contain misassemblies, which complicate downstream analysis or even lead to incorrect results. Fortunately, as more and more sequenced species become available, it becomes possible to use the reference genome of a closely related species to detect misassemblies. In addition, long reads from third-generation sequencing technologies are increasingly widely used and can also help detect misassemblies. Here, we introduce ReMILO, a reference-assisted misassembly detection algorithm that uses both short reads and long reads. ReMILO is divided into two modules according to the data used.

(1) Misassembly detection module based on short reads and a reference genome. ReMILO aligns the initial short reads to both the contigs and the reference genome, and then constructs a novel data structure called the red-black multipositional de Bruijn graph to detect misassemblies. This data structure is a variant of the de Bruijn graph: it uses the short-read alignment information to fuse the positions of each short read on the contig and on the reference genome into the graph. This is equivalent to reconstructing the contig, so that each base of the contig has a corresponding node in the graph. Finally, the difference between the positions of the short reads aligned on the contig and on the reference genome is used to detect misassemblies.

(2) Misassembly detection module based on long reads. The first part of this module corrects the long reads. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively low. ReMILO wraps MECAT to achieve high-throughput long-read self-correction while keeping MECAT's speed. FLAS finds additional alignments from the MECAT prealigned long reads to improve the correction throughput, and removes misalignments to improve accuracy. In addition, FLAS uses the corrected long-read regions to correct the uncorrected ones, further improving the throughput. The final part of this module detects misassemblies: ReMILO aligns the contigs to the corrected long reads and identifies where they differ from the long reads, detecting additional misassemblies.

In our performance tests on short-read assemblies of human chromosome 14 data, ReMILO detects 0.5-13.3% more extensive misassemblies and 2.5-15.5% more local misassemblies than existing algorithms, with false detections 0.1-12.4% lower. On hybrid short- and long-read assemblies of S. pastorianus data, ReMILO also detects 1.1-14.2% extensive misassemblies and 0.6-23.4% local misassemblies. Experimental results on multiple datasets demonstrate that ReMILO has good sensitivity and precision in detecting misassemblies.
Keywords/Search Tags: Misassemblies, Long reads, Reference genome
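
The position-difference idea in module (1) can be illustrated with a minimal Python sketch. It is not ReMILO's red-black multipositional de Bruijn graph; it is a simplified stand-in that assumes each short read already has one alignment start on the contig and one on the reference. The `ReadHit` record, the `find_offset_jumps` function, and the `tolerance` threshold are all hypothetical names chosen for illustration. In a correctly assembled region, the offset between the reference position and the contig position stays roughly constant from read to read, so a large jump in that offset marks a candidate misassembly.

```python
from typing import List, NamedTuple, Tuple


class ReadHit(NamedTuple):
    """Hypothetical record: one short read aligned to both contig and reference."""
    read_id: str
    contig_pos: int  # 0-based alignment start on the contig
    ref_pos: int     # 0-based alignment start on the reference genome


def find_offset_jumps(hits: List[ReadHit], tolerance: int = 50) -> List[Tuple[int, int]]:
    """Return contig intervals (left_pos, right_pos) between consecutive reads
    whose reference-minus-contig offset differs by more than `tolerance`.

    Assumption: `tolerance` absorbs small indels; its value here is arbitrary.
    """
    ordered = sorted(hits, key=lambda h: h.contig_pos)
    candidates = []
    for prev, cur in zip(ordered, ordered[1:]):
        prev_offset = prev.ref_pos - prev.contig_pos
        cur_offset = cur.ref_pos - cur.contig_pos
        if abs(cur_offset - prev_offset) > tolerance:
            candidates.append((prev.contig_pos, cur.contig_pos))
    return candidates


# Toy data: reads r1-r3 are collinear with the reference, r4-r5 map far away.
hits = [
    ReadHit("r1", 0, 1000),
    ReadHit("r2", 100, 1100),
    ReadHit("r3", 200, 1203),  # 3 bp shift, within tolerance
    ReadHit("r4", 300, 9300),  # offset jumps by 8997 bp
    ReadHit("r5", 400, 9400),
]
print(find_offset_jumps(hits))  # [(200, 300)]
```

The sketch only localizes a candidate breakpoint between two neighbouring reads. A real detector would also have to handle reads with multiple alignments, reverse-strand hits, and true structural differences between the target and the related reference species.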