
Research on Genome Misassembly Identification Method Based on High-throughput Sequencing Data

Posted on: 2022-11-19
Degree: Master
Type: Thesis
Country: China
Candidate: Y Meng
GTID: 2480306749458104
Subject: Philosophy of science and technology
Abstract/Summary:
With the rapid development of high-throughput sequencing (HTS) technology, a growing number of researchers are studying genome assembly. However, because of genome complexity, short read lengths, and the limitations of assembly algorithms themselves, the contigs/scaffolds produced by assembly may contain misassemblies such as insertions, deletions, and misjoins, which adversely affect downstream data analysis. Although genome assembly algorithms introduced in recent years have significantly improved assembly accuracy, the problem of misassemblies has not been well solved. Working from HTS data, this thesis proposes varSig (verify alignment reference segment information generator), which parses BAM files using the htslib library, in order to handle BAM files that record CIGAR strings in different formats and to analyze sequence alignment information more comprehensively and accurately. More importantly, focusing on the problem of misassemblies, this thesis analyzes the alignment information of paired-end reads in depth and proposes Misasm (misassembly detector), an efficient method for identifying misassemblies from HTS data. The main research contents are as follows.

(1) varSig is proposed to parse BAM files of sequence alignments. First, varSig classifies BAM files according to the recording format of their CIGAR strings, which makes it compatible with different types of BAM files and lets it extract sequence alignment information in a targeted manner. Then, alignment information is obtained at the level of bases, reads, and contigs/scaffolds, along with other relevant details. Finally, coverage, insertions, deletions, clipping, and the pairing orientation and distance of paired-end reads are counted, and the sequence features related to misassemblies are analyzed.

(2) Misasm is proposed to identify misassemblies. Misasm builds on varSig and makes full use of paired-end read information to extract the features associated with the various kinds of misassembly. The algorithm uses multi-threading to detect misassemblies more efficiently. Misasm characterizes misjoin and indel errors in terms of feature indicators such as clipping errors, abnormal coverage, abnormal pairing orientation and abnormal insert size of paired-end reads, and excessive insertion or deletion events. Misjoin and indel errors are then identified through feature extraction and refined calculation. Finally, two sets of experiments, on Escherichia coli and on Homo sapiens chromosome 14, are designed to verify the effectiveness of Misasm in identifying misassemblies.
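The alignment features named in the abstract can be illustrated with a small sketch. This is not the varSig or Misasm implementation: the function names, the `k` threshold, and the discordance rule are hypothetical and chosen only for illustration. The sketch relies on the CIGAR operation letters from the SAM specification and treats `M` and `=`/`X` uniformly, which is one plausible reading of the "different recording formats" mentioned above.

```python
import re
from collections import Counter

# CIGAR operation letters as defined by the SAM specification.
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_op_totals(cigar: str) -> Counter:
    """Sum the lengths of each CIGAR operation in one alignment."""
    totals = Counter()
    for length, op in _CIGAR_RE.findall(cigar):
        totals[op] += int(length)
    return totals


def reference_span(cigar: str) -> int:
    """Reference bases covered by the alignment.

    M, D, N, = and X consume the reference, so a file that records
    matches as 'M' and one that records them as '='/'X' give the same span.
    """
    totals = cigar_op_totals(cigar)
    return sum(totals[op] for op in "MDN=X")


def is_discordant(insert_size: int, mean: float, sd: float,
                  same_strand: bool, k: float = 3.0) -> bool:
    """Hypothetical rule: flag a read pair whose insert size lies more than
    k standard deviations from the library mean, or whose mates map to the
    same strand (an unexpected orientation for a standard paired-end library).
    """
    return same_strand or abs(insert_size - mean) > k * sd


if __name__ == "__main__":
    ops = cigar_op_totals("5S10M2I3D20M")
    print("clipped:", ops["S"], "inserted:", ops["I"], "deleted:", ops["D"])
    print("reference span:", reference_span("5S10M2I3D20M"))
    print("discordant:", is_discordant(900, mean=500, sd=50, same_strand=False))
```

Per-read totals like these would typically be aggregated over windows of a contig, so that regions with unusually many clipped, inserted, or deleted bases, or many discordant pairs, stand out as candidate misassemblies.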
Keywords/Search Tags:high-throughput sequencing technology, genome assembly, misassembly, paired-end reads