
Research On Genomic Sequence Alignment Methods Based On High-throughput Sequencing Data

Posted on: 2022-09-26
Degree: Doctor
Type: Dissertation
Country: China
Candidate: W Quan
GTID: 1480306569982729
Subject: Computer application technology
Abstract/Summary:
With the rapid development of high-throughput sequencing technology, research in genomics, transcriptomics, proteomics, and other omics fields has made great progress and has effectively promoted the development of bioinformatics, genomics, and clinical medicine. In general, the first step in analyzing high-throughput sequencing data is mapping reads to a reference genome to reconstruct the genomic sequence and variations of the sequenced sample. The study of sequence alignment algorithms for high-throughput sequencing data is therefore of great significance to the analysis and interpretation of such data. However, because of the large number of repetitive sequences in the genome, differences between the reference genome and individual genomes, and inevitable sequencing errors, current sequence alignment methods still suffer from low alignment accuracy, low sensitivity, and slow mapping speed. There is an urgent need for more effective sequence alignment methods for high-throughput sequencing data.

This thesis focuses on sequence alignment for high-throughput sequencing data. To improve sensitivity, accuracy, and alignment speed, novel sequence alignment algorithms and tools are designed and developed to address the challenges faced by current genome sequence alignment. The main contents are as follows:

(1) Because the conventional Burrows-Wheeler Transform (BWT) indexing method cannot effectively support approximate matching of seed sequences, this thesis proposes a novel genome indexing method named fBWT, which is based on an improved BWT structure. First, the method constructs a multi-level hierarchical local index structure, named sBWT, for all fixed-length short repetitive sequences in the genome, so that the local index of each short repetitive sequence contains its local sequence context, including the predecessor and successor sequences. Second, the global index of the genome is constructed by building the FM-index of the reference genome. Finally, the mapping between the global index and the local indexes of fixed-length short repeats, as well as the mapping among the multi-level hierarchical local indexes, is established through sBWT and the relation array. This indexing method effectively supports maximal approximate matches of seed sequences during sequence alignment and improves the sensitivity and recall of seed sequences when selecting candidate locations.

(2) Because exact-matching seeds cannot effectively locate candidate positions in the approximate repeat sequences of the genome, this thesis proposes a sequence alignment method named MAM, which is based on maximal approximate matching seeds. The method first searches the fBWT index for maximal approximate matching seeds generated from the reads and performs a preliminary filtering of candidate locations. It then applies a chaining method to the candidate locations of each read to further shrink the candidate set. Finally, local alignments of the reads against the reference genome are performed at the remaining candidate locations, and the best alignment location is output. This method effectively reduces the number of candidate alignment locations in the approximate repetitive sequences of the genome and improves alignment speed.

(3) Because an index of the reference genome alone cannot effectively represent knowledge of population variations, this thesis proposes a population-genome indexing method named SALT-index, which integrates the reference genome with genomic variations. The method constructs a genome variation graph (referred to as the variation graph) by integrating genomic variations into the reference genome. It then converts the variation graph into a primary reference genome and an alternative reference genome, and builds a Ferragina-Manzini (FM) index for each of them. This indexing method effectively supports the selection of candidate locations for seeds that contain variations and improves alignment performance for seeds in variation-rich regions.

(4) Because current reference-genome-oriented sequence alignment methods cannot effectively distinguish genomic variations from sequencing errors, this thesis proposes a sequence alignment method named SALT, which supports variation-aware sequence alignment. The method first performs exact matching of seeds against the primary index of SALT-index, which contains no variation information, and then performs a second round of exact matching for the seeds that remain unmatched against the alternative index of SALT-index, which contains variation information. After chain filtering and candidate location shrinking, the method extends the seeds using either of two variation-aware pairwise alignment algorithms with different penalty strategies and determines the best alignment locations. This method can distinguish the genomic variations from the sequencing errors contained in the reads and can effectively cope with the alignment challenges posed by population genetic diversity.
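
The FM-index and exact seed matching mentioned in parts (1), (3), and (4) can be illustrated with a small sketch. The code below is not the fBWT or SALT-index implementation described in the thesis; it is a minimal textbook FM-index with backward search, assuming a short DNA string that does not contain the sentinel `$`. The function names `build_fm_index` and `backward_search` are hypothetical and chosen only for this illustration, and the naive suffix sort is suitable for toy inputs only.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table, and Occ table."""
    text += "$"  # sentinel, sorts before A, C, G, and T
    # Naive suffix array construction; real indexes use linear-time methods.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:  # interval is empty, so the seed does not occur
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    sa, c_table, occ = build_fm_index("ACGTACGTTACG")
    print(backward_search("ACG", sa, c_table, occ))
```

A seed is extended one character at a time from right to left, and each step narrows the suffix-array interval of matching suffixes. This sketch supports exact matches only, which is the limitation that the approximate-matching and variation-aware indexes described above aim to address.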
Keywords/Search Tags: Bioinformatics, High-throughput sequencing data, Genome sequence alignment, Genome index construction, Genome variation knowledge