Research On Genomic Reads Mapping Based On De Bruijn Graph Model

Posted on: 2020-08-31
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Z Guo
GTID: 1360330590472806
Subject: Computer application technology
Abstract/Summary:
With the rapid development of sequencing technology and its steadily falling cost, individual genome sequencing has become the main approach to studying the genotypes of different species, genomic variation, and related diseases. Bioinformatics offers new ways to explore life activities, understand disease mechanisms, and treat diseases, and it greatly promotes the development of molecular biology, genomics, genetics, and medicine. Genomic read mapping, as a basis of genomic data analysis, is essential to variant calling, gene expression analysis, alternative splicing studies, biological network computation, and other related areas. Recovering the true genomic positions of sequencing reads is the foundation of subsequent biological computation. However, because of massive repetitive sequences and highly complex genomic regions, ever-increasing sequencing data sizes, and the technical limitations of sequencing technology, mapping large numbers of reads to reference genomes effectively and efficiently remains a great challenge.

This thesis focuses on genomic read mapping and sequence alignment. Its aim is to evaluate current read alignment methods and to propose an approach to the non-linear representation and organization of genomes. It presents a de Bruijn graph model-based genome index that organizes the large number of repetitive sequences in genomes. To raise the practical value of the graph model, it also presents a novel de Bruijn graph construction method that builds graphs for big datasets. It develops a de Bruijn graph model-based read alignment algorithm that achieves higher accuracy, sensitivity, and speed. Moreover, a variation-aware read alignment algorithm is presented to improve read alignment in highly complex regions. The research contents are as follows:

(1) This thesis introduces the hash table-based genomic data storage and indexing method and the basic idea of the seed-and-extension scheme. A de Bruijn graph-based indexing structure named RdBG and its three-level storage mode are proposed, together with several basic operations built on the characteristics of the index. The results demonstrate that this structure can effectively organize and index the repetitive sequences in genomes, so that the number of candidate seeds is greatly decreased.

(2) To handle genomic data from multiple species, as in metagenomics, and the ever-increasing size of individual sequencing data, we present deGSM, a de Bruijn graph construction and compaction method based on external sorting. deGSM constructs the graph for arbitrarily sized data in scalable memory, and it solves the problem of the prohibitively large memory usage of traditional graph building methods, which restricts the data size they can handle. At the same time, deGSM constructs the BWT (Burrows-Wheeler Transform) of all the unitigs by taking advantage of the relationship between the suffix tree and the de Bruijn graph. deGSM can play an important role in de Bruijn graph-based data analysis, including big data processing and compressed storage.

(3) This thesis presents a seed-and-extension-based read alignment algorithm and develops the read aligner deBGA, which takes advantage of the de Bruijn graph model index. First, the whole workflow of deBGA, together with its heuristic cyclic process, is proposed. Then the concept of the Uni-MEM seed is introduced, and the computation modes for merging and filtering seeds in different situations are proposed. We benchmark deBGA on datasets of multiple genomes from the same species and from different species, and on real and simulated human genome datasets. We analyze the alignment results of deBGA on the different datasets in comparison with other aligners under various parameters. Next, we observe the effect of deBGA on subsequent variant calling. All of the results show that the RdBG structure-based sequence alignment method achieves better accuracy, sensitivity, and speed. deBGA can be treated as a candidate tool for genomic read mapping.

(4) A method of variation-aware read alignment is proposed. First, we design a pseudo tree structure, composed of all local sequences and their associated variants, to support the extension task. Then we propose VARA, a pseudo tree structure-based local sequence alignment algorithm that builds on the Landau-Vishkin method. Compared with traditional Variant Graph-based methods, which always have a huge memory consumption, VARA provides a lightweight solution for integrating variant knowledge into the mapping process. We develop a novel variation-aware read mapping tool, deBGA-VARA, by integrating VARA into deBGA. While remaining efficient, VARA achieves better accuracy and sensitivity than other state-of-the-art read aligners.

This thesis comprehensively summarizes the methods of genomic read alignment and provides a de Bruijn graph-based genome index to organize the repetitive sequences of genomes. To solve the memory bottleneck of building a de Bruijn graph model for big data, an external sorting-based graph construction algorithm is proposed, which is very important to research on graph indexing and assembly methods. This thesis also presents a graph-based sequence alignment algorithm, which a large number of experiments have shown to work well on various datasets and which has high practical significance. To further improve alignment accuracy and sensitivity, a novel local sequence alignment method that integrates variant information is proposed, which has great theoretical and practical value for research on Variant Graphs and sequence alignment algorithms.
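
To illustrate why a de Bruijn graph index can reduce the number of candidate seeds, the following is a minimal Python sketch, not the RdBG or deBGA implementation. The function names `build_kmer_graph` and `seed_hits`, the toy reference string, and the choice of k = 4 are assumptions made for this example only. It shows that every copy of a repeated k-mer collapses into a single graph node that records all of its reference positions, so a seed lookup touches one node rather than one entry per repeat copy.

```python
from collections import defaultdict


def build_kmer_graph(reference, k):
    """Build a toy de Bruijn graph over the k-mers of `reference`.

    Returns (positions, successors):
      positions:  k-mer -> list of start offsets in the reference
      successors: k-mer -> set of k-mers that directly follow it
    """
    positions = defaultdict(list)
    successors = defaultdict(set)
    for i in range(len(reference) - k + 1):
        kmer = reference[i:i + k]
        positions[kmer].append(i)
        # The next k-mer overlaps this one by k-1 characters.
        if i + k < len(reference):
            successors[kmer].add(reference[i + 1:i + k + 1])
    return dict(positions), dict(successors)


def seed_hits(read, positions, k):
    """Look up each k-mer of `read`; return {read_offset: [ref_offsets]}."""
    hits = {}
    for j in range(len(read) - k + 1):
        kmer = read[j:j + k]
        if kmer in positions:
            hits[j] = positions[kmer]
    return hits


# Toy reference in which the 4-mer "ACGT" occurs twice (hypothetical data).
reference = "ACGTACGTTT"
positions, successors = build_kmer_graph(reference, 4)

# The repeated k-mer is a single node carrying both reference offsets,
# and it branches into two different successors.
print(positions["ACGT"])
print(sorted(successors["ACGT"]))
print(seed_hits("TACGT", positions, 4))
```

In a real index, runs of non-branching nodes would additionally be compacted into unitigs, as the abstract describes for deGSM; that step is omitted here to keep the sketch short.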
Keywords/Search Tags:high-throughput sequencing data analysis, read alignment, genome sequence index construction, de Bruijn graph