Simulation And Application Of Genomic Structure Variation Research Based On Next Generation Sequencing

Posted on: 2020-01-28
Degree: Master
Type: Thesis
Country: China
Candidate: M H Gao
GTID: 2370330602950552
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of next-generation sequencing in recent years, many detection algorithms have been proposed to identify genome structural variations. Simulating genome sequences with structural variations and generating reads with real sequencing features provides a necessary benchmark for these calling algorithms. Genomic structural variations and single-nucleotide polymorphisms (SNPs) constitute the vast majority of human genome variations. Simulating these variations in the human genome and generating the corresponding sequencing reads makes it possible to assess the performance of alignment and detection algorithms. However, it is not easy to capture the actual features and achieve variation simulation, owing to the complexity of structural variations and the uncertainty of read generation in real sequencing data. None of the existing simulation algorithms can simulate all the features related to actual sequencing data. To overcome this limitation, this thesis proposes a new simulation algorithm, SVSR, which integrates several key features related to major genome variations and real sequencing data. SVSR can simulate five common genome structural variations (insertions and deletions, tandem duplications, copy number variations, inversions and translocations) as well as single-nucleotide polymorphisms, and it can generate reads based on four popular sequencing platforms (Illumina, SOLiD, Roche 454 and Ion Torrent).

The implementation of SVSR includes the following two parts.

(1) SVSR proposes a novel simulation algorithm based on genomic structural variation. A variety of complex genomic variations are modeled and analyzed using a hotspot distribution model, a selection model and a tumor heterogeneity model. First, SNPs are simulated. The distribution of variation hot spots, the homozygous/heterozygous ratio and the transition/transversion ratio are analyzed. Second, insertions and deletions (indels) are simulated. The distribution of variation hot spots, the indels of different variant lengths, the different variant types and the different sources of the insertion data are analyzed. Third, copy number variations (CNVs) are simulated. The transition probability between the mutation states is analyzed, and the selection model is used to determine the probability value. Fourth, tandem duplications are simulated. Two duplication rules are used to generate repeat sequences. Fifth, inversions and translocations are simulated. Specific variations are simulated on demand. Finally, by combining the above simulation parts, germline and somatic variations can be simulated to generate specific heterogeneous tumor data.

(2) SVSR proposes a read generation algorithm based on real sequencing data. A quality value distribution model and a GC bias model are used to model the read generation process at a specified tumor purity, producing sequencing reads for normal samples and multiple tumor samples. First, sequencing parameters such as read length, insert size and sequencing depth are determined. Specific values are set for each sequencing platform. Second, the sequencing quality value distribution and sequencing errors are analyzed. SVSR determines the distribution of quality values by training on real data from each sequencing platform and simulates the real sequencing error rate with an error model. Third, GC bias is analyzed. GC bias refers to the deviation of the number of sequencing reads from the expected sequencing depth, caused by variation in the GC content of DNA fragments. SVSR uses a linear relationship to simulate this deviation. During read generation, specific normal samples, tumor samples or mixture samples are generated by training the quality values on real data and setting appropriate features.

In summary, SVSR is a powerful simulation tool that integrates variation simulation and read generation. It can simulate normal samples and tumor samples with multiple variations and generate sequencing reads for the corresponding sequencing platforms. The experimental results show that SVSR can simulate more realistic data characteristics within a reasonable range of sequencing quality and has several advantages: (1) it simulates many types of variation (six kinds of genomic variations); (2) it takes into account the distribution of variation hot spots, the homozygous/heterozygous ratio and the transition/transversion ratio; (3) it considers the different sources of insertion data; (4) it simulates tumor heterogeneity and tumor purity; (5) it simulates GC bias and the quality distribution during sequencing. In short, SVSR has a unique ability to simulate complex structural variations and generate various sequencing reads. It can be used as a complement to existing simulation tools, and also as a benchmark for mutation detection and alignment algorithms. This will help users choose appropriate methods to meet their requirements and help researchers develop more powerful mutation detection and alignment algorithms based on an understanding of the shortcomings of existing methods.
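
The abstract states that SVSR models GC bias with a linear relationship but gives no formula. The following Python sketch illustrates one way such a linear model could scale the expected read count of a fragment. The function names, the slope value and the reference GC content of 0.5 are all hypothetical and are not taken from SVSR itself.

```python
def gc_content(fragment: str) -> float:
    """Fraction of G/C bases in a fragment (0.0 for an empty string)."""
    if not fragment:
        return 0.0
    upper = fragment.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def gc_coverage_factor(gc: float, slope: float = -1.0,
                       reference_gc: float = 0.5) -> float:
    """Relative coverage as a linear function of GC content, floored at 0.

    The slope and reference point are illustrative assumptions,
    not parameters reported for SVSR.
    """
    return max(0.0, 1.0 + slope * (gc - reference_gc))


def expected_read_count(fragment: str, depth: float, read_length: int,
                        slope: float = -1.0) -> float:
    """Expected number of reads for a fragment at a nominal depth,
    scaled by the linear GC bias factor."""
    unbiased = depth * len(fragment) / read_length
    return unbiased * gc_coverage_factor(gc_content(fragment), slope)
```

With these assumed parameters, a fragment at the reference GC content keeps its nominal coverage, while a GC-rich fragment receives proportionally fewer reads. In practice the slope would have to be fitted to real data from each sequencing platform.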
Keywords/Search Tags: genome structural variations, variation hot spots, tumor heterogeneity, selection model, reads generation, tumor purity, quality distribution