Detection of Genome Variants Based on High-Throughput Sequencing Data

Posted on: 2017-12-08
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Z Liu
Full Text: PDF
GTID: 1310330536981008
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of high-throughput sequencing over the past decade, personal genome sequencing has been widely used in basic medical science, clinical diagnosis and treatment, health management, and new drug discovery, and has greatly promoted the development of these fields. Personal genome sequencing data are fragmented, high-volume, and highly complex, so their analysis faces great challenges. Genome variants are the differences among the DNA sequences of different individuals, and they can cause different phenotypes and diseases. Detecting the different types of genome variants from personal genome sequencing data is of great significance to the wide application of personal genome sequencing. Detection of genome variants from high-throughput sequencing data is an active and challenging topic in computer science and bioinformatics. New algorithms, software, and tools for detecting different types of genome variants from different types of personal genome sequencing data emerge continually. However, existing genome variant calling methods generally achieve low accuracy, which limits the wide application of personal genome sequencing. This thesis focuses on genome variant calling on tumor-normal pair genome sequencing data and family genome sequencing data, and proposes detection methods for several types of genome variants that are difficult to detect, with the aim of improving detection accuracy. The main contents are as follows.

(1) Existing read depth-based methods of copy number variation detection usually cannot accurately model the read depth distribution. In this thesis, a probabilistic model of read depth based on negative binomial regression is proposed. This model captures the overdispersion in the read depth distribution and reflects the effect of GC content and mappability on read depth. It can be used for copy number variation detection from single-sample genomes, population genomes, tumor-normal pair genomes, and family genomes.

(2) Existing methods of copy number variation detection from tumor-normal pair genomes cannot detect germline copy number variations and somatic copy number alterations simultaneously, and usually achieve low accuracy for both. In this thesis, a hidden Markov model-based method of copy number variation detection from tumor-normal pair genomes is proposed. In this method, the combination of copy number states in the tumor cell and the normal cell at one position is defined as the hidden state. The emission probability is calculated from the beta-binomial allele frequency probability model and the negative binomial read depth probability model, and it accounts for tumor impurity and aneuploidy. The transition probability is calculated according to the state transitions of germline copy number variations and of somatic copy number alterations. Finally, the Viterbi algorithm is used to infer the most likely sequence of hidden states, from which germline copy number variations and somatic copy number alterations are detected.

(3) Existing methods of copy number variation detection from single-sample genomes and population genomes usually achieve low accuracy on family genome sequencing data. In this thesis, a hidden Markov model-based method of copy number variation detection from family genomes is proposed. In this method, the combination of copy number states in a parent-offspring trio at one position is defined as the hidden state. The emission probability is calculated from the negative binomial read depth probability model. The transition probability is calculated according to copy number inheritance probabilities under Mendelian inheritance and the occurrence of de novo events. Finally, the Viterbi algorithm is used to infer the most likely sequence of hidden states, from which inherited copy number variations and de novo copy number variations are detected.

(4) Existing methods of de novo mutation detection from family genomes cannot handle regions with alignment errors, so the false positive rate remains very high. In this thesis, a gradient boosting-based de novo mutation filtering method is proposed. This method can be applied to the results of common de novo mutation detection methods to significantly reduce the false positive rate without sacrificing sensitivity.
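
The combination of a negative binomial read depth model with Viterbi decoding, as described in items (1) to (3), can be illustrated with a minimal single-sample sketch. The abstract does not give the model equations, so everything specific here is an assumption made for illustration: the mean/size parameterization of the negative binomial, the log-link regression on GC content and mappability, the "sticky" transition matrix, the copy number range 0 to 4, and all function and parameter names. The thesis methods use joint hidden states (tumor and normal, or a parent-offspring trio) and additional emission terms that this sketch does not attempt to reproduce.

```python
import math


def nb_log_pmf(k, mean, size):
    """Log-probability of count k under a negative binomial with the given
    mean and size (inverse dispersion); variance = mean + mean**2 / size."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mean))
            + k * math.log(mean / (size + mean)))


def expected_depth(copy_number, gc, mappability, coef):
    """Assumed log-link regression for the diploid mean depth, scaled by
    copy number. coef = (intercept, gc_slope, mappability_slope) holds
    hypothetical values, not parameters from the thesis."""
    b0, b_gc, b_map = coef
    diploid_mean = math.exp(b0 + b_gc * gc + b_map * mappability)
    # A small floor keeps the mean positive for copy number 0 (stray reads).
    return diploid_mean * max(copy_number, 0.01) / 2.0


def viterbi_copy_number(depths, gcs, maps, coef, size,
                        states=(0, 1, 2, 3, 4), stay=0.99):
    """Most likely copy number per window under a simple HMM whose
    transitions favor staying in the same state (an assumption)."""
    n = len(states)
    log_stay = math.log(stay)
    log_move = math.log((1.0 - stay) / (n - 1))

    def emit(t, cn):
        mean = expected_depth(cn, gcs[t], maps[t], coef)
        return nb_log_pmf(depths[t], mean, size)

    # Uniform prior over states for the first window.
    score = [math.log(1.0 / n) + emit(0, cn) for cn in states]
    back = []
    for t in range(1, len(depths)):
        new_score, pointers = [], []
        for j, cn in enumerate(states):
            # Best predecessor state for state j at window t.
            best_i = max(range(n), key=lambda i: score[i]
                         + (log_stay if i == j else log_move))
            trans = log_stay if best_i == j else log_move
            new_score.append(score[best_i] + trans + emit(t, cn))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    idx = max(range(n), key=lambda i: score[i])
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [states[i] for i in path]


# Toy input: about 30x diploid depth with a run of low-depth windows.
depths = [30, 28, 33, 31, 15, 14, 16, 13, 15, 14, 29, 32, 30]
gcs = [0.4] * len(depths)
maps = [1.0] * len(depths)
calls = viterbi_copy_number(depths, gcs, maps,
                            coef=(math.log(30.0), 0.0, 0.0), size=50.0)
print(calls)
```

In practice the regression coefficients and the size parameter would be fitted to the data rather than set by hand, and extending the state to a tumor-normal pair or a trio means replacing the single copy number with a tuple and adjusting the emission and transition terms accordingly.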
Keywords/Search Tags: high-throughput sequencing, genome variant calling, copy number variation detection, de novo mutation filtering, hidden Markov model