Detection Of Copy Number Variants Based On Genome Sequencing Data

Posted on: 2018-12-06
Degree: Doctor
Type: Dissertation
Country: China
Candidate: R J Tan
Full Text: PDF
GTID: 1360330566498882
Subject: Computer application technology
Abstract/Summary:
With the rapid development of genome sequencing technology, personal genome sequencing has gradually become one of the main approaches to diagnosing diseases, developing treatments, managing health and exploring the mysteries of life. It has greatly promoted the development of genetics, genomics, medical science and other related areas. Meanwhile, a growing body of research has shown that copy number variation (CNV), an important type of structural variation, is closely related to evolution, biodiversity, and a variety of complex and rare diseases. Studying CNVs is therefore very important for exploring the natural laws of organisms, revealing the mysteries of life, understanding the mechanisms of disease, and finding the pathogenic targets of diseases. However, owing to the high complexity of the human genome, the large volume of sequencing data and the technical limitations of current sequencing technology, identifying and analysing copy number variants (CNVs) quickly and effectively remains a great challenge.

This thesis focuses on the detection of CNVs from genome sequencing data. Its goal is to evaluate whole-exome sequencing (WES) CNV calling methods and to develop new methods that achieve higher sensitivity and specificity than current algorithms. It also provides a new entropy-based method to detect and analyse duplicated sequences in the human genome. The main research contents of this thesis are as follows.

First, the performance of current WES CNV calling algorithms on real sequencing data is unclear; in particular, there is at present no systematic evaluation criterion. In this thesis, a series of WES CNV evaluation measures is proposed, and four current WES CNV calling methods are evaluated against them. This evaluation study can provide a theoretical basis for the varied experiments of scientists in different fields, and it lays the foundation for developing new WES CNV calling methods in the future.

Second, because the results of existing WES CNV calling methods are not ideal, a pooled-sample based WES CNV calling method is proposed. This method first uses principal component analysis (PCA) to denoise the WES data. It then integrates read depth (RD) and SNV information as the paired input signals of a hidden Markov model (HMM).

Third, to further improve the efficiency of identifying CNVs from WES data, a hybrid approach to CNV detection from WES data is proposed. First, a single-sample based WES CNV calling method is proposed, which aims to avoid the excessive noise reduction of the pooled-sample based model. The single-sample based model uses a median method to normalize biases from known sources and fits the normalized RD signal with a negative binomial distribution. Then, a paired HMM is proposed to identify CNVs from RD and SNV information. Finally, a merging algorithm is proposed to integrate the results of the pooled-sample based method and the single-sample based method into the final CNV result.

Fourth, a generalized topological entropy is proposed to analyse duplicated genome sequences. The relationship between generalized topological entropy and topological entropy is proved mathematically. The generalized topological entropy is applied to analyse genomic elements, segmental duplications in the human reference genome and short tandem repeats in personal genomes. This offers a new dimension from which to view and understand duplicated genome sequences, and it also supplies a new idea and method for precisely identifying copy number duplications in the future.

In conclusion, this thesis provides a series of comprehensive and objective criteria for evaluating CNV results identified from WES data. A new pooled-sample based method and a new hybrid approach to WES CNV calling are proposed, both of which integrate RD and SNV information into a paired HMM. These two methods achieve better sensitivity and specificity, with high practical significance and application value. A generalized topological entropy based method for detecting duplicated genome sequences is proposed and applied to genomic elements, segmental duplications and short tandem repeats, which has theoretical and practical significance.
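The abstract does not give the definition of the generalized topological entropy, so it cannot be reproduced here. As a rough illustration of the underlying idea, the sketch below computes only the classical topological entropy of a finite DNA sequence, following the commonly cited definition: pick the largest subword length n with 4^n + n - 1 <= |w|, keep the first 4^n + n - 1 letters, and take log_4 of the number of distinct length-n subwords, divided by n. The function name `topological_entropy` and the assumption that the input contains only the letters A, C, G and T are illustrative choices, not taken from the thesis.

```python
import math


def topological_entropy(seq: str, alphabet_size: int = 4) -> float:
    """Classical topological entropy of a finite sequence.

    Illustrative sketch only; this is NOT the generalized topological
    entropy proposed in the thesis, whose definition the abstract omits.
    Assumes seq uses at most `alphabet_size` distinct letters (e.g. ACGT).
    """
    length = len(seq)

    # Find the largest n such that alphabet_size**n + n - 1 <= len(seq).
    n = 0
    while alphabet_size ** (n + 1) + (n + 1) - 1 <= length:
        n += 1
    if n == 0:
        raise ValueError("sequence too short: need at least alphabet_size letters")

    # Truncate to a fixed-length prefix so that sequences of different
    # lengths are compared on the same footing.
    prefix = seq[: alphabet_size ** n + n - 1]

    # Count the distinct subwords of length n in the prefix.
    distinct = {prefix[i : i + n] for i in range(len(prefix) - n + 1)}

    # Normalise so the value lies between 0 and 1.
    return math.log(len(distinct), alphabet_size) / n


if __name__ == "__main__":
    # A pure repeat contains a single distinct subword of each length.
    print(topological_entropy("A" * 20))
    # A 17-letter sequence containing every possible 2-letter subword once.
    print(topological_entropy("AACAGATCCGCTGGTTA"))
```

Under this definition a highly repetitive sequence scores near 0 and a sequence that uses almost every possible subword scores near 1, which is why an entropy of this kind can serve as a signal for duplicated sequence.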
Keywords/Search Tags: genome sequencing, whole-exome sequencing, copy number variation, hidden Markov model, generalized topological entropy