
Detection Algorithms Of Genomic Copy Number Variation Based On Low Coverage Sequencing Data

Posted on: 2021-04-23
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Y Li
GTID: 1480306050463654
Subject: Computer application technology
Abstract/Summary:
Copy number variation (CNV) is an important form of structural variation. It carries a large amount of genetic information and plays an important role in human genetic diseases, rare diseases, tumors, and other complex diseases. The research significance of CNV detection differs slightly across sample scenarios (multiple samples, a pair of matched samples, and a single sample). Detecting recurrent CNVs across multiple samples helps the study of human population genetics. Detecting CNVs from a pair of matched samples is of great significance for studying the occurrence, development mechanisms, and targeted drug treatment of diseases such as tumors. In addition, when control samples are lacking, detecting CNVs from a single sample can provide a clinical auxiliary means of finding the pathogenic genes of rare diseases.

Next-generation sequencing has become the main platform for analyzing genomic variation because of its high throughput and high speed, but its cost rises with sequencing coverage. To control cost, low coverage sequencing data are often used for genome-wide CNV analysis. However, the read depth signal from low coverage data is very sensitive to systematic noise and sequence alignment bias, which may lead to more false positive calls in CNV detection. The current challenge is how to detect CNVs accurately and at high resolution from low coverage sequencing data. In this dissertation, based on low coverage sequencing data, we propose a series of solutions and tools for the three sample scenarios above, which improve the accuracy of the detection results and reduce false positive calls. The dissertation consists of the following three works.

(1) A method, SM-RCNV, is proposed to detect regions of recurrent CNV across multiple samples based on the correlation of genomic loci. To address the high false-positive rates of existing CNV detection algorithms on low coverage data, we consider the correlation characteristics of the internal structure of CNVs, build a new statistic that combines these correlation characteristics with the traditional sequencing read depth signal, and use a permutation test to determine the regions of recurrent CNV across multiple samples. The statistic is the weighted sum of the correlation of genomic loci and the read depth signal at those loci. To determine the weights, we divide sequencing data that have a standard benchmark into CNV regions and non-CNV regions and solve for the weights with Fisher discriminant analysis. Compared with existing methods, SM-RCNV has high sensitivity and specificity.

(2) By studying the distribution of the ratio of read depth signals between diseased and matched normal samples, we propose a new method, BagGMM, which detects CNVs with a Gaussian mixture model of the read depth ratios. The main idea of the method is as follows: 1) A large sliding window is first used to divide the genome into segments to improve efficiency, and a small sliding window is then applied to re-segment the large segments with high variance to ensure the accuracy of variation boundaries. This gives a "large window first, then small window" segmentation strategy; 2) We model the read depth ratios of these segments with a 3-Gaussian mixture model, in which the three Gaussian distributions represent the three copy number states (deletion, normal, and amplification); 3) To reduce false positive calls, we use the bagging algorithm from machine learning to construct multiple 3-Gaussian mixture models and summarize their detection results. In terms of sensitivity and specificity, BagGMM gives stable and efficient detection results and outperforms the four mainstream algorithms it is compared with, regardless of changes in sequencing coverage and CNV distribution in the data, and especially on low coverage sequencing data. In addition, we apply the proposed method to breast cancer patients and ovarian cancer patients and reach the same conclusion as in the simulation.

(3) An algorithm, dpGMM, is proposed that constructs a Dirichlet process Gaussian mixture model on a two-dimensional profile combining read depth and genome position, fully considering the influence of sequencing biases on CNV detection in low coverage sequencing data from a single sample: 1) The various biases in sequencing data, such as sequencing bias, alignment bias, and GC bias, are corrected first, and the read depth signal is smoothed; 2) Considering the importance of genome position, each read depth value is combined with its corresponding genome position, so that the one-dimensional read depth signal is transformed into a two-dimensional profile reflecting the amplitude and the location of copy number, respectively. In this way, accuracy is further improved by analyzing the read depth data from both the horizontal and vertical perspectives; 3) Assuming that the sequencing data are a mixture of multiple copy numbers, where each copy number is regarded as a Gaussian model, we construct a Gaussian mixture model for the two-dimensional read depth signal. Instead of assuming the number of Gaussian components, we use a Dirichlet process as the prior distribution, which improves accuracy and reduces the false positive rate. Compared with existing methods, dpGMM consistently has high sensitivity and high specificity.

In summary, we propose three CNV detection methods suited to three corresponding sample scenarios: multiple samples, a pair of matched samples, and a single sample. The sensitivity and specificity of these three methods are not affected by sequencing coverage. In particular, the three methods maintain stable performance on low coverage data and have clinical application value.
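
As an illustration of the second work (the bagged 3-Gaussian mixture approach for matched samples), the three steps can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the window sizes, the variance threshold, the initial means (0.5, 1.0, 1.5), the standard deviation floor, the plain EM fitting, and the use of a majority vote to "summarize" the bagged models are all illustrative choices that the abstract does not specify, and every function name here is hypothetical.

```python
import math
import random
from collections import Counter
from statistics import mean, pvariance

STATES = ("deletion", "normal", "amplification")


def two_stage_segments(ratios, large=10, small=2, var_threshold=0.01):
    """Large windows first; only high-variance windows are re-split into small ones."""
    segments = []
    for start in range(0, len(ratios), large):
        end = min(start + large, len(ratios))
        if pvariance(ratios[start:end]) > var_threshold:
            for s in range(start, end, small):
                e = min(s + small, end)
                segments.append((s, e, mean(ratios[s:e])))
        else:
            segments.append((start, end, mean(ratios[start:end])))
    return segments


def _pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def fit_gmm3(values, iters=50, min_sd=0.05):
    """Plain EM for a 1-D three-component Gaussian mixture (assumed initial values)."""
    w, mu, sd = [1 / 3] * 3, [0.5, 1.0, 1.5], [0.2] * 3
    for _ in range(iters):
        resp = []
        for x in values:  # E-step: posterior probability of each component
            p = [w[k] * _pdf(x, mu[k], sd[k]) for k in range(3)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        for k in range(3):  # M-step: re-estimate weight, mean, and spread
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue  # an empty component keeps its previous parameters
            w[k] = nk / len(values)
            mu[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, values)) / nk
            sd[k] = max(math.sqrt(var), min_sd)
    # Order components by mean so that index 0/1/2 maps onto STATES.
    return sorted(zip(mu, sd, w))


def classify(x, model):
    """Index of the component with the highest weighted density at x."""
    scores = [w * _pdf(x, mu, sd) for mu, sd, w in model]
    return scores.index(max(scores))


def bag_gmm_calls(ratios, n_models=25, seed=0):
    """Fit one mixture per bootstrap resample and majority-vote each segment's state."""
    segments = two_stage_segments(ratios)
    if not segments:
        return []
    values = [m for _, _, m in segments]
    rng = random.Random(seed)
    votes = [Counter() for _ in segments]
    for _ in range(n_models):
        model = fit_gmm3(rng.choices(values, k=len(values)))  # bootstrap resample
        for i, x in enumerate(values):
            votes[i][classify(x, model)] += 1
    return [(s, e, STATES[v.most_common(1)[0][0]])
            for (s, e, _), v in zip(segments, votes)]
```

Calling `bag_gmm_calls(ratios)` on a list of tumor/normal read depth ratios (one value per bin) returns `(start, end, state)` tuples. Sorting the fitted components by mean only maps them onto deletion, normal, and amplification when all three states are actually present in the data; a real tool would need a safeguard for samples with no CNV of one type, and would usually also correct the ratios for bias before this step.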
Keywords/Search Tags: Copy Number Variation, Low Coverage Sequencing, Read Depth, Gaussian Mixture Model, Biases