
Detection Of Tumor Copy Number Variation And Inference Of Subclonal Populations Based On Next-generation Sequencing Data

Posted on: 2023-03-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Guo
Full Text: PDF
GTID: 1520306917479734
Subject: Computer application technology
Abstract/Summary:
Cancer is a disease caused by the abnormal, malignant proliferation of cells and by genetic alterations. With the development of statistical theory and genome sequencing technology, exploring the evolutionary history of tumors has become an active research topic. According to the tumor evolution hypothesis, tumor cells descend from a single original cell, and gene mutations that induce the expression of cancer driver genes give rise to new tumor cells. In this process, somatic mutations provide important clues for the study of tumor evolution and tumor heterogeneity. Inferring subclonal populations from large volumes of genomic data is therefore important for tumor heterogeneity analysis, the study of tumor evolution, cancer prognosis prediction, and clinical treatment.

Subclonal population inference consists of four steps: variant allele frequency detection, copy number variation detection, variation ratio correction, and subclonal population clustering. Most existing methods implement only the last two steps. However, the choice of copy number variation detection method has a great influence on tumor evolution analysis. In addition, subclonal population inference requires mutation information of high quantity and quality. The sequencing data used by existing methods has a coverage of at least 20x, and computing copy number variation and subclonal population information from such data takes a long time. A copy number variation detection and subclonal population inference method that is both accurate and fast is therefore needed. This dissertation designs a copy number variation detection and subclonal population inference pipeline to address these issues. The main work comprises the following four aspects:

1. A copy number segmentation method based on wavelet clustering is designed. Existing segmentation methods based on circular binary segmentation, mean shift, or outlier detection have certain limitations. Methods based on circular binary segmentation and mean shift ignore segments with small copy number changes or short length, and outlier detection algorithms cannot filter out the influence of centromeres and alleles. To solve these issues, this dissertation uses a wavelet clustering algorithm that remains sensitive to short copy number variation segments and to minor changes. Compared with other algorithms, the wavelet clustering method separates read depth anomalies caused by centromeres and alleles from genuinely variant regions, which avoids false detections. Experimental results show that the method achieves the highest sensitivity at the same accuracy as existing copy number detection algorithms. It also maintains high sensitivity to cancer driver genes.

2. A copy number variation detection method based on a dynamic interval is designed. Most existing copy number variation detection methods do not balance the amplitude difference between gain and loss. Unbalanced signals degrade detection and produce more false positives, that is, lower accuracy. To solve these issues, this dissertation designs a detection method that calculates the density of window read depths within a statistical interval of dynamic width and identifies the start and end positions of a copy number variation from changes in that density. The method is fused with wavelet clustering: within the regions detected by wavelet clustering, the dynamic interval method determines the start and end positions of each copy number variation. Experiments show that on simulated data the fused results achieve high accuracy while preserving the original sensitivity. On real data, the fused results have the highest accuracy and sensitivity. On cancer data, the fused detection results lie closest to the locations of cancer driver genes, with the smallest error.

3. A correction method for the tumor cell fraction is designed. Existing correction methods generally rely on strict restrictions on the mutation multiplicity of single-base point mutations. This strategy carries mutation multiplicity that has not been corrected for copy number variation into subsequent calculations, which may introduce unexpected bias into the analysis. A new formula for calculating the cancer cell fraction is proposed. Compared with the original formula, the new calculation yields far fewer regions in which the mutation multiplicity exceeds the absolute copy number, and it adapts to changes in tumor purity and variant allele frequency. The resulting tumor cell fraction is also more realistic than that of the original formula, and the new formula applies to all purities and variant allele frequencies. Experimental results show that the error of the new tumor cell fraction is generally smaller than that of the original method and that its values are closer to the ground truth.

4. A subclonal population inference algorithm based on Elastic Net is designed. Existing subclonal population inference methods are computationally expensive. Most of them preset the number of subclones to simplify the calculation and reduce running time. However, fixing the number of populations in advance has a great impact on the results. Moreover, existing methods have not examined whether correlation between somatic mutations on the same read affects the calculation of the tumor cell fraction. The Elastic Net based algorithm uses the Elastic Net penalty term to generate the relationship matrix between variant allele frequency (VAF) and cancer cell fraction (CCF). The new method improves accuracy while maintaining the same computational efficiency as existing methods. Experiments show that the algorithm performs well on different cancer data.
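
The abstract does not state the new cancer cell fraction formula, so the sketch below shows only the conventional relation that is widely used for this correction step. It is offered as background, not as the dissertation's method. It assumes a mutation's expected VAF equals the mutated copies in tumor cells divided by all copies in the sample. The function name `cancer_cell_fraction` and its parameters are hypothetical, chosen for illustration.

```python
def cancer_cell_fraction(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Conventional VAF-to-CCF conversion (illustrative, not the dissertation's formula).

    Assumes: expected VAF = purity * multiplicity * CCF
                            / (purity * tumor_cn + (1 - purity) * normal_cn)
    and solves that relation for CCF.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per cell in the mixed sample.
    average_cn = purity * tumor_cn + (1 - purity) * normal_cn
    return vaf * average_cn / (purity * multiplicity)


# A heterozygous clonal mutation in a diploid region of a 50% pure sample
# is expected at VAF 0.25, which maps back to a CCF of 1.0.
ccf = cancer_cell_fraction(vaf=0.25, purity=0.5, tumor_cn=2, multiplicity=1)
```

Under this conventional relation, an overestimated multiplicity or a wrong copy number directly distorts the CCF, which is consistent with the abstract's point that uncorrected multiplicity can bias downstream analysis.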
Keywords/Search Tags: Next-Generation Sequencing, Whole Genome Sequencing, Copy Number Variation, Subclonal Population, Tumor Heterogeneity