
A Method For Detection Of Somatic Single Nucleotide Variants Based On Linkage Disequilibrium

Posted on: 2023-07-07
Degree: Master
Type: Thesis
Country: China
Candidate: W X Chen
Full Text: PDF
GTID: 2530306911984149
Subject: Engineering
Abstract/Summary:
Single nucleotide variants (SNVs) are among the most common types of gene mutation and take two forms: germline variants and somatic variants, which are the root causes of genetic diseases and of various acquired cancers, respectively. Somatic SNV detection therefore provides important information for the pathological analysis and personalized treatment of cancer, and it has become an important topic in cancer genome research. Next-generation sequencing (NGS) now provides huge amounts of high-resolution genome data, and many SNV detection methods have been developed, but few adapt well to varied scenarios, and there is still much room for improvement in the accurate detection of somatic SNVs. The main challenges at present are how to accurately distinguish SNVs with low allele frequency from artifacts such as background noise and alignment errors, and how to distinguish somatic SNVs from germline SNVs with similar allele frequencies. Both require more sensitive statistical models and detection techniques.

This thesis proposes LDSSNV, a somatic SNV detection method based on linkage disequilibrium (LD). LD is a linkage relationship between mutations under which they are no longer completely independent and random, and it is a property unique to germline variants. LDSSNV first extracts candidate SNVs, which include both true SNVs and various artifacts. For each candidate locus it then extracts five SNV-related features (read depth, allele frequency, copy number, number of mismatched reads, and summed mapping quality of mismatched reads) and builds an extreme gradient boosting (XGBoost) model to predict all SNVs. Finally, in a single-sample mode and a multi-sample mode, it designs and calculates LD-based indexes and builds an XGBoost classification model to distinguish somatic SNVs from germline SNVs. The multi-sample classification mode measures LD by quantifying the frequencies at which the two forms of SNVs are present in samples, and it can simultaneously discriminate somatic SNVs from germline SNVs in multiple tumor samples from the same population. The single-sample classification mode similarly measures LD by quantifying the frequencies at which the two forms of SNVs are present on sequencing reads, and it discriminates somatic SNVs from germline SNVs in a single tumor sample.

To verify the performance of LDSSNV, several multi-sample datasets with LD characteristics are simulated, and real datasets from some tuberculosis patients are obtained. Experiments are conducted on each separately, and four existing methods are run for comparison. The simulation results show that LDSSNV achieves a balance between precision and sensitivity, and that both its multi-sample mode and its single-sample mode outperform the other methods in F1-score, especially for samples with low tumor purity. The results on real data show that the multi-sample and single-sample modes of LDSSNV complement each other and detect more somatic SNVs that overlap with those reported by the other methods. These experimental results validate the effectiveness of LDSSNV. We expect that LDSSNV can be used as a routine method for detecting SNVs in somatic cells.
Keywords/Search Tags: SNV, Linkage Disequilibrium, XGBoost, Next-generation Sequencing, Allele Frequency
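
The abstract does not define the LD-based indexes that LDSSNV computes. As a rough illustration of the general idea of measuring LD from co-occurrence frequencies, the sketch below computes the textbook pairwise statistics D and r² for two variant sites. It is an assumption-laden example, not the thesis's method: the function name `ld_stats` and the 0/1 encoding are hypothetical, and each observation may stand for either a sample (as in the multi-sample mode) or a sequencing read covering both sites (as in the single-sample mode).

```python
from typing import Sequence, Tuple


def ld_stats(site_a: Sequence[int], site_b: Sequence[int]) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic sites.

    site_a[i] and site_b[i] are 1 if observation i (a sample or a read,
    an assumption of this sketch) carries the variant allele at that
    site, and 0 otherwise.
    """
    if len(site_a) != len(site_b) or len(site_a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(site_a)
    p_a = sum(site_a) / n  # variant frequency at site A
    p_b = sum(site_b) / n  # variant frequency at site B
    # Frequency of observations carrying the variant at both sites.
    p_ab = sum(1 for a, b in zip(site_a, site_b) if a and b) / n

    # Standard LD coefficient: deviation from independence.
    d = p_ab - p_a * p_b

    # r^2 normalises D^2 by the allele-frequency variances; it is
    # undefined when either site is monomorphic, so return 0.0 then.
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = (d * d) / denom if denom > 0 else 0.0
    return d, r_squared


if __name__ == "__main__":
    # Two sites that always co-occur versus two that vary independently.
    print(ld_stats([1, 1, 0, 0], [1, 1, 0, 0]))
    print(ld_stats([1, 1, 0, 0], [1, 0, 1, 0]))
```

Under this reading, a pair of sites with r² near 1 behaves like linked germline variants, whereas a candidate SNV with r² near 0 against its neighbours shows no such linkage. How LDSSNV turns such quantities into features for its XGBoost classifier is not stated in the abstract.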