Detection Of Exome Copy Number Variation Based On Hidden Markov Model

Posted on: 2016-11-23 | Degree: Master | Type: Thesis
Country: China | Candidate: C Guo | Full Text: PDF
GTID: 2284330470457905 | Subject: Biomedical engineering
Abstract/Summary:
Copy number variations (CNVs) play a crucial role in both cancer and non-cancer diseases. The development of next-generation sequencing (NGS), including whole-genome sequencing (WGS), whole-transcriptome sequencing (RNA-seq) and whole-exome sequencing (WES), has provided powerful platforms and techniques for detecting CNVs. Read depth is an important and widely used signal in the analysis of sequencing-based experiments. However, detecting CNVs from raw read depth is challenging because the signal is distorted by guanine-cytosine (GC) content, mappability and exon length. In addition, exons are sparsely and non-uniformly distributed across the genome. In this study, we propose a new method for detecting CNVs from WES data. First, we define a new signal, relative read depth (RRD), and explore its statistical properties. We find that RRD can be modeled well by an empirical formula, which makes statistical modeling and parameter optimization much easier. Moreover, compared with raw read depth, RRD shows little correlation with bias sources such as GC content, mappability and exon length, which makes it easier to use. On the basis of RRD, we build a hidden Markov model (HMM) and use the expectation-maximization (EM) algorithm to optimize its parameters. Finally, we use the Viterbi algorithm to estimate the copy number of each exon and call the CNV regions. To provide a useful tool for researchers interested in finding CNVs, we developed a software package, ExomeHMM, based on this statistical model. To evaluate the performance of our algorithm, we first analyzed WES data from the 1000 Genomes Project, using experimentally identified CNVs as the gold standard. Compared with other CNV detection algorithms, ExomeHMM achieves the highest overall performance as measured by F-score. To test our approach on clinical data, we applied ExomeHMM to triple-negative breast cancer data. As expected, we were able to identify genes that are significantly associated with breast cancer.
In conclusion, our statistical model is able to detect CNV regions and report biologically meaningful results on both healthy samples and cancer samples.
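The decoding step described in the abstract (Viterbi decoding of a per-exon signal, followed by calling CNV regions) can be illustrated with a minimal sketch. This is not the ExomeHMM implementation: the three-state layout, the Gaussian emission model, the state means, `SIGMA`, `STAY` and the toy `signal` values are all assumptions chosen for illustration. The abstract does not give the RRD formula, and the thesis fits its parameters with EM, whereas the parameters here are fixed by hand.

```python
import math
from itertools import groupby

# Hypothetical three-state model; the thesis's actual states are not given.
STATES = ("deletion", "normal", "duplication")
# Assumed emission means on a log2-ratio-like scale (NOT the thesis's RRD).
MEANS = {"deletion": -1.0, "normal": 0.0, "duplication": 0.58}
SIGMA = 0.25   # assumed common standard deviation
STAY = 0.98    # assumed probability of staying in the same state

LOG_START = {s: math.log(1 / 3) for s in STATES}
LOG_TRANS = {
    r: {s: math.log(STAY if r == s else (1 - STAY) / 2) for s in STATES}
    for r in STATES
}


def log_gauss(state, x):
    """Log-density of x under the assumed Gaussian emission for `state`."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely hidden-state path for the observations."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: log_start[s] + log_emit(s, obs[0]) for s in states}
    back = []  # one back-pointer table per observation after the first
    for x in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            score[s] = prev[best] + log_trans[best][s] + log_emit(s, x)
            ptr[s] = best
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def call_regions(path):
    """Merge runs of consecutive non-normal exons into (start, end, state)."""
    regions, i = [], 0
    for state, run in groupby(path):
        n = len(list(run))
        if state != "normal":
            regions.append((i, i + n - 1, state))
        i += n
    return regions


# Toy per-exon signal: three exons in the middle look like a deletion.
signal = [0.05, -0.1, 0.02, -0.9, -1.1, -0.95, 0.0, 0.1]
path = viterbi(signal, STATES, LOG_START, LOG_TRANS, log_gauss)
regions = call_regions(path)
print(path)
print(regions)
```

In this toy run, the high self-transition probability discourages isolated single-exon calls, so a state change is only accepted when several adjacent exons support it. That is the usual motivation for using an HMM instead of thresholding each exon independently.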
Keywords/Search Tags:cancer, copy number variations, next generation sequencing, whole-exome sequencing, hidden Markov model