
Characterization And Machine Learning Prediction Of Allele-Specific DNA Methylation

Posted on: 2016-06-30
Degree: Master
Type: Thesis
Country: China
Candidate: J L He
Full Text: PDF
GTID: 2180330464473186
Subject: Computer technology
Abstract/Summary:
Both genomic imprinting and sequence differences between alleles can result in allele-specific DNA methylation (ASM). Imprinted loci, which show parent-of-origin-dependent ASM (P-ASM), have been found to be limited in number in the genome, whereas a large reservoir of sequence-dependent ASM (S-ASM) loci has been discovered. Based on the methylation status of neighboring CpGs, S-ASM CpG sites can be further classified into scattered S-ASM (sS-ASM) and clustered S-ASM (cS-ASM). Recent advances have provided sufficient validated data to support the development of statistical models for ASM classification and prediction.

In this study, we collected 1,952 P-ASM sites, 9,030 cS-ASM sites, and 122,735 sS-ASM sites. Analysis of the methylation level distributions showed that the three types of ASM events have distinct methylation patterns. P-ASM and cS-ASM CpG sites are both enriched in CpG-rich regions, promoters, and exons, while sS-ASM CpG sites are enriched in simple repeats and in regions with a high frequency of SNPs. Using a lasso-type approach, which is better suited to predictor selection than SVM-based Recursive Feature Elimination (SVM-RFE), we selected 21 of 282 features that are powerful in distinguishing cS-ASM CpG sites and used them to train classifiers with machine learning techniques. Based on 5-fold cross-validation, the logistic regression classifier performed best, with an accuracy (ACC) of 0.77, an area under the ROC curve (AUC) of 0.84, and a Matthews correlation coefficient (MCC) of 0.54 for the prediction of cS-ASM CpG sites on mouse brain autosomes. When prediction of cS-ASM was restricted to individual functional regions, classification performance was best in repeat regions (AUC of 0.89), and predictive accuracy was best in CpG island (CGI) regions (ACC of 0.83). Finally, we applied the logistic regression classifier trained on the mouse methylome to the human brain methylome, achieving a positive predictive value (PPV) of 0.85 and a false positive rate (FPR) of 0.06, and identified 608 genes containing predicted cS-ASM sites. GO term enrichment analysis indicated that cS-ASM-associated genes are significantly enriched among genes coding for transcripts with alternative splicing forms.

This study provides an analytical procedure for cS-ASM prediction and sheds new light on the understanding of cS-ASM events.
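
The evaluation measures quoted in the abstract (ACC, AUC, MCC, PPV, FPR) follow standard definitions for binary classification. The sketch below is a minimal illustration in plain Python, assuming a single binary confusion matrix and per-site classifier scores. The function names, counts, and scores are hypothetical and are not taken from the thesis, which does not describe its evaluation code.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute ACC, PPV, FPR and MCC from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total                      # fraction of correct calls
    ppv = tp / (tp + fp) if (tp + fp) else 0.0   # precision of positive calls
    fpr = fp / (fp + tn) if (fp + tn) else 0.0   # negatives wrongly called positive
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PPV": ppv, "FPR": fpr, "MCC": mcc}


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the probability that a positive site outscores a negative one.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical counts and scores, for illustration only (not thesis data).
metrics = binary_metrics(tp=80, fp=20, tn=80, fn=20)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(metrics, round(auc, 3))
```

MCC is useful alongside ACC when classes are imbalanced, since it uses all four cells of the confusion matrix. The pairwise AUC loop is quadratic in the number of sites, so a rank-based implementation would be preferable for data sets of the size described here.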
Keywords/Search Tags: ASM, SNP, machine learning algorithms, logistic regression classifier