Research on Computation and Analysis Methods for DNA Methylation Data

Posted on: 2022-11-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Tian
GTID: 1480306764959109
Subject: Control Science and Engineering
Abstract/Summary:
Epigenetics is the study of molecular processes that influence the flow of information between a constant DNA sequence and variable gene expression patterns. DNA methylation, a crucial epigenetic modification, regulates gene expression by changing chromatin structure, DNA stability, and the interaction between DNA and proteins. Advances in DNA methylation detection technology have produced a large amount of data, and the methylation data generated by different detection techniques characterize the biological mechanisms of this epigenetic modification from different perspectives, which provides new opportunities to reveal the regulatory mechanisms of DNA methylation in depth. Based on the available DNA methylation data, this dissertation explores the biological significance of DNA methylation and extracts the biological value contained in DNA methylation data by developing corresponding computation and analysis methods. The main research contents include the prediction and modeling of genome-wide site-level methylation, the identification of differentially methylated loci, and the clustering analysis of single-cell methylation data. The details are as follows:

(1) Previous prediction strategies for DNA methylation require features to be manually screened or predefined. To address this, a regression model for genome-wide site-level methylation based on deep convolutional neural networks (MRCNN) is proposed. Based on the linkage between target CpG loci and adjacent DNA sequence patterns, MRCNN takes each target CpG locus as the centre and encodes its adjacent DNA sequence into a sparse matrix as input. Predictive features related to site-level methylation are extracted by a two-dimensional convolutional neural network built for DNA sequence patterns, and a regression output for the target CpG loci is obtained by combining the network with a continuous loss function. Experimental results on multiple datasets showed that MRCNN predicted genome-wide site-level methylation status more accurately than previous methods. Moreover, analysis of the DNA sequence features automatically learned by MRCNN identified sequence motifs related to methylation status, some of which significantly matched known annotated motifs and might play a key role in the regulation of DNA methylation.

(2) Previous methods for identifying differentially methylated loci (DML) have poor robustness. To address this, a hybrid ensemble feature selection approach for identifying robust DML (HyDML) is proposed. By jointly considering functional diversity and data diversity in the ensemble feature selection strategy, HyDML applies a variety of basic feature selection algorithms to multiple data subsets to obtain candidate DML subsets, and then identifies robust DML through aggregation functions. When HyDML was applied to 13 cancer-related methylation datasets, the DML it identified distinguished normal and cancer samples more accurately than those of other methods and were more robust. Furthermore, comprehensive analysis of the robust DML identified by HyDML revealed that different types of cancer had similar methylation patterns, and that the robust DML shared across many cancer types could be regarded as potential pan-cancer biomarkers.

(3) Previous methods for clustering single-cell methylation data rely on a single distance measure to describe the methylation differences between cells, which limits clustering performance. To address this, a multi-distance based spectral embedding fusion approach for clustering single-cell methylation data (SINCEF) is proposed. SINCEF uses spectral embedding and matrix fusion to integrate multiple cell-to-cell methylation distance relationships, each defined by a different basic distance measure, into a new distance measure. This quantifies cellular heterogeneity at higher resolution, and cell types are then identified with a hierarchical clustering algorithm. Experimental results on several real single-cell methylation datasets showed that SINCEF significantly improved clustering accuracy compared with methods based on single distance measures. Moreover, thanks to the new distance measure, the cell subpopulation structure could be assessed intuitively from the cell-to-cell distance matrix, which improves the readability of the clustering results.

(4) Previous single-cell methylation data clustering methods show unstable clustering performance across datasets. To address this, an enhanced consensus-based clustering model for single-cell methylation data (scMelody) is proposed. Starting from the basic clustering results generated by various cell-to-cell similarity measures, scMelody uses the proposed regularization strategy and dual weighting strategy to improve the construction of the consensus matrix used in traditional consensus clustering, so as to reconstruct the methylation similarity patterns between cells for clustering. Experimental results on multiple real and synthetic datasets showed that scMelody achieved better clustering performance than previous methods and was more stable across datasets with different cell numbers, cluster numbers, and CpG dropout proportions. Furthermore, real case studies showed that scMelody could identify rare cell clusters in large datasets with complex cell composition.
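The input encoding described in item (1) can be illustrated with a minimal sketch. The abstract says only that the sequence around each target CpG is encoded into a sparse matrix; the one-hot scheme, the A/C/G/T row order, the flank length, and the function name `one_hot_window` below are all assumptions made for illustration, not details taken from the dissertation.

```python
# Illustrative sketch only: one-hot encoding is an assumed realisation of the
# "sparse matrix" input mentioned in the abstract. Names are hypothetical.
BASES = "ACGT"  # assumed row order


def one_hot_window(sequence, cpg_index, flank):
    """Return a 4 x (2 * flank + 2) one-hot matrix centred on a CpG.

    cpg_index is the position of the C in the CpG dinucleotide. Positions
    outside the sequence and bases other than A/C/G/T give all-zero columns.
    """
    if sequence[cpg_index:cpg_index + 2].upper() != "CG":
        raise ValueError("cpg_index does not point at a CpG dinucleotide")
    start = cpg_index - flank
    end = cpg_index + 2 + flank
    matrix = [[0] * (end - start) for _ in BASES]
    for col, pos in enumerate(range(start, end)):
        if 0 <= pos < len(sequence):
            row = BASES.find(sequence[pos].upper())
            if row >= 0:
                matrix[row][col] = 1
    return matrix


# Window "TACGGA": two flanking bases on each side of the CpG at index 3.
example = one_hot_window("TTACGGAN", cpg_index=3, flank=2)
for base, row in zip(BASES, example):
    print(base, row)
```

Each column contains at most one non-zero entry, which is why the matrix is sparse. A convolutional network would consume one such matrix per target CpG.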
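The hybrid ensemble idea in item (2), several base selectors applied to several data subsets and then aggregated, can be sketched as follows. The abstract does not name HyDML's base feature selection algorithms or its aggregation functions, so this sketch substitutes two simple stand-in scorers, stratified bootstrap subsets, and mean-rank aggregation. All function names and the toy data are hypothetical.

```python
# Illustrative sketch only: the scorers, the bootstrap scheme and the
# mean-rank aggregation are stand-ins, not the components used by HyDML.
import random
from statistics import mean, pstdev


def split_by_class(values, labels):
    normal = [v for v, y in zip(values, labels) if y == 0]
    cancer = [v for v, y in zip(values, labels) if y == 1]
    return normal, cancer


def mean_diff_score(values, labels):
    """Absolute difference between class means."""
    normal, cancer = split_by_class(values, labels)
    return abs(mean(normal) - mean(cancer))


def scaled_diff_score(values, labels):
    """Mean difference divided by within-class spread plus a fixed floor."""
    normal, cancer = split_by_class(values, labels)
    spread = pstdev(normal) + pstdev(cancer) + 0.05
    return abs(mean(normal) - mean(cancer)) / spread


def rank_features(data, labels, scorer):
    """Map feature index to rank (0 = most discriminative) for one scorer."""
    n_features = len(data[0])
    scores = [scorer([row[j] for row in data], labels) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: -scores[j])
    return {j: rank for rank, j in enumerate(order)}


def ensemble_select(data, labels, n_subsets=10, top_k=2, seed=0):
    """Return the top_k feature indices by mean rank over all runs."""
    rng = random.Random(seed)
    scorers = [mean_diff_score, scaled_diff_score]  # functional diversity
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    n_features = len(data[0])
    total_rank = [0.0] * n_features
    for _ in range(n_subsets):  # data diversity
        # Stratified bootstrap keeps both classes present in every subset.
        sample = [rng.choice(idx0) for _ in idx0] + [rng.choice(idx1) for _ in idx1]
        sub_data = [data[i] for i in sample]
        sub_labels = [labels[i] for i in sample]
        for scorer in scorers:
            for j, rank in rank_features(sub_data, sub_labels, scorer).items():
                total_rank[j] += rank
    n_runs = n_subsets * len(scorers)
    return sorted(range(n_features), key=lambda j: total_rank[j] / n_runs)[:top_k]


# Toy methylation values: rows are samples, columns are CpG loci.
beta = [
    [0.10, 0.50, 0.52, 0.30],
    [0.12, 0.48, 0.50, 0.32],
    [0.11, 0.51, 0.49, 0.31],
    [0.90, 0.50, 0.51, 0.60],
    [0.88, 0.49, 0.50, 0.62],
    [0.91, 0.52, 0.51, 0.61],
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = normal, 1 = cancer
selected = ensemble_select(beta, labels)
print(selected)
```

On this toy data, loci 0 and 3 separate the two classes in every bootstrap subset under both scorers, so they are selected regardless of the seed. Loci that rank highly only in a few runs are pushed down by the averaging, which is the sense in which aggregation favours robust selections.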
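For item (4), the consensus matrix of traditional consensus clustering, which scMelody is described as improving, can be sketched as a co-association matrix. This sketch does not reproduce scMelody's regularization or dual weighting strategies, which the abstract does not specify. The optional per-clustering `weights` argument is a generic assumption, and `consensus_matrix` is a hypothetical name.

```python
# Illustrative sketch only: a plain (optionally weighted) co-association
# matrix, not scMelody's regularised, dual-weighted construction.
def consensus_matrix(partitions, weights=None):
    """Return the weighted fraction of clusterings that co-cluster each pair.

    partitions is a list of label lists, one per base clustering, all of the
    same length (one label per cell).
    """
    n_cells = len(partitions[0])
    if any(len(labels) != n_cells for labels in partitions):
        raise ValueError("all partitions must label the same cells")
    if weights is None:
        weights = [1.0] * len(partitions)
    total = sum(weights)
    matrix = [[0.0] * n_cells for _ in range(n_cells)]
    for labels, weight in zip(partitions, weights):
        for i in range(n_cells):
            for j in range(n_cells):
                if labels[i] == labels[j]:
                    matrix[i][j] += weight / total
    return matrix


# Four base clusterings of four cells, e.g. from different similarity measures.
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]
consensus = consensus_matrix(partitions)
for row in consensus:
    print(row)
```

Entry (i, j) is the share of base clusterings that place cells i and j in the same cluster; the label values themselves do not need to agree across clusterings. A final clustering can then be run on this matrix as a cell-to-cell similarity.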
Keywords/Search Tags: DNA Methylation, Prediction and Modeling, Differentially Methylated Loci, Single-cell Methylation Data Clustering