Prediction Of Phenotypic Associated Non-coding Mutations Based On Deep Learning

Posted on:2023-07-01

Degree:Master

Type:Thesis

Country:China

Candidate:H L Jiang


GTID:2530306941993929

Subject:Biomedical engineering

Abstract/Summary:


With the rapid development of genomics, a large amount of gene-related data has accumulated, giving researchers a progressively deeper understanding of non-coding regions. In the process, mutations in non-coding regions have been found to be closely related to various diseases. However, owing to a lack of research methods, progress in the study of non-coding mutations has been slow. Disease association analysis based on bioinformatics is currently an effective means of studying non-coding mutations that affect phenotypes. This approach starts from disease phenotypes and builds disease-related biological networks, such as protein-protein interaction networks and protein-RNA-DNA interaction networks. From the biological perspective of "function, structure, and sequence", it explores the non-coding mutations associated with diseases and then determines their functions. However, many diseases are caused by a combination of multiple pathogenic factors, and the relationship between phenotype and genotype is very complex, which poses further challenges for the study of non-coding mutations. Unlike coding mutations, non-coding mutations regulate biological phenotypes and produce different effects by influencing gene expression, and gene expression is in turn regulated by epigenetic information such as transcription factors, histone modifications, and chromatin accessibility. On this basis, this thesis follows a "sequence, expression, function" line of reasoning to study the prediction of phenotype-associated non-coding mutations. The main research work is as follows:

(1) Based on the machine learning model LS-GKM (gapped k-mer SVM for large-scale datasets), transcription factors in the chromatin feature dataset were classified. During training, chromatin feature data, including transcription factors, DNase I, and histone marks, were processed with one-hot encoding and nucleotide density distribution feature encoding to obtain the chromatin feature dataset. Compared with gkm-SVM (gapped k-mer SVM), the LS-GKM model improved the median area under the ROC curve (AUROC) by 3.5% in the classification of transcription factor features.

(2) This thesis presents C-Net, an improved deep learning model based on DeepSEA. C-Net deepens the convolutional layers of the model and introduces a batch normalization layer. Combined with in silico saturation mutagenesis, it predicts phenotype-associated non-coding mutations at single-nucleotide sensitivity. We compared the performance of the machine learning model LS-GKM and the deep learning model C-Net in classifying transcription factors in the chromatin feature dataset and found that C-Net obtained higher AUROC values than LS-GKM for most chromatin features. Motif analysis showed that C-Net predicts non-coding mutations well.

(3) This thesis presents AC-Net, a deep learning model that improves on C-Net. AC-Net introduces spatial and channel self-attention mechanisms to efficiently extract channel and spatial information from sequences and, combined with in silico saturation mutagenesis, predicts phenotype-associated non-coding mutations at single-nucleotide sensitivity. We compared the classification performance of C-Net and AC-Net on chromatin data with a sequence length of 1000 bp. Although the median AUROC of the two models did not differ significantly, the median AUPRC of AC-Net was 3.6% higher than that of C-Net. Furthermore, a Wilcoxon rank-sum test showed that the proposed model preferentially predicts phenotype-associated non-coding mutations.

(4) The influence of chromatin feature datasets built from different sequence lengths on model prediction was also studied. The machine learning model LS-GKM was found to be unsuitable for long-sequence datasets because of the characteristics of its kernel. In contrast, the deep learning models performed better on datasets constructed from long sequences than on those constructed from short sequences: the median AUPRC of C-Net on the 2000 bp dataset was 6% higher than on the 1000 bp dataset.

The deep learning models constructed in this thesis perform well on both the short-sequence and the long-sequence chromatin feature datasets. Compared with the classical machine learning model, the final median AUPRC improved by 3.6% and 6%, respectively. The results show that the proposed models can predict phenotype-associated non-coding mutations well.
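The one-hot encoding and in silico saturation mutagenesis steps mentioned in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical scoring function, `toy_score`, standing in for a trained model such as C-Net or AC-Net; the function names, the A/C/G/T column order, and the all-zero encoding of unknown bases are assumptions for illustration, not details taken from the thesis.

```python
from typing import Callable, Dict, List, Tuple

BASES = "ACGT"  # assumed column order for the one-hot matrix


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA sequence as a len(seq) x 4 matrix.

    Bases outside A/C/G/T (e.g. N) become an all-zero row (an assumption).
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def saturation_mutagenesis(
    seq: str, score_fn: Callable[[List[List[int]]], float]
) -> Dict[Tuple[int, str], float]:
    """Score every possible single-nucleotide substitution of `seq`.

    Returns a mapping (position, alternative base) -> score(mutant) - score(reference).
    """
    seq = seq.upper()
    ref_score = score_fn(one_hot(seq))
    effects: Dict[Tuple[int, str], float] = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the reference base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(one_hot(mutant)) - ref_score
    return effects


def toy_score(encoded: List[List[int]]) -> float:
    """Hypothetical stand-in for a trained model: the fraction of G/C bases."""
    gc = sum(row[1] + row[2] for row in encoded)
    return gc / len(encoded)


effects = saturation_mutagenesis("ACGT", toy_score)
# Rank substitutions by the magnitude of their predicted effect.
ranked = sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

In practice, `score_fn` would be the predicted probability of a chromatin feature for the encoded sequence, and the substitutions with the largest score changes would be the candidate single-nucleotide positions the model is most sensitive to.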

Keywords/Search Tags:

noncoding mutation, machine learning, deep learning, mutation prediction, attention


Related items

1	Study On Computational Modeling Of Protein Mutation Pathogenicity
2	Applications Of Machine Learning In Biological Sequence
3	Assessment And Preliminary Performance Improvement Of Cancer Driver Missense Mutation Prediction Methods
4	Prediction Of Deleterious Synonymous Mutation Based On Undersampling Scheme
5	Subcellular Localization Prediction For RNAs And Proteins Based On Machine Learning And Deep Learning
6	Precipitation Forecast Spatiotemporal Sequence Prediction Research Based On The Fusion Of Deep Learning And Ensemble Learning
7	The Analysis Of Machine Learning-Assisted Mutation Evolution Of Creatinase
8	Effect Of Mutation On Linkage Disequilibrium And Genotype Inference And Its Detection By Machine Learning Methods
9	Research On RNA Molecular Secondary Structure Prediction Based On Machine Learning
10	Prediction Of Plant Long Noncoding RNAs Interactions With Proteins By Deep Learning