With the rapid development of genomics, a large volume of gene-related data has accumulated, deepening our understanding of non-coding regions. Mutations in non-coding regions have in turn been found to be closely associated with a variety of diseases. However, for lack of suitable research methods, progress in the study of non-coding mutations has been slow. Disease association analysis based on bioinformatics is currently an effective way to study non-coding mutations that affect phenotypes. This approach starts from the disease phenotype and builds disease-related biological networks, such as protein-protein interaction networks and protein-RNA-DNA interaction networks. Following the biological perspective of "function, structure, sequence", it identifies the non-coding mutations associated with a disease and then determines their function. However, many diseases are caused by a combination of pathogenic factors, and the relationship between phenotype and genotype is highly complex, which makes the study of non-coding mutations more challenging. Unlike coding mutations, non-coding mutations affect biological phenotypes by altering gene expression, and gene expression is in turn regulated by epigenetic information such as transcription factors, histone modifications and chromatin accessibility. On this basis, this paper follows a "sequence, expression, function" line of reasoning to study the prediction of phenotype-associated non-coding mutations. The main work is as follows:

(1) Transcription factors in a chromatin feature dataset were classified with the machine learning model LS-GKM (gapped k-mer SVM for large-scale datasets). For training, chromatin feature data covering transcription factors, DNase I and histone marks were processed with one-hot encoding and nucleotide density distribution feature encoding to obtain the chromatin feature dataset. Compared with gkm-SVM (gapped k-mer SVM), LS-GKM improved the median area under the ROC curve (AUROC) by 3.5% in the classification of transcription factor features.

(2) C-Net, an improved deep learning model based on DeepSEA, is proposed. C-Net increases the convolutional depth of the model and introduces a batch normalization layer. Combined with in silico saturation mutagenesis, it predicts phenotype-associated non-coding mutations at single-nucleotide sensitivity. We compared the machine learning model LS-GKM and the deep learning model C-Net on the classification of transcription factors in the chromatin feature dataset and found that C-Net achieved higher AUROC values than LS-GKM for most chromatin features. Motif analysis showed that C-Net predicts non-coding mutations well.

(3) AC-Net, a deep learning model that improves on C-Net, is proposed. AC-Net introduces spatial and channel self-attention mechanisms to extract the channel and spatial information of the sequence efficiently and, combined with in silico saturation mutagenesis, predicts phenotype-associated non-coding mutations at single-nucleotide sensitivity. We compared the classification performance of C-Net and AC-Net on chromatin data with a sequence length of 1000 bp. Although the median AUROC of the two models did not differ significantly, the median AUPRC of AC-Net was 3.6% higher than that of C-Net. Furthermore, a Wilcoxon rank-sum test showed that the proposed model preferentially predicts phenotype-associated non-coding mutations.

(4) The effect of sequence length in the chromatin feature dataset on model prediction was also studied. The machine learning model LS-GKM was found to be unsuitable for long-sequence datasets because of the properties of its kernel. The deep learning models, in contrast, perform better on datasets built from long sequences than on those built from short sequences: the median AUPRC of C-Net on the 2000 bp dataset is 6% higher than on the 1000 bp dataset.

The deep learning models constructed in this paper perform well on both the short-sequence and the long-sequence chromatin feature datasets. Compared with the classical machine learning model, the final median AUPRC improves by 3.6% and 6%, respectively. The results show that the proposed model predicts phenotype-associated non-coding mutations well.
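
The one-hot encoding and in silico saturation mutagenesis mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `one_hot`, `saturation_mutagenesis` and `toy_score` are hypothetical, and `toy_score` (a simple motif counter) is only a stand-in for a trained model such as C-Net or AC-Net, whose actual interface is not described here.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA sequence as rows of four 0/1 values in A, C, G, T order.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


def saturation_mutagenesis(seq, score_fn):
    """Score every possible single-nucleotide substitution of seq.

    Returns a dict mapping (position, alternative_base) to
    score_fn(mutant) - score_fn(reference).
    """
    seq = seq.upper()
    ref_score = score_fn(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the reference base itself
            mutant = seq[:pos] + alt + seq[pos + 1:]
            effects[(pos, alt)] = score_fn(mutant) - ref_score
    return effects


def toy_score(seq):
    """Hypothetical stand-in for a trained model: counts GATA motifs."""
    return float(seq.count("GATA"))


if __name__ == "__main__":
    effects = saturation_mutagenesis("AGATAC", toy_score)
    # List substitutions from most negative to most positive effect.
    for (pos, alt), delta in sorted(effects.items(), key=lambda kv: kv[1]):
        print(pos, alt, delta)
```

In a real setting, `score_fn` would presumably wrap a trained network that takes the one-hot matrix as input and returns the predicted probability of a chromatin feature, so that substitutions with large score changes mark nucleotides to which the prediction is most sensitive.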