Identification of Disease-Related Genes Based on Machine Learning

Posted on: 2021-05-03
Degree: Master
Type: Thesis
Country: China
Candidate: X Y Chen
Full Text: PDF
GTID: 2404330611495923
Subject: Medicinal chemistry
Abstract/Summary:
Disease is a serious threat to human health and life, and a common challenge for governments and for medical research and development institutions. Precision medicine is developing rapidly in China, and mining disease-related genes, exploring gene functions, and comprehensively understanding the pathogenesis of diseases are the only way to move towards it. Identifying disease-related genes in the human genome is an important and challenging task in chemical, biological, medical, and pharmaceutical research. It is the first step in uncovering the molecular basis of a disease and can improve understanding of gene-function interactions and biologically related pathways. It is also an important step towards understanding pathogenesis and finding therapeutic targets, and it can help address key problems in systems medicine such as disease etiology research, new drug development, and drug design. Disease symptoms and protein sequence information are important resources for understanding the complex relationship between diseases and genes. Using machine learning to identify disease-related genes helps researchers narrow their search, prioritize in-depth verification experiments, and accelerate the identification of disease-related genes. The main content of this thesis is as follows:

1. In line with the subject of the thesis, the importance of genes in disease occurrence, the significance of identifying disease-related genes for the biomedical and pharmaceutical fields, and the research status of disease-related gene identification methods and machine learning technology are reviewed.

2. A new method for predicting potential disease-related genes is proposed, based on a deep convolutional neural network. First, genes are characterized by the primary structure of their proteins and diseases by their clinical symptoms. Then, a two-dimensional grayscale image is constructed from the clinical symptoms of the disease and the primary structure information of the protein to represent the association between the disease and the gene. Finally, a deep convolutional neural network is used to build a model that predicts potential disease-related genes. On the training set, the accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient are 92.29%, 91.52%, 93.06%, 92.95%, and 0.8459, respectively; on the test set they are 80.63%, 80.12%, 81.14%, 80.95%, and 0.6125. These results show that the method has good classification performance and robustness. In addition, the endometriosis-related predictions among the top 50 disease-gene association pairs predicted by the model were verified by literature, molecular simulation, and enrichment analysis, indicating the effectiveness of the method. The method offers a new approach to exploring the complex relationship between diseases and genes.

3. Because network topological characteristics correlate strongly with biological functional characteristics, a disease-related gene identification method based on the topological characteristics of a gene-disease heterogeneous network is proposed. First, a disease similarity network is constructed from known shared disease genes and clinical symptom information, and the disease similarity network and a protein interaction network are integrated into a gene-disease heterogeneous network through known disease-gene associations. Guided by the “guilt by association” principle, the node features in the network are weighted. Then, the topological features of the heterogeneous network nodes are extracted using graph theory. A reliable negative sample set is screened through protein sequence alignment and disease similarity calculation. Finally, a random forest algorithm is used to build a model that identifies potential disease-gene associations. In cross-validation, the accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient were 96.45%, 93.65%, 99.25%, 99.20%, and 0.9304, respectively, showing excellent classification performance. Separate predictive models were constructed for lung cancer, leukemia, Alzheimer's disease, and vitiligo, and most of the top ten predictions have been confirmed by the literature. Molecular simulation was used to further confirm the association between the gene ADA and Alzheimer's disease. This indicates that the method can be used effectively to predict disease-gene associations.
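
The five evaluation metrics quoted in the abstract all follow from the four cells of a binary confusion matrix. As a minimal sketch of how they relate, the Python below computes them from raw counts. The function name `binary_metrics` and the counts are hypothetical: the abstract does not give sample sizes, so the counts assume 10,000 positive and 10,000 negative samples and are chosen only so that the rates match the reported training-set figures for the convolutional model.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fn, tn, fp):
    """Compute five standard metrics from binary confusion-matrix counts.

    tp/fn: positive samples predicted as positive/negative.
    tn/fp: negative samples predicted as negative/positive.
    """
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": _ratio(tp + tn, tp + fn + tn + fp),
        "sensitivity": _ratio(tp, tp + fn),   # recall on positives
        "specificity": _ratio(tn, tn + fp),   # recall on negatives
        "precision": _ratio(tp, tp + fp),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts (NOT from the thesis): a balanced set of 10,000
# positives and 10,000 negatives, picked to match the reported rates.
metrics = binary_metrics(tp=9152, fn=848, tn=9306, fp=694)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts, the sketch gives the same accuracy, sensitivity, specificity, precision, and Matthews correlation coefficient as the training-set values quoted above. In all three reported result sets the accuracy equals the mean of sensitivity and specificity, which is what one would expect if the positive and negative sample sets were the same size, although the abstract does not state this.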
Keywords/Search Tags: Disease symptoms, Protein sequence, Machine learning, Convolutional neural network, Molecular simulation