Prediction Method Of Gene Methylation Sites Based On LSTM With Compound Coding Characteristics

Posted on: 2022-11-21 | Degree: Master | Type: Thesis
Country: China | Candidate: Z Q Wang
GTID: 2480306779996479 | Subject: Biomedicine Engineering
Abstract/Summary:
All traits of human growth and development are shaped by both genes and the environment. Genes are widely involved in regulating physiological functions throughout life and play a decisive role in biological traits. Methylation, a common epigenetic modification, is closely related to gene expression and to the development of many major diseases. DNA N6-methyladenine (6mA) is one of the important epigenetic marks. Aberrant 6mA modification affects gene expression and can lead to a variety of major diseases, so predicting 6mA sites is of great significance. This thesis studies the application of deep learning models to 6mA methylation site prediction. The main contents are:

(1) The thesis studies high-dimensional numerical representations of gene sequences composed of the four bases, together with the spatial and sequential characteristics of the resulting high-dimensional data, to form an encoding suited to 41 bp sequences of A, T, C and G. Combined with the K-mer encoding method, this increases the number and types of features that can be extracted from the original sequence (a 41×1 sequence is encoded as a 40×16 matrix) and establishes a distinctive sequence encoding method.

(2) The thesis sets the goal of mining high-dimensional sequential features from gene sequences. It studies the function and use of each layer of a deep learning model, analyzes how LSTM processes high-dimensional data, and designs the connections between the model's layers accordingly. Experiments test how different initialization functions and different optimizers affect prediction accuracy. The design of the LSTM layer is improved and the model's performance on gene sequence datasets is optimized, yielding a model suited to mining site information from long sequences.

(3) Building on the composite-encoding LSTM model, the idea of transfer learning is integrated to improve cross-species prediction. An LSTM transfer learning model for predicting 6mA sites across species is designed, and transfer learning is used to address site prediction when a species has too few 6mA samples.

(4) The thesis studies performance indicators and the corresponding evaluation strategy for the model output. Several models are run on the same dataset as control groups, and the model's ability to predict 6mA sites in long sequences is evaluated in terms of accuracy, sensitivity, specificity, comprehensive evaluation indices and generalization ability. A dataset of sequences that may contain potential methylation sites was then constructed and fed to the model, and the results were compared with several online 6mA prediction tools to verify the model's reliability.

In summary, this thesis proposes a long short-term memory (LSTM) neural network for gene methylation site prediction, based on composite feature encoding that combines the K-mer and One-Hot methods. K-mer encoding increases the amount of sequence information and is combined with One-Hot encoding to form a composite encoding matrix. This increases the dimensions and types of features that the LSTM model can extract from gene sequence data, and so improves its performance on gene sequences.

Ten-fold cross-validation shows that the method achieves 93.7% accuracy on public datasets, with sensitivity, specificity and Matthews correlation coefficient of 93.0%, 94.5% and 0.875, respectively. On methylation datasets from six different species, the method obtains AUC values ranging from 0.9055 to 0.9262, which indicates that it outperforms other traditional methods. When trained on a larger dataset, the model can reach higher prediction accuracy and can be applied to predicting 6mA sites. The predictions have been checked against, and are supported by, several tools. The method offers a new research direction for 6mA site prediction, provides an effective solution for methylation site prediction with limited data, and serves as a computational aid for gene methylation prediction and research.
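
The 41×1 to 40×16 mapping described in item (1) is consistent with one-hot encoding of overlapping 2-mers: a 41 bp sequence has 41 − 2 + 1 = 40 overlapping dinucleotides, and there are 4² = 16 possible dinucleotides. The abstract does not state the value of k or the column layout, so the following is only a minimal sketch under those assumptions. The function name `encode_kmer_onehot`, the alphabetical A/C/G/T column order, and the all-zero row for ambiguous bases are illustrative choices, not the thesis's implementation.

```python
from itertools import product

BASES = "ACGT"  # assumed column ordering; the abstract does not specify one


def encode_kmer_onehot(seq, k=2):
    """Encode a DNA string as a (len(seq) - k + 1) x 4**k one-hot matrix.

    Each row corresponds to one overlapping k-mer and has a single 1 in the
    column of that k-mer. With k=2 (an assumption inferred from the 40x16
    shape in the abstract), a 41 bp sequence yields a 40 x 16 matrix.
    """
    # Map every possible k-mer to a column index, e.g. AA->0, AC->1, ..., TT->15.
    kmer_index = {"".join(p): i for i, p in enumerate(product(BASES, repeat=k))}
    seq = seq.upper()
    matrix = []
    for start in range(len(seq) - k + 1):
        row = [0] * len(kmer_index)
        col = kmer_index.get(seq[start:start + k])
        if col is not None:  # k-mers with ambiguous bases (e.g. N) stay all-zero
            row[col] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    example = "A" * 21 + "C" * 20  # synthetic 41 bp sequence, not real data
    encoded = encode_kmer_onehot(example)
    print(len(encoded), len(encoded[0]))
```

A matrix of this shape can be fed to a recurrent layer as 40 time steps with 16 features each, which matches how the abstract describes the LSTM consuming the encoded sequence.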
Keywords/Search Tags: Methylation Site Prediction, Deep Learning, Long Short-Term Memory Networks, Composite Features