Protein Sequences Design Based On Deep Generative Model

Posted on: 2024-05-22
Degree: Master
Type: Thesis
Country: China
Candidate: Y Li
GTID: 2530307121983709
Subject: Computer software and theory
Abstract/Summary:
In recent years, many novel approaches based on deep learning models have been used to design proteins by learning sequence-function relationships from protein sequences and screening data. A protein's sequence carries a wealth of information: it determines the protein's folding state in three-dimensional space and its function in nature. However, most current protein models cannot adequately capture the interrelationships between distant sites in long protein sequences. This limitation prevents a model from fully capturing evolution-related biological characteristics and hinders the discovery of new protein variants with better properties. Therefore, this thesis uses deep generative models to study the encoding representation of longer sequences and the optimized generation of proteins. The main research is as follows:

(1) To address the problem that the encoding representation cannot fully capture the relationships between sites in long sequences, this thesis proposes a protein sequence representation model based on Temporal Variational Auto-Encoders (TVAE). TVAE consists of an encoder and a decoder. The encoder uses dilated causal convolution to expand the receptive field of neurons in the network, which improves the encoding representation of longer sequences. The decoder decodes the sampled data into variants similar to the original input sequence. Together, the encoder and decoder realize the encoding and decoding of long protein sequences. In a comparison with other models on protein fitness prediction, the experimental results show that TVAE performs better in encoding long protein sequences, with a higher Pearson correlation coefficient against the ground-truth values and a lower mean absolute deviation (MAD). In addition, protein sequences of different lengths are input into TVAE to compare the encoding representations, and the results again show that TVAE represents longer sequences better. These experimental results show that the TVAE model has clear advantages in learning the representation of longer sequences.

(2) To address the problem that generated protein sequences differ too much from natural sequences, this thesis proposes TVAE-WGAN, a model that fuses temporal variational autoencoders with generative adversarial networks. Although TVAE improves the ability to encode and represent longer sequences, the new variants it generates still leave room for improvement. Therefore, the TVAE-WGAN model is proposed in Section IV. It consists of three parts: an encoder, a generator, and a discriminator. First, the encoder of TVAE converts long protein sequences into a low-dimensional continuous latent space representation. Then, the decoder of TVAE is used as the generator of WGAN, and the generator is trained by sampling from the latent space learned by TVAE. Finally, a discriminator is added after the generator to improve the generation quality. The generated data and the real data are input into the WGAN discriminator to train its ability to tell them apart. Through the adversarial training between the discriminator and the generator, the generation ability of the generator is further improved, bringing its outputs closer to natural sequences. Experiments show that TVAE-WGAN has better generation ability than VAE and WGAN, yields more desirable variants, and generates protein sequences that are more similar to the original sequences.
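
The abstract states that the TVAE encoder uses dilated causal convolution to expand the receptive field. As a rough illustration of that general technique only, and not of the thesis's actual encoder, the pure-Python sketch below shows how stacking dilated causal layers widens the span of input positions that one output position can see. The function names, the kernel size of 2, the all-ones weights, and the dilation schedule 1, 2, 4, 8 are all assumptions chosen for the example.

```python
def dilated_causal_conv1d(x, weights, dilation):
    """Illustrative 1-D dilated causal convolution (hypothetical helper).

    y[t] = sum_k weights[k] * x[t - k * dilation]
    Positions before the start of the sequence are treated as zero,
    so y[t] never depends on x[t + 1], x[t + 2], ...
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal layers.

    Each layer with dilation d adds (kernel_size - 1) * d positions.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


# Assumed example settings: kernel size 2, dilations doubling per layer.
dilations = [1, 2, 4, 8]
kernel = [1.0, 1.0]

# Feed a unit impulse through the stack and see how far it spreads.
signal = [1.0] + [0.0] * 19
for d in dilations:
    signal = dilated_causal_conv1d(signal, kernel, d)

# Number of output positions influenced by the single input at index 0.
reach = sum(1 for v in signal if v != 0.0)
```

With these assumed settings, four layers let a single position influence 16 consecutive outputs, matching `receptive_field(2, [1, 2, 4, 8])`. Four undilated layers of the same kernel size would cover only 5 positions, which is why dilation is a common choice when distant sites in a long sequence need to interact.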
Keywords/Search Tags: deep learning, generative model, variational autoencoder, temporal convolutional network, generative adversarial network