| Since the beginning of this century, there have been three large-scale coronavirus outbreaks worldwide: SARS in 2003, MERS in 2012, and COVID-19 in 2019. These recurring outbreaks have seriously affected global public health. Traditional means of prevention and control are relatively passive, so it is necessary to proactively characterize the molecular features of potential future pandemic viruses using deep generative techniques, expand the explored protein sequence space of viruses, and promote the development of coronavirus-related vaccines and drugs. In this paper, we use variational autoencoders (VAEs) to construct generative models of coronavirus protein sequences, study the generation of coronavirus spike protein sequences and spike protein functional clusters, and verify and preliminarily analyze the generated sequences. The specific research contents are as follows: 1) Coronavirus spike protein sequence generative model (CoV-VAE). This model takes raw spike protein sequence data from seven coronaviruses as input, encodes them using convolutional neural networks, and decodes them using mixed convolutional networks to generate coronavirus spike protein sequences 1200 aa in length. The generated sequences are evaluated in terms of sequence similarity (81%), Shannon entropy, key sites, and coverage (greater than 85%). The results indicate that the model can effectively generate coronavirus spike protein sequences that are both reliable and diverse. 2) Coronavirus spike protein functional cluster generative model (CFC-VAE). The preprocessed spike protein functional cluster sequences are used as input to the model, and a fully connected neural network with two hidden layers is used for encoding and decoding to sample and generate functional clusters 400 aa in length. The generated sequences are then evaluated in terms of sequence distribution, Shannon entropy, key sites, and coverage (greater than 73%). The results show that the model can stably generate functional cluster sequences. In summary, we have established two generative models for coronavirus protein sequences and compared CFC-VAE with the existing ArDCA model. The results show that CFC-VAE (more than 80%) generates better sequences than ArDCA (less than 80%). By learning the features of coronavirus protein sequences and generating artificial sequences that conform to the biological characteristics of coronaviruses, the protein sequence space of coronaviruses can be expanded, providing new insights for epidemic prevention and control. |
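
Shannon entropy is one of the measures the abstract uses to assess the diversity of generated sequences. The abstract does not say how it was computed, so the following is only a minimal sketch of the standard per-position definition, assuming the sequences are already aligned to equal length. The function name `column_entropy` and the toy sequences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def column_entropy(sequences):
    """Return the Shannon entropy (in bits) of each column of aligned sequences.

    Assumption: all sequences have the same length, e.g. after alignment.
    Every character, including a gap symbol such as '-', counts as a state.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    total = len(sequences)
    entropies = []
    for position in range(length):
        # Count how often each residue occurs at this position.
        counts = Counter(seq[position] for seq in sequences)
        # H = sum over residues of -p * log2(p), with p the observed frequency.
        entropy = sum(
            -(count / total) * math.log2(count / total)
            for count in counts.values()
        )
        entropies.append(entropy)
    return entropies


# Toy example (hypothetical sequences, not real spike protein data).
toy_alignment = ["MKV", "MKL", "MRL", "MRV"]
toy_entropies = column_entropy(toy_alignment)
print(toy_entropies)
```

A fully conserved position yields an entropy of 0, while a position with evenly mixed residues yields a higher value. Comparing the per-position profiles of generated and natural sequences is one possible way to check whether a model preserves conserved sites while still producing variation elsewhere.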