| Genotype imputation is an important part of human, animal and plant genome sequence analysis, and its results feed downstream genetic analyses, including genome-wide association studies and genomic prediction. In modern genome sequencing, some single nucleotide polymorphisms (SNPs) cannot be accurately detected because of limited sequencing depth, which produces a large number of randomly missing genotypes and hinders genetic analysis based on genomic technology. Accurate imputation of missing genotypes is therefore of great importance. Most existing imputation methods rely on the linkage disequilibrium property of genetics to recover the SNP loci that were not detected during sequencing. For example, the popular Beagle and Minimac methods both estimate missing genotype values from reference template data and a genetic map of the genotypes to be imputed, but because both are linear imputation methods based on hidden Markov models (HMMs), they struggle to capture nonlinear relationships between the genotypes of different loci, which limits their imputation accuracy. Deep neural networks have strong feature-learning capability; among them, the autoencoder technique can handle a variety of missing-data problems and can be used for accurate imputation of missing genotypes. In this study, we developed an imputation method based on Residual Convolutional Denoising Autoencoders (RCDA), added a sliding-window step to it to obtain Slide Residual Convolutional Denoising Autoencoders (SRCDA), and implemented the corresponding software system. We examined imputation accuracy at 10%, 50% and 90% missing rates using genomic marker data from rice, maize and wheat. The experimental results show that SRCDA is more accurate than the SVD, KNN and random forest algorithms, the earlier deep learning method SCDA, and the currently popular imputation methods Beagle and
Minimac in all cases. SRCDA not only provides more accurate input data for a wide range of genetic analysis studies in biology, but also brings deep learning methods into the field of genotype imputation, opening the way to broader application. In this study, several technical improvements were made to the general denoising autoencoder. These include building a sparse convolutional denoising autoencoder model suited to genotype imputation, training on gene sequences through segmented sliding windows during imputation, introducing residual blocks and skip connections into the network structure, and using a dynamic learning rate. The methods are developed in this paper as follows. 1. A convolutional autoencoder model is built to impute missing markers using chunked sliding windows over the gene sequences; because the windows overlap, the imputation results from the central region of each window, where more contextual features are available, can be kept and spliced together. 2. Building on the convolutional denoising autoencoder model, we introduce residual blocks into the network so that a deeper model can be used to improve imputation accuracy and speed up training. The residual blocks use skip connections to alleviate the vanishing-gradient problem that comes with increasing depth in deep neural networks. 3. To further improve the accuracy of genotype imputation, this study proposes a method that dynamically adjusts the network's learning rate after each training epoch according to the model's training accuracy and the current epoch number. 4. Based on the model and algorithm designed above, this study develops an imputation system with an interactive, user-friendly graphical user interface for the Linux environment, which imputes missing genotype data and comprises a pre-imputation module and a direct imputation module. The pre-imputation module constructs missing data and implements simulated
imputations for complete genotype template data according to user-set missing rates, to predict how well the SRCDA model can impute the genotypes of the current species; the direct imputation module directly imputes the missing genotype data and outputs the complete genotype file. |
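
The overlapping sliding-window step described in point 1 can be sketched as follows. This is a minimal illustration, not the SRCDA implementation: the function names, the `-1` missing code, the window and stride sizes, and the rule "keep each site from the window whose centre is nearest" are all assumptions, and `fill_with_mode` is only a stand-in for a trained autoencoder.

```python
import numpy as np

def window_starts(n_sites, window, stride):
    """Start indices of overlapping windows that together cover all sites."""
    if n_sites <= window:
        return [0]
    starts = list(range(0, n_sites - window + 1, stride))
    if starts[-1] != n_sites - window:
        starts.append(n_sites - window)  # last window sits flush with the end
    return starts

def fill_with_mode(chunk):
    """Stand-in for the model: replace missing calls (-1) with the window mode."""
    chunk = np.asarray(chunk, dtype=float)
    observed = chunk[chunk >= 0]
    filled = chunk.copy()
    if observed.size:
        values, counts = np.unique(observed, return_counts=True)
        filled[chunk < 0] = values[np.argmax(counts)]
    return filled

def impute_by_windows(genotypes, impute_window, window=8, stride=4):
    """Impute window by window, keeping each site from its most central window."""
    genotypes = np.asarray(genotypes)
    n = len(genotypes)
    window = min(window, n)
    out = np.empty(n, dtype=float)
    best_dist = np.full(n, np.inf)  # distance to the centre of the chosen window
    for s in window_starts(n, window, stride):
        pred = np.asarray(impute_window(genotypes[s:s + window]), dtype=float)
        idx = np.arange(s, s + window)
        dist = np.abs(idx - (s + (window - 1) / 2))
        closer = dist < best_dist[idx]  # this window is more central for these sites
        out[idx[closer]] = pred[closer]
        best_dist[idx[closer]] = dist[closer]
    return out
```

Sites near a window edge are overwritten by a later window in which they sit closer to the centre, so only the first and last few markers of the sequence are ever taken from an edge position.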
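
The role of the skip connection in point 2 can be illustrated with a single-channel residual block in plain NumPy. This is a deliberately simplified sketch: the kernel sizes, the activation and the function names are assumptions, and the actual SRCDA network is not specified at this level in the text.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    return np.convolve(x, np.asarray(kernel, dtype=float)[::-1], mode="same")

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, kernel1, kernel2):
    """Two convolutions plus an identity skip connection: relu(x + F(x))."""
    x = np.asarray(x, dtype=float)
    h = relu(conv1d_same(x, kernel1))
    f = conv1d_same(h, kernel2)
    return relu(x + f)  # the input bypasses the convolutions and is added back
```

When the convolutional branch contributes nothing (all-zero kernels), the block reduces to `relu(x)`, so the signal, and during training the gradient, still has a direct path through the layer. That direct path is what allows residual blocks to be stacked deeply.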
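
Point 3 states that the learning rate is adjusted from the training accuracy and the current epoch but does not give the rule. The sketch below shows one hypothetical rule of that kind: an epoch-based decay combined with a cut whenever accuracy stops improving. The class name, the formula and every default value are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class DynamicLearningRate:
    """Hypothetical schedule driven by epoch number and training accuracy."""
    base_lr: float = 1e-3
    decay: float = 0.05           # strength of the epoch-based decay
    plateau_factor: float = 0.5   # multiplier applied when accuracy stalls
    min_delta: float = 1e-3       # smallest gain that counts as progress
    min_lr: float = 1e-6
    _scale: float = 1.0
    _best_accuracy: float = float("-inf")

    def step(self, epoch: int, accuracy: float) -> float:
        """Return the learning rate to use after finishing the given epoch."""
        if accuracy < self._best_accuracy + self.min_delta:
            self._scale *= self.plateau_factor  # no real improvement: cut the rate
        self._best_accuracy = max(self._best_accuracy, accuracy)
        lr = self.base_lr * self._scale / (1.0 + self.decay * epoch)
        return max(lr, self.min_lr)
```

A training loop would call `step(epoch, accuracy)` once per epoch and pass the returned value to the optimizer.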
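
The pre-imputation module's workflow (mask a complete genotype matrix at a user-set missing rate, impute, then score the result on the masked entries) can be sketched as below. The `-1` missing code, the function names and the column-mode baseline are assumptions; in the real system the trained SRCDA model would take the place of `column_mode_impute`.

```python
import numpy as np

MISSING = -1  # assumed code for a missing genotype call

def mask_genotypes(genotypes, missing_rate, rng):
    """Set a random fraction of entries to MISSING; return the copy and the mask."""
    genotypes = np.asarray(genotypes)
    n_missing = int(round(missing_rate * genotypes.size))
    flat_idx = rng.choice(genotypes.size, size=n_missing, replace=False)
    mask = np.zeros(genotypes.size, dtype=bool)
    mask[flat_idx] = True
    mask = mask.reshape(genotypes.shape)
    masked = genotypes.copy()
    masked[mask] = MISSING
    return masked, mask

def imputation_accuracy(truth, imputed, mask):
    """Fraction of masked entries whose imputed genotype equals the true one."""
    if not mask.any():
        return float("nan")
    return float(np.mean(np.asarray(truth)[mask] == np.asarray(imputed)[mask]))

def column_mode_impute(masked):
    """Baseline stand-in for a model: fill each marker with its most common call."""
    filled = masked.copy()
    for j in range(masked.shape[1]):
        col = masked[:, j]
        observed = col[col != MISSING]
        if observed.size:
            values, counts = np.unique(observed, return_counts=True)
            filled[col == MISSING, j] = values[np.argmax(counts)]
    return filled

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = rng.integers(0, 3, size=(20, 50))  # samples x markers, coded 0/1/2
    for rate in (0.1, 0.5, 0.9):
        masked, mask = mask_genotypes(truth, rate, rng)
        acc = imputation_accuracy(truth, column_mode_impute(masked), mask)
        print(f"missing rate {rate:.0%}: accuracy {acc:.3f}")
```

Scoring only the masked entries keeps the estimate honest, because the observed entries are trivially correct and would inflate the accuracy.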