Genotype Imputation Method Based On Deep Learning

Posted on: 2021-05-01; Degree: Master; Type: Thesis
Country: China; Candidate: L Yin; Full Text: PDF
GTID: 2370330623465034; Subject: Electronic and communication engineering
Abstract/Summary:
A genome-wide association study (GWAS) searches the whole human genome for existing sequence variation, namely single nucleotide polymorphisms (SNPs). GWAS usually focuses on the association between SNPs and traits such as major human diseases, but it can also be applied to any other genetic variation and to the genes and heritable traits of any other organism. Whole-genome sequencing supplies the genetic data for GWAS and is an indispensable data source for genetic analysis. In modern genome sequencing, many SNPs are missing because of the limits of gene detection technology, which makes genome-wide association analysis more difficult. Genotype imputation increases the power of genome-wide association analysis and compensates for the gaps caused by missing genotypes.

The current practice is to use software that exploits the linkage disequilibrium among the genotypes themselves to impute, as far as possible, the SNPs that could not be detected during sequencing. IMPUTE2 is a program for phasing observed genotypes and estimating missing genotypes. Minimac is a low-memory, computationally efficient implementation of the MaCH algorithm for genotype imputation; it works on phased genotypes and can handle very large reference panels with hundreds or thousands of haplotypes. These are linear imputation methods based on hidden Markov models (HMMs). Because the relationships between genotype sites are inherently non-linear, linear methods are limited in imputation accuracy, especially at low-frequency and very-low-frequency sites. Moreover, for large reference samples, HMM-based methods usually learn the transition probabilities from a subsample of the reference panel to save time, which costs some imputation accuracy. The genotype-based linear methods are also slow: imputing one genotype dataset typically takes several computers and hours.

To address this, we propose reconstructing the missing genotypes with deep learning. The method consists of an encoder and a decoder that together form a fully convolutional reconstruction network. The encoder uses a CNN to compress the reference samples and the samples to be imputed into multi-channel feature vectors. The decoder up-samples and reconstructs these feature vectors and finally outputs the imputed genotypes. The encoder layers and decoder layers are linked by skip connections. The method uses a binary classification loss function and evaluates both the loss on missing values and the loss on non-missing values after reconstruction.

The innovations of this thesis are:

1. A U-net network is used to reconstruct the genotype sequence. U-net is mainly used for image segmentation in computer vision; it extracts image features well and can reconstruct images accurately. U-net is fully convolutional and adopts an encoder-decoder structure. Full convolution keeps the position of each element unchanged between input and output, which is why it is commonly used for reconstruction tasks.
2. The GPU-based U-net imputation method reduces genotype imputation time, shortening the imputation time of the traditional method by two orders of magnitude and leaving more opportunities for iteration in genome-wide association analysis.
3. The U-net-based imputation method can effectively capture the non-linear associations between genotype sites. With a large reference sample, it achieves good imputation accuracy without subsampling the reference panel.

Experiments show that on a GPU the imputation time of this method is greatly reduced compared with a CPU, and that its accuracy matches that of the current state-of-the-art imputation methods when the sample size is small. Analysis of the neural network structure indicates that, with large samples, the imputation accuracy of this method will exceed that of the HMM-based linear methods. However, because sufficiently large samples are not yet available, the large-sample accuracy remains to be improved in future experiments. Finally, the imputation efficiency of the HMM method on a CPU cluster is compared with that of the GPU-based deep learning method. In terms of server computing power and turnaround time, GPU-based deep learning is the more economical option.
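The abstract describes a binary classification loss that scores both the missing and the non-missing positions after reconstruction, but it does not give the formula. The following is a minimal sketch of one way such a loss could be written, not the thesis's actual implementation. It assumes that alleles are encoded as 0/1, that the network outputs a probability per site, and that the two terms are combined as a weighted sum. The function name `masked_bce` and the weights `w_missing` and `w_observed` are hypothetical.

```python
import math


def masked_bce(pred, target, observed, w_missing=1.0, w_observed=1.0, eps=1e-7):
    """Binary cross-entropy split into missing and observed sites.

    pred     -- predicted probability of allele 1 at each site (assumed encoding)
    target   -- true allele (0 or 1) at each site
    observed -- True where the site was observed in the input,
                False where it was masked (missing) and had to be imputed
    Returns (weighted total, mean loss on missing sites, mean loss on observed sites).
    The weighted-sum combination is an assumption, not taken from the thesis.
    """
    sum_missing, sum_observed = 0.0, 0.0
    n_missing, n_observed = 0, 0
    for p, t, o in zip(pred, target, observed):
        # Clamp probabilities so that log() never receives 0.
        p = min(max(p, eps), 1.0 - eps)
        bce = -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        if o:
            sum_observed += bce
            n_observed += 1
        else:
            sum_missing += bce
            n_missing += 1
    # Average each group separately; an empty group contributes zero.
    mean_missing = sum_missing / n_missing if n_missing else 0.0
    mean_observed = sum_observed / n_observed if n_observed else 0.0
    total = w_missing * mean_missing + w_observed * mean_observed
    return total, mean_missing, mean_observed


# Toy example: five sites, two of which were masked before reconstruction.
pred = [0.9, 0.2, 0.6, 0.1, 0.8]
target = [1, 0, 1, 0, 0]
observed = [True, True, False, True, False]
total, loss_missing, loss_observed = masked_bce(pred, target, observed)
print(total, loss_missing, loss_observed)
```

Keeping the two averages separate makes it possible to monitor how well the network imputes masked sites independently of how well it reproduces the sites it was given. Raising `w_missing` would put more emphasis on the imputation task itself.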
Keywords/Search Tags: Genotype Imputation, U-net Network, Gene Sequence Reconstruction, Running Time