Variant calling is an important research problem in bioinformatics. Single nucleotide polymorphisms (SNPs) and indels (insertions and deletions) are very common types of genomic variation, and the detection of both has been studied in increasing depth as sequencing technology has developed. Existing variant calling methods perform well on second-generation sequencing datasets but struggle to achieve good performance on third-generation sequencing datasets, so their accuracy needs to be improved. Meanwhile, deep learning has advanced rapidly in recent years; it has proved highly effective in image classification, where it extracts features from images automatically. This paper studies SNP and indel calling using deep learning. Variant calling is treated as a multi-class classification and regression problem, and a convolutional neural network (CNN) is used to call variants in parallel on genetic data. The results indicate that DeepVCall outperforms other deep learning methods on third-generation PacBio sequencing datasets. It also shows clear advantages in variant calling on second-generation sequencing datasets, providing a fast, general method for both second- and third-generation sequencing datasets. The specific work of this paper is as follows: (1) Design the genetic data encoding algorithm of DeepVCall. Genetic data differs from other data in the diversity of its features. The most important features of the genetic data are selected: bases are encoded with one-hot encoding, and the CIGAR string is used to correct the read sequence information. These features are then converted into tensors that can be fed into a convolutional neural network for training. (2) Design a CNN suited to the genetic data used in DeepVCall, based on the tensors created. A loss function that combines mean squared error and cross-entropy loss is defined to match the characteristics of genetic data. The DeepVCall network structure is designed with hierarchical output, and sequencing quality is used to optimize the parameter update method to speed up DeepVCall training. (3) Use DeepVCall to process the same dataset with three kinds of feature combinations, analyze the results, and choose the best combination. DeepVCall's parameter update method is compared with synchronous and asynchronous update methods; the results show that DeepVCall trains the model faster. The best feature combination is then used for experimental analysis of the CNN designed for genetic features. The variant calling results of DeepVCall are compared with those of DeepVariant, Scotch, and GATK to verify how well DeepVCall performs on third-generation PacBio sequencing datasets. DeepVCall is also evaluated on different datasets, using both single-dataset experiments and cross-dataset experiments. The experimental results show that DeepVCall is more accurate than other deep learning methods on the PacBio dataset and performs comparably to them on the Illumina dataset. The cross-dataset experiments show that DeepVCall generalizes well across PacBio and Illumina datasets. |
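
As an illustration of the encoding step in item (1), the following is a minimal sketch of how a read might be placed on reference coordinates with its CIGAR string and then one-hot encoded. It is not DeepVCall's actual encoding: the function names `one_hot_base` and `encode_read`, the five-channel layout (A, C, G, T, plus a deletion flag), and the fixed reference window are all assumptions made for this example. Only the CIGAR operation semantics follow the SAM format.

```python
import re

BASES = "ACGT"
# CIGAR operations defined by the SAM format.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def one_hot_base(base):
    """Return a 4-element one-hot list for A/C/G/T (all zeros for N or others)."""
    vec = [0.0] * len(BASES)
    idx = BASES.find(base.upper())
    if idx >= 0:
        vec[idx] = 1.0
    return vec


def encode_read(seq, cigar, read_start, window_start, window_len):
    """Encode one aligned read as a window_len x 5 nested list.

    Channels 0-3 hold the one-hot base, channel 4 flags a deletion.
    The channel layout is hypothetical, chosen only for illustration.
    """
    tensor = [[0.0] * (len(BASES) + 1) for _ in range(window_len)]
    ref_pos = read_start   # current position on the reference
    read_pos = 0           # current position in the read sequence
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":            # consumes both read and reference
            for _ in range(length):
                col = ref_pos - window_start
                if 0 <= col < window_len:
                    tensor[col][:4] = one_hot_base(seq[read_pos])
                ref_pos += 1
                read_pos += 1
        elif op == "D":            # deletion: consumes reference only
            for _ in range(length):
                col = ref_pos - window_start
                if 0 <= col < window_len:
                    tensor[col][4] = 1.0
                ref_pos += 1
        elif op == "N":            # skipped region: consumes reference only
            ref_pos += length
        elif op in "IS":           # insertion / soft clip: consumes read only
            read_pos += length
        # H and P consume neither read nor reference.
    return tensor


# Hypothetical read: 2 matches, 1 deleted reference base, 2 matches, 1 soft-clipped base.
example = encode_read("ACGTT", "2M1D2M1S", read_start=100, window_start=100, window_len=6)
```

In this sketch inserted bases are skipped so that every read maps onto the same reference columns. A real encoder would need some way to represent insertions, and the nested list would typically be stacked across reads and converted to an array before being passed to a CNN.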