| Gene sequencing is the primary means by which humans understand genetic information, and it plays an important role in cancer research, genetic disease testing, and the prevention and treatment of infectious diseases. Third-generation sequencing techniques have been widely used in genome and transcriptome research because of their long reads, uniform sequencing regions, low cost, and high throughput. However, an inherent drawback of these techniques is an extremely high error rate, typically 6-15%, which severely affects downstream analyses such as mapping sequencing reads onto a reference genome and assembling gene sequences. Several computational approaches have been developed to reduce the error rate; they are divided into hybrid correction and self-correction strategies, depending on whether additional next-generation sequencing data are used. Self-correction strategies require no additional next-generation sequencing and are therefore easier to use. Although existing self-correction methods rely on different algorithmic principles, essentially all of them correct a base using only the frequency with which bases occur at that single position. They do not use information from the preceding and following sequence context, so they depend heavily on the amount of input data and perform poorly at low sequencing depths. Moreover, most methods cannot handle error correction for large genomes because of their high computational resource requirements. To address these shortcomings, this thesis investigates how deep learning can be applied to the self-correction of third-generation DNA sequencing data. First, datasets for five species (Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, Arabidopsis thaliana, and human) were built by comprehensively collecting data from open-source projects and public databases. Second, the datasets were analyzed for read length and the proportion of each error type, and the error correction task was defined to target two error types, insertions and deletions, on reads shorter than 30,000 bp. Third, because the evaluation criteria for DNA sequencing error correction differ substantially from those of traditional deep learning tasks, an automated method for evaluating error correction results is proposed. The method covers three aspects (error correction performance, resource requirements, and downstream applications), comprises ten evaluation metrics, and is implemented as open-source software named LoRSCA (Long Reads Self-Correction Assess). Next, a sequence encoding method is proposed that integrates three types of information (base sequence, sequencing quality, and alignment quality) and encodes one-dimensional sequence data into two-dimensional images of size 21 × 4 × 3 for feature extraction. Finally, a convolutional neural network based on multi-task learning, named DeepSC, is constructed; it uses multi-branch convolutions and skip-layer connections to solve the self-correction problem for third-generation sequencing. In a comparative evaluation against ten existing error correction algorithms, DeepSC was one of only four algorithms that ran successfully on the human genome, and it led in sensitivity, output depth, and genome coverage on all five species at low sequencing depth. By introducing deep learning methods to third-generation DNA sequencing error correction, this thesis improves the performance of computational error correction on low-depth data and offers a new direction for the design of future error correction algorithms. |
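
As a rough illustration of the sequence encoding described in the abstract, the sketch below turns a 21-base window into a 21 × 4 × 3 array. The abstract gives only the output shape and the three information types, so the layout used here is an assumption rather than the actual DeepSC encoding: 21 is treated as the window length, 4 as one row per nucleotide (A, C, G, T), and 3 as one channel each for base identity, sequencing quality, and alignment quality. The function name `encode_window` and the `max_qual` scaling constant are hypothetical.

```python
BASES = "ACGT"   # assumed meaning of the "4" dimension
WINDOW = 21      # assumed meaning of the "21" dimension


def encode_window(bases, base_quals, map_quals, max_qual=60.0):
    """Encode a 21-base window as a nested list of shape [21][4][3].

    Assumed channel layout (not confirmed by the source):
      channel 0: one-hot base identity
      channel 1: sequencing (Phred) quality scaled to [0, 1]
      channel 2: alignment quality scaled to [0, 1]
    Positions holding a non-ACGT symbol (gap, N) are left as zeros.
    """
    if not (len(bases) == len(base_quals) == len(map_quals) == WINDOW):
        raise ValueError("all inputs must have length %d" % WINDOW)

    # Start from an all-zero image: 21 positions x 4 bases x 3 channels.
    image = [[[0.0, 0.0, 0.0] for _ in BASES] for _ in range(WINDOW)]

    for pos, (base, bq, mq) in enumerate(zip(bases, base_quals, map_quals)):
        row = BASES.find(base.upper())
        if row < 0:
            continue  # gap or ambiguous base: keep the zero vector
        image[pos][row][0] = 1.0
        image[pos][row][1] = min(bq, max_qual) / max_qual
        image[pos][row][2] = min(mq, max_qual) / max_qual
    return image


if __name__ == "__main__":
    window = "ACGTACGTACGTACGTACGTA"          # 21 bases
    img = encode_window(window, [30] * 21, [60] * 21)
    print(len(img), len(img[0]), len(img[0][0]))
```

Stacking the three information types as channels lets a convolutional network see base identity and both quality signals together at each position, which is one plausible reading of how neighbouring sequence context could be exploited.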