Correction Of Nanopore Errors Based On Raw Current Of Nanopore Sequencing Reads And Transformer Model

Posted on: 2024-03-31
Degree: Master
Type: Thesis
Country: China
Candidate: W J Yi
Full Text: PDF
GTID: 2530307115996989
Subject: Biology
Abstract/Summary:
Nanopore sequencing is a single-molecule sequencing technology based on nanopores. It offers long reads and real-time sequencing, and requires no PCR. However, its error rate is high, which directly affects downstream analysis of the sequencing data. Basecalling and error correction for Nanopore sequencing are therefore a research hotspot in third-generation sequencing. Several basecalling and error correction models for Nanopore sequencing have been proposed. They usually process the current signal directly with connectionist temporal classification (CTC), combined with a neural network model, to correct Nanopore sequencing errors. However, their strategies for segmenting, extracting, and correcting the raw current signal still leave much room for improvement. To address these problems, this thesis proposes a Nanopore error correction model based on the raw current of sequencing reads and the Transformer model. It designs a unique molecular identifier (UMI) to match second-generation and third-generation sequencing data accurately, and constructs an extraction algorithm for the raw current of each 5-mer. Combined with the Transformer deep learning model, this enables 5-mer-based error correction of Nanopore sequencing data and improves data quality. The main contents are as follows:

1. The development history of sequencing technology is introduced, and the applications of Nanopore sequencing, the causes of its errors, the development of error correction methods, and the application of deep learning in biology are summarized. Existing error correction models for Nanopore sequencing are discussed in detail and their advantages and disadvantages are analyzed, which lays the theoretical foundation for this study.

2. A raw current extraction algorithm for Nanopore sequencing data is proposed. Following the Nanopore sequencing data format, this thesis uses h5py to extract the raw current and sequence information from the original files, so that the raw current signal is extracted at the single-molecule level. Minimap2 is then used to align the third-generation Nanopore reads to the second-generation sequencing data, and the second-generation and third-generation reads in the SAM file are matched precisely by UMI to locate the errors in the third-generation data. The extraction algorithm obtains all current signals of each raw read and matches bases to current signals precisely. This preserves the integrity and accuracy of the raw sequencing data and provides more comprehensive information for subsequent modeling.

3. A Nanopore error correction model based on the raw current of reads and the Transformer model is proposed. Starting from the raw current signal of each read, this thesis optimizes the filtering thresholds for different 5-mers, designs a hierarchical processing strategy to extract comprehensive features of each site, builds a Transformer-based error correction model for each 5-mer, and applies consensus processing to make the correction results more reliable. Experiments on human genome Nanopore sequencing data show that classification accuracy differs between 5-mer models: for example, 0.951 for the ACGCG model, 0.932 for the CGCAG model, and 0.949 for the CGCGA model. Compared with existing deep learning models, the Transformer model is 1%-8% more accurate than LSTM and 1%-10% more accurate than CNN. The thesis further analyzes how current signal standardization, base sequence information, the number of attention heads (Multi-head), and the number of epochs affect model performance. Current signal standardization improves model accuracy by 2%-5%, and base sequence information improves it by 1%-9%. The Transformer model performs better with Multi-head = 16 and Epoch = 30. Performance increases with the window size and the number of current signals, and is best with a 9-mer window and 60 current signals. The thesis also examines the relationship between the threshold of the prediction models and the error correction effect, and finds that correction is best at a model threshold of 0.4. When the coverage of predicted sequences exceeds 1/2 of the total coverage, consensus processing improves model accuracy and corrects 25% of the existing erroneous bases, an improvement over Guppy-4.5.
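The consensus step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so the sketch assumes one possible reading: per-read predictions with probability below the 0.4 threshold are discarded, a site is called only when the remaining predictions cover more than 1/2 of the reads at that site, and the call is a simple majority vote. The function name `consensus_base`, its parameters, and the tie handling are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# Defaults reuse the values quoted in the abstract; how the thesis
# actually applies them is an assumption of this sketch.
PROB_THRESHOLD = 0.4         # minimum model probability for a prediction to count
MIN_COVERAGE_FRACTION = 0.5  # predictions must cover more than half of the reads


def consensus_base(
    predictions: Sequence[Tuple[str, float]],
    total_coverage: int,
    prob_threshold: float = PROB_THRESHOLD,
    min_fraction: float = MIN_COVERAGE_FRACTION,
) -> Optional[str]:
    """Return a consensus base for one site, or None if no call is made.

    predictions: one (predicted_base, probability) pair per read for which
        the model produced a prediction at this site.
    total_coverage: number of reads covering the site.
    """
    if total_coverage <= 0:
        return None
    # Keep only predictions whose probability reaches the threshold.
    confident = [base for base, prob in predictions if prob >= prob_threshold]
    # Require confident predictions on more than the given fraction of reads.
    if len(confident) <= min_fraction * total_coverage:
        return None
    # Majority vote; a tie between the top two bases leaves the site uncalled.
    top = Counter(confident).most_common(2)
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None
    return top[0][0]
```

For instance, `consensus_base([("A", 0.9), ("A", 0.6), ("C", 0.5), ("A", 0.3)], total_coverage=4)` keeps three confident predictions out of four reads and votes for `"A"`, while a site where only one of four reads has a confident prediction is left uncorrected.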
Keywords/Search Tags:Transformer model, Nanopore sequencing, Current signal extraction, Consensus