
Research On SMRT-seq Quality Control And 6mA Recognition Model At The Single Molecule Level

Posted on: 2022-01-18  Degree: Master  Type: Thesis
Country: China  Candidate: Y X Wang  Full Text: PDF
GTID: 2480306548462714  Subject: Biology
Abstract/Summary:
DNA methylation is an important epigenetic phenomenon. Common forms of DNA methylation include 5-methylcytosine (5mC) and N6-methyldeoxyadenosine (6mA). Several methods can currently detect 6mA in organisms, including DNA 6mA immunoprecipitation sequencing (MeDIP-Seq), PacBio sequencing and Nanopore sequencing. MeDIP-Seq cannot resolve individual 6mA sites, but it can detect 6mA regions in the genome. PacBio sequencing can identify individual 6mA sites by monitoring the fluorescent signal released during base synthesis. However, its data processing uses the mean interpulse duration (IPD), which ignores the local characteristics of individual reads. Based on PacBio sequencing data, this thesis proposes a single-molecule quality control model based on the IPD distribution, designs an IPD comparison and scoring mechanism, performs quality control on reads at the single-molecule level to improve the efficiency of 6mA detection, and constructs a 6mA recognition model using a convolutional neural network (CNN). The main research contents are as follows:

1. The research background and significance, biological function and detection methods of DNA 6mA methylation are briefly introduced. The thesis focuses on the sequencing process and data characteristics of MeDIP-Seq and PacBio SMRT sequencing and discusses the 6mA evaluation criteria of the two methods. This lays the theoretical foundation and provides data support for the subsequent construction of the quality control model and the CNN model.

2. A single-molecule-level IPD quality control model is constructed. A sliding window is used to retrieve the reads and calculate their horizontal and vertical IPD distributions, and a read scoring strategy is designed. Based on an analysis of the read score distribution, nine thresholds from 0.1 to 0.9 are tested on the Chlamydomonas data, using the number of newly added sites within the MeDIP-Seq peak intervals as the evaluation criterion. The experimental results show that the quality control effect is best when the IPD quality control model uses a threshold of 0.4: Chlamydomonas gains 3474 6mA sites within the peak intervals, and the six bacteria gain 88-212 6mA sites within the peak intervals. In addition, a quality control model based on hypothesis testing is constructed. Independent sampling tests on the Chlamydomonas data show that both the hypothesis-testing model and the IPD quality control model with a threshold of 0.4 discriminate well.

3. A 6mA recognition model based on a convolutional neural network is constructed. Two sequence encoding methods, one-hot and i6mA-Pred, are used to extract the contextual features of each site; K-means clustering is used to extract the IPD and pulse width (PW) distribution characteristics of the reads; and these features are combined with a convolutional neural network to build the 6mA recognition model. Experimental results on six bacteria show that the accuracy of 6mA identification can reach 0.84, and 0.98 after adding the IpdRatio feature. The 6mA recognition model is also used to evaluate the efficiency of the IPD quality control model at the single-molecule level. After IPD quality control, the accuracy of the 6mA recognition model improves on eight of the twelve data sets, which again indicates that the IPD quality control model of this thesis can eliminate low-quality reads and improve the efficiency of the 6mA recognition model.
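
As an illustration of the one-hot sequence encoding mentioned in the abstract, the following is a minimal sketch and not the thesis's own implementation. The function names, the A/C/G/T column order, the flank width, and the choice to pad with N and encode unknown bases as all-zero rows are all assumptions made for this example.

```python
from typing import List

BASES = "ACGT"  # column order of the encoding (an assumption for this sketch)


def site_context(genome: str, position: int, flank: int) -> str:
    """Return the bases within `flank` positions of a 0-based site.

    Windows that run past either end of the sequence are padded with N,
    so the result always has length 2 * flank + 1.
    """
    if not 0 <= position < len(genome):
        raise IndexError("position is outside the sequence")
    left = max(0, position - flank)
    right = min(len(genome), position + flank + 1)
    pad_left = "N" * (flank - (position - left))
    pad_right = "N" * (flank - (right - 1 - position))
    return pad_left + genome[left:right] + pad_right


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode a DNA string as a len(sequence) x 4 matrix of 0/1 values.

    Any symbol outside A/C/G/T (for example N) becomes an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in sequence.upper():
        row = [0] * len(BASES)
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # Hypothetical candidate adenine at position 4 of a toy sequence.
    window = site_context("ACGTACGT", position=4, flank=2)
    for base, row in zip(window, one_hot_encode(window)):
        print(base, row)
```

A matrix of this shape (window length by four channels) is the kind of input a one-dimensional convolutional layer typically consumes; how the thesis combines it with the IPD and PW features is not specified in the abstract.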
Keywords/Search Tags: Deep learning, PacBio SMRT, N6-methyladenine, IPD quality control