
Research On DNA 6mA Detection Based On SMRT-seq And Machine Learning

Posted on: 2021-07-25
Degree: Master
Type: Thesis
Country: China
Candidate: X N Xu
Full Text: PDF
GTID: 2480306311984239
Subject: Biology
Abstract/Summary:
DNA N6-methyladenine (DNA 6mA) is the most common DNA methylation modification in prokaryotes. It mainly functions in the restriction-modification system, where it helps destroy invading DNA. As sequencing technology has developed, abundant DNA 6mA has been detected in a range of eukaryotic genomes, but obtaining high-quality 6mA calls in eukaryotes remains a serious challenge. At present, third-generation single-molecule real-time sequencing (SMRT-seq) can detect 6mA at single-nucleotide resolution; it has been used to decode the epigenomes of more than 2,000 bacteria and has revealed gene-regulatory functions of DNA 6mA in bacteria. However, because the signal is affected by other types of DNA modification on adjacent bases, SMRT-seq produces false-positive 6mA calls. In bacteria, where 6mA is highly abundant, these false positives have little effect on downstream analysis. In eukaryotes, the 6mA content is many orders of magnitude lower than in bacteria, so targeted techniques are needed to reduce the false-positive rate (FPR). Fang Gang et al. compared native samples with whole-genome-amplified samples, which effectively reduced erroneous 6mA sites, but sequencing the amplified control doubles the sequencing cost.

This thesis proposes a DNA 6mA prediction method based on SMRT-seq and a support vector machine (SVM) for detecting and identifying DNA 6mA. The effects of feature construction, classification algorithm and detection principle on the method's performance are analyzed systematically through extensive experiments. The main research contents are as follows:

1. The research background, significance and function of DNA 6mA are briefly introduced, and the strengths of existing 6mA detection methods are compared, providing a theoretical basis for the rest of the thesis.

2. Two types of SMRT-seq feature extraction method are designed. Following the principle of Pacific Biosciences third-generation sequencing, the interpulse duration (IPD) was selected as the indicator for evaluating 6mA. On the one hand, the output of the SMRT-seq analysis pipeline was integrated with sequence-context information, and the comprehensive site features of SMRT-seq were obtained by recursive feature elimination. On the other hand, starting from the raw sequencing data, the IPD values at all positions were collated and reduced in dimension by recursive feature elimination to obtain the single-molecule site features of SMRT-seq.

3. Six machine learning algorithms are introduced and compared in detail. Six classification algorithms, including logistic regression (LR), linear discriminant analysis (LDA) and support vector machine (SVM), are compared on the Chlamydomonas dataset. On the comprehensive SMRT-seq features, LR, K-nearest neighbors (KNN) and classification and regression trees (CART) perform better, with an accuracy of about 97%, while Naive Bayes (NB) has a lower false-positive rate, 6.4%. LDA has a recall of 84.4%, which is not sensitive enough; the recall of the other algorithms is between 85% and 90%. On the single-molecule SMRT-seq features, SVM performs better, with an accuracy of 71% and a recall of up to 99%.

4. A DNA 6mA detection method based on SMRT-seq and SVM is proposed. Starting from Pacific Biosciences sequencing data and integrating sequence-context information, the comprehensive and single-molecule site features of SMRT-seq were obtained by recursive feature elimination and combined with SVM to build a DNA 6mA detection model, which was applied to Chlamydomonas and six bacteria. In Chlamydomonas, more than 95% of the detected 6mA sites were located in the motif and in the 6mA peak regions detected by MeDIP-seq. In the six bacteria, the proportion of detected 6mA sites falling within MeDIP-seq 6mA peak regions increased by 2% to 70% compared with SMRT-seq alone. These results indicate that the proposed detection method based on SMRT-seq and SVM improves the accuracy of DNA 6mA detection and effectively reduces false-positive 6mA calls.
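
The accuracy, recall and false-positive figures quoted in the abstract can all be derived from a confusion matrix. The following is a minimal sketch assuming the standard definitions, which the abstract does not spell out; the function names and the toy labels are hypothetical and are not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # true 6mA sites predicted as 6mA
    fp: int  # unmodified adenines predicted as 6mA
    tn: int  # unmodified adenines predicted as unmodified
    fn: int  # true 6mA sites predicted as unmodified


def confusion(y_true, y_pred):
    """Count outcomes for binary labels (1 = 6mA site, 0 = unmodified adenine)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t not in (0, 1) or p not in (0, 1):
            raise ValueError("labels must be 0 or 1")
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 0 and p == 0:
            tn += 1
        else:  # t == 1 and p == 0
            fn += 1
    return Confusion(tp, fp, tn, fn)


def accuracy(c):
    """Fraction of all sites classified correctly."""
    total = c.tp + c.fp + c.tn + c.fn
    return (c.tp + c.tn) / total if total else 0.0


def recall(c):
    """Fraction of true 6mA sites that were recovered (sensitivity)."""
    positives = c.tp + c.fn
    return c.tp / positives if positives else 0.0


def false_positive_rate(c):
    """Fraction of unmodified adenines wrongly called 6mA."""
    negatives = c.fp + c.tn
    return c.fp / negatives if negatives else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only (assumption, not thesis data).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    c = confusion(y_true, y_pred)
    print(accuracy(c), recall(c), false_positive_rate(c))
```

Keeping recall and false-positive rate separate matters here because the abstract reports cases where they diverge: a classifier can recover nearly all true sites while its overall accuracy stays modest.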
Keywords/Search Tags: DNA 6mA, SMRT-seq, Machine Learning, SVM, Feature selection