
Research On The Detection And Classification Method Of Cancer DNA Sequence Variation Based On Data Mining

Posted on: 2022-05-03
Degree: Doctor
Type: Dissertation
Country: China
Candidate: G B Chen
GTID: 1484306575462514
Subject: Computer Science and Technology
Abstract/Summary:
DNA sequencing accelerates research and development in biology and medicine, and is an important route to precision medicine and gene-based drug development. The purpose of DNA sequencing is to discover variations in the sequenced reads and to explore the correlation between gene variations and diseases. For example, establishing the correlation between variations in DNA sequencing data and cancer has become an important technical means of detecting and predicting cancer, and it can effectively guide clinical treatment. The detection of DNA sequence variation therefore has important scientific significance and application value, providing new options for the scientific analysis of diseases and the discovery of new treatment schemes. However, related research is still at the development stage, and how gene variation causes disease still needs further exploration.

Building on an analysis of the state of research on gene variation in China and abroad, this dissertation studies the application of data mining to DNA sequence variation detection. It focuses on the design of targeted capture probes for DNA sequences, the detection of sequence variation in targeted sequencing, variation detection with a PCR primer matching algorithm, and the analysis of gene chip expression profiles and key genes. The dissertation mines the variations in human DNA sequences to establish the correlation between diseases and cancer gene variations. Its main contributions are as follows:

(1) To address the problems of probe specificity and the difficulty of determining the Tm value and the optimal position in probe design, an optimal position matching algorithm is proposed to design probe sequences that meet the requirements. The algorithm determines sequence specificity and uses GC content and its distribution to select the optimal sequence that satisfies the Tm requirement, so that the optimal probe sequence can be identified automatically across the whole DNA sequence. Verification on probe design for BRCA1 exons shows that probe sequences meeting the requirements can be matched quickly.

(2) To detect SNP and InDel mutations in targeted sequencing data, a DNA sequence matching algorithm based on position index relations is proposed. It establishes the position index relations of DNA sequences and uses them to analyze SNP and InDel mutations. Firstly, each sub-sequence is divided into k fixed sequences and links between them are established. Secondly, the position differences in the optimal link are analyzed, and a model for determining position variation is established. Finally, the method is evaluated on a targeted sequencing target region that covers the whole coding region, the exon-intron junction regions (20-50 bp) and part of the intron regions of the BRCA1/2 genes, totaling 703 exon regions. Taking actual capture data from a 101.3 k region as an example, the experimental results show that the position-index method detects more mutation points than BCFtools, FreeBayes, VarScan2 and GATK.

(3) In targeted sequencing based on specific-primer amplicon technology, DNA sequence alignment may produce mismatches or miss mutation points. An algorithm that matches PCR primer sequences to the target sequence is proposed. Firstly, the sequencing reads are sorted and identical reads are counted, which reduces the number of matches required. Secondly, the reads are matched against the PCR primer sequences to quickly match all sequences in the target region, and a local optimal algorithm accurately detects the variations in the target sequence. Finally, compared with the traditional sequence alignment method, the experimental results show that the PCR primer sequence matching method matches more sequences, finds more variations, and achieves a better recall rate.

(4) To address the large number of genes, the poor typing effect and the lack of consideration of gene correlation when selecting key genes from gene expression data, a gene chip classification algorithm based on SVM-RFE and a PageRank-based key gene screening method are proposed. Firstly, the logFC and p-value entries of the gene expression matrix are combined with the SVM-RFE algorithm to screen differential genes, and the base classifiers SVM and KNN are tested to obtain the optimal parameters. Secondly, the convergence of the PageRank algorithm is proved, and the importance of each gene node is analyzed on a complex network to determine whether the gene is a key gene. Finally, the experimental results show that SVM-RFE-SVM has the best genotyping performance and can be used as a gene chip classification algorithm to analyze gene characteristics. In addition, PR values can be calculated under different complex gene networks to screen out different types of regulatory genes, and key genes can be determined by combining the sequencing of several genes.
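The PageRank-based screening in contribution (4) can be illustrated with a minimal power-iteration sketch. This is not the dissertation's implementation: the damping factor 0.85, the convergence tolerance, the function name `pagerank` and the toy gene network (`G1` to `G4`) are all assumptions chosen for illustration only.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=1000):
    """Power-iteration PageRank over a directed graph given as (src, dst) pairs.

    damping, tol and max_iter are illustrative defaults, not values from the source.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from a uniform distribution

    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])

        # Stop once the L1 change between iterations is below the tolerance.
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy network: an edge (a, b) means "gene a points to gene b".
toy_edges = [("G2", "G1"), ("G3", "G1"), ("G4", "G1"), ("G1", "G2")]
scores = pagerank(toy_edges)
top_gene = max(scores, key=scores.get)
```

In this toy network `G1` receives links from every other node, so it ends up with the highest PR value. A screening step along the lines described above would then rank genes by PR value and keep the top-scoring ones as key-gene candidates.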
Keywords/Search Tags:DNA sequence variation, gene chip, targeted sequencing, SNP, InDel