Identifying Origin Of Replication Based On Machine Learning

Posted on:2021-04-19

Degree:Master

Type:Thesis

Country:China

Candidate:L Liu

GTID:2370330614454481

Subject:Applied statistics

Abstract/Summary:

With the increasing informatization of society, many fields continue to promote the integration of science and technology. Bioinformatics, which integrates knowledge from multiple disciplines, emerged in this process, and it is no longer limited to solving problems with traditional biological experiments alone. The Human Genome Project has led to the rapid development of gene sequencing projects. In the post-genomic era of bioinformatics, the volume of data carrying genetic information has exploded. These huge data sets have driven the rapid development of many fields of biology, such as genomics, proteomics, disease research and precision medicine. Binary and multi-class classification problems are frequently encountered in these fields, for example non-coding RNA recognition, protein homology detection and site recognition. Recognition of DNA replication initiation sites is one kind of site recognition. In this thesis, we first introduce the theory of bioinformatics and machine learning, and then formulate research ideas based on the research tasks. In the empirical analysis, the genome obtained from the International Yeast Biodatabase is used as the initial data set, and feature extraction methods such as k-ary nucleotide frequency (k-mer frequency), pseudo-nucleotide components, one-hot coding and word vectors are used to train models. A new method is presented that fuses the k-mer frequency features of DNA sequences with the physicochemical properties of the type 2 pseudo-trinucleotide composition. The method first optimizes and selects the nucleotide frequency features, and then combines them with the improved pseudo-nucleotide components in a second feature extraction step, in which all trinucleotide physicochemical properties are used. PCA is then used to reduce the dimensionality of the feature set, the resulting data set is modeled, and the prediction accuracy of the classification model is calculated under 5-fold cross-validation, finally yielding a prediction model for yeast DNA replication initiation sites based on the SVM algorithm. The results show that the accuracy (Acc) of the prediction model for yeast genome replication initiation sites reaches 88.05%, which demonstrates the usability of the model in comparison with existing algorithms.
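As an illustration of two of the feature extraction methods named in the abstract, the following is a minimal sketch of k-mer frequency and one-hot encoding for a DNA sequence. It is not the thesis's implementation: the function names `kmer_frequencies` and `one_hot`, the lexicographic ordering of k-mers, and the handling of ambiguous bases are all assumptions made for this example.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet; other symbols are treated as ambiguous


def kmer_frequencies(seq, k):
    """Return normalized k-mer frequencies of seq as a list of 4**k floats.

    K-mers are ordered lexicographically over ACGT (an assumption of this sketch).
    Windows containing a base outside ACGT (for example N) are skipped.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    if total == 0:
        # Sequence shorter than k, or no valid window: return an all-zero vector.
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def one_hot(seq):
    """Encode each base as a 4-element indicator vector in A, C, G, T order.

    Bases outside ACGT are encoded as all zeros (an assumption of this sketch).
    """
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [list(table.get(base, [0, 0, 0, 0])) for base in seq.upper()]


if __name__ == "__main__":
    example = "ACGTACGT"  # toy sequence, not data from the thesis
    features = kmer_frequencies(example, 2)
    print(len(features))          # number of 2-mer features
    print(one_hot(example[:2]))   # encoding of the first two bases
```

Feature vectors of this kind would then be the input to the dimensionality reduction and classification steps the abstract describes (PCA followed by an SVM evaluated under 5-fold cross-validation).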

Keywords/Search Tags:

Origin of replication, k-ary nucleotide frequency, Pseudo-nucleotide components, SVM

Related items

1	Research On Identification Of DNA Replication Origins Based On Sequence Information
2	Theoretical Prediction Of Nucleosome Position And Online Software Development
3	Genetic Correlation Of LPA Gene Single Nucleotide Polymorphism Rs10455872 And Rs3798220
4	The Application Of Arabidopsis UDP-sugar Pyrophosphorylase (AtUSP) In The Synthesis Of Sugar Nucleotide
5	Nucleotide-mediated Synthesis Of Pt Nanoclusters And Their Application In Biological Detection
6	Genomic Evolutionary Analysis With Distance Of Nucleotide
7	Single Nucleotide Polymorphism Of Nucleic Acid Detection Technology In The Test Strip
8	The Study Of Strand Compositional Bias And Its Underlying Mechanism In Bacterial Genomes
9	The Preparation And Separation Of Single Nucleotide
10	Development Of A Novel Nucleic Acid Detection Method Based On ATP-Releasing Nucleotide Probe