Identifying Origin Of Replication Based On Machine Learning

Posted on: 2021-04-19 | Degree: Master | Type: Thesis
Country: China | Candidate: L Liu | Full Text: PDF
GTID: 2370330614454481 | Subject: Applied statistics
Abstract/Summary:
As society becomes increasingly information-driven, many fields are integrating science and technology more closely. Bioinformatics, which draws on knowledge from multiple disciplines, emerged in this context, and it is no longer limited to solving problems through traditional biological experiments alone. The Human Genome Project spurred the rapid development of gene sequencing projects, and in the post-genomic era the volume of data carrying genetic information has exploded. This wealth of data has driven rapid progress in many areas of biology, including genomics, proteomics, disease research, and precision medicine. Binary and multi-class classification problems arise frequently in these fields, for example non-coding RNA recognition, protein homology detection, and site recognition. Recognition of DNA replication initiation sites (origins of replication) is one such site recognition problem.
This thesis first reviews the theory of bioinformatics and machine learning and then formulates a research approach based on the research task. The empirical analysis uses the genome obtained from the International Yeast Biodatabase as the initial data set and applies feature extraction methods such as k-mer (k-ary) nucleotide frequency, pseudo-nucleotide components, one-hot encoding, and word vectors for training. A new method is presented that fuses the k-mer frequency features of DNA sequences with the physicochemical properties of type-2 ternary pseudo-nucleotide components. The method first optimizes and selects the nucleotide frequency features and then combines them with improved pseudo-nucleotide components in a second feature extraction step, in which all ternary pseudo-nucleotide physicochemical properties are used. PCA is then applied to reduce the dimensionality of the feature set, the resulting data set is modeled, and the prediction accuracy of the classification model is computed under 5-fold cross-validation, yielding an SVM-based prediction model for yeast DNA replication initiation sites. The prediction model reaches an accuracy (Acc) of 88.05% on yeast genome replication initiation sites, which supports the usefulness of the model in comparison with existing algorithms.
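The k-mer nucleotide frequency feature mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function name `kmer_frequencies`, the choice of k, the handling of ambiguous bases, and the toy sequence are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(sequence, k=3, alphabet="ACGT"):
    """Return normalized frequencies of all possible k-mers, in lexicographic order.

    Hypothetical helper for illustration; the thesis abstract does not
    specify how its k-mer features were computed.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    # Slide a window of width k along the sequence, one base at a time.
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Toy sequence, not taken from the thesis data set.
features = kmer_frequencies("ACGTACGTAC", k=2)
print(len(features))  # one entry per possible 2-mer
```

Each sequence maps to a fixed-length vector with 4^k entries. In a fuller pipeline such vectors could, for instance, be passed to a PCA step and an SVM classifier evaluated with 5-fold cross-validation (scikit-learn's `PCA`, `SVC`, and `cross_val_score` are one option). That library choice is an assumption here, since the abstract does not name one.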
Keywords/Search Tags: Origin of replication, k-ary nucleotide frequency, Pseudo-nucleotide components, SVM