Analysis And Recognition Of Replication Initiation Sites In Eukaryotic Genome Based On Machine Learning Algorithms

Posted on: 2024-08-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Y Dao
Full Text: PDF
GTID: 1520307079452374
Subject: Biomedical engineering
Abstract/Summary:
DNA replication is the process by which a parental DNA molecule produces two daughter DNA molecules according to the principle of complementary base pairing during the S phase (DNA synthesis) of the cell cycle. It is one of the most basic processes that occur within a cell and underlies cellular life, ensuring the accurate transmission of genetic material to daughter cells at each cell division. DNA replication therefore plays an important role in maintaining the stability of a species' genetic information. Eukaryotes, with their large genomes, require multiple origins of DNA replication (ORIs) to cope with more complex cell cycles and regulatory mechanisms and to complete the transmission of genetic information. Precisely coordinating the synthesis of large amounts of DNA is thus critical to guaranteeing the fidelity of genetic information before cell division. The first step in studying the mechanism of replication initiation is to identify ORIs in the genome with high precision and at large scale. Previous ORI identification models have limitations in terms of species coverage, feature validity, and disease association analysis. The focus of this dissertation is therefore to analyze and identify ORIs across multiple species, from multiple angles, and at multiple scales. DNA replication is affected by DNA sequence, genome epigenetic marks, three-dimensional chromatin structure, and replication timing. This dissertation aims to associate DNA replication with epigenetic marks and three-dimensional chromatin structure, to investigate in depth the regulatory mechanism of replication initiation, and to build a regulatory recognition model that reflects the principles governing replication. It then explores the relationship between DNA replication initiation and disease from the time dimension and systematically analyzes the relationship between replication timing and cancer occurrence. The specific research content of the
dissertation is as follows:

(1) To address the species limitations of existing ORI recognition models, this dissertation analyzed the sequence conservation patterns of ORIs in different eukaryotic species. It then developed the first multi-species eukaryotic ORI prediction platform based on DNA sequence information, called iORI-Euk (http://lin-group.cn/server/iORI-Euk/), which addresses the limited availability of ORI prediction software in practice. The integrated bioinformatic predictor was constructed in the following steps. First, ORI DNA sequences from multiple eukaryotic genomes were collected to construct benchmark datasets, and each DNA sequence was converted into feature vectors by k-mer and single-nucleotide binary encoding. Then, the feature sets were combined and features were selected using the F-score to generate an optimal feature set for each species or cell line. Next, models were trained with a support vector machine (SVM). The results of 5-fold cross-validation show that the prediction accuracy of the iORI-Euk models reaches 80%~94%, which demonstrates the robustness of the models. In addition, a comparative analysis shows that iORI-Euk improves accuracy by 4%~18% compared with published tools. Based on the proposed models, an efficient web-server platform, iORI-Euk, was built, through which users can easily obtain information on potential ORIs across the whole genome and which meets users' needs for multiple species.

(2) The triggering of replication initiation is regulated not only by DNA sequence information but also by a complex epigenetic regulatory mechanism, which requires the combined action of epigenetic marks. DNA sequence information alone may therefore not be sufficient to locate ORIs accurately. For this reason, this dissertation analyzed the relationship between epigenetic marks and ORIs in the human genome and explored the ability of epigenetic information to identify ORIs. The results of
the above analysis show that ORIs are highly coupled with active epigenetic marks and chromatin accessibility, that epigenetic marks are enriched to significantly different degrees in ORI regions and their flanking regions, and that ORI regions are enriched in transcription factor DNA motifs with high GC content. These findings suggest that epigenetic mark information has the potential to predict ORIs. Feature encoding algorithms were therefore designed to describe ORIs in terms of epigenetic marks and transcription factor DNA motifs. On the test set, the models trained by Random Forest on the epigenetic mark and transcription factor DNA motif feature sets yield prediction accuracies of 0.9033 and 0.9042, respectively, which demonstrates the effectiveness of epigenetic information in identifying ORIs.

(3) DNA replication initiation is also related to three-dimensional chromatin structure. Generally, ORIs that are clustered together within chromatin topological domains have a higher probability of initiation. On this basis, three-dimensional chromatin conformation data were used to systematically explore the impact of three-dimensional chromatin interactions on DNA replication initiation in the human genome. The research strategy is as follows. First, three-dimensional chromatin interaction information was used alone to identify ORIs. The results show that this information can effectively identify ORIs (AUC=0.8488). To obtain a more accurate prediction model, a feature fusion strategy was used to build a multi-modal feature set comprising three parts: chromatin interaction features, epigenetic marks, and DNA motifs. The recursive feature elimination (RFE) technique was then applied to generate an optimal feature subset. The results show that the multi-modal feature fusion strategy significantly improves the predictive performance of the model (AUC=0.9627). In addition, the feature
selection method effectively removes a large number of redundant features and further improves the robustness of the model (AUC=0.9638). Finally, this work demonstrates that multi-modal epigenetic information can accurately identify ORIs in the human genome, explains the importance of three-dimensional chromatin structure for ORI identification, and provides an example of identifying eukaryotic ORIs from multi-modal features. The multi-modal framework for identifying ORIs in eukaryotic genomes is freely available on GitHub (https://github.com/linDing-group/iORI-Epi).

(4) The first three studies in this dissertation overcome the limitations of previous ORI research through multi-species and multi-modal features, but they do not explore the mechanism of replication initiation from the perspective of time, nor do they include association analysis with diseases. Previous studies have shown that disordered replication initiation timing can seriously affect DNA replication, leading to mistransmission of genetic information, and have confirmed that replication timing is related to poor prognosis of childhood leukemia and progeria. Replication timing provides a new molecular pathway and biomarker to assist in the early diagnosis of diseases and the precise prediction of therapeutic effects. The properties of replication timing in the human genome were therefore analyzed in this part. The findings reveal that replication timing is conserved within the same cell type and specific across different cell types, suggesting that replication timing carries information that could help define cellular identity. In addition, based on the high correlation between replication timing and chromatin conformation Hi-C maps, a machine learning method was used to construct a chromatin interaction recognition model. The model can predict key chromatin interactions in cancer from replication timing information and can distinguish cancer samples from normal samples with high accuracy.

In summary, this dissertation focuses
on bioinformatics research into the ORIs of eukaryotic genomes. It systematically analyzes the regulatory relationship among DNA replication, epigenetic marks, and three-dimensional chromatin structure; explores the effectiveness of various machine learning algorithms and feature encoding strategies for constructing a regulatory identification model that reflects the principles governing replication; and systematically characterizes the relationship between replication timing and cancer occurrence. The complete experimental framework of this dissertation is thus expected to provide valuable guidance for research on replication regulation mechanisms.
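
The sequence-based pipeline in part (1) of the abstract (k-mer encoding, single-nucleotide binary encoding, then F-score feature ranking) can be illustrated with a minimal sketch. This is not the dissertation's code: the function names, the toy sequences, and the choice of the commonly used two-class F-score definition (squared deviations of class means from the overall mean, divided by the sum of within-class sample variances) are all assumptions made for illustration.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_frequencies(seq, k):
    """Normalized counts of all 4**k k-mers, in lexicographic order."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        word = seq[i:i + k]
        if word in counts:  # windows containing N or other symbols are skipped
            counts[word] += 1
    total = max(windows, 1)
    return [counts[w] / total for w in kmers]

def binary_encode(seq):
    """One-hot encode each nucleotide as 4 bits in the order A, C, G, T."""
    vec = []
    for base in seq:
        vec.extend(1 if base == a else 0 for a in ALPHABET)
    return vec

def f_scores(pos, neg):
    """Two-class F-score of every feature column.

    pos and neg are lists of equal-length feature vectors, each class
    holding at least two samples (sample variance needs n - 1 > 0).
    """
    def mean(xs):
        return sum(xs) / len(xs)

    def sample_var(xs, m):
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    scores = []
    for j in range(len(pos[0])):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        mp, mn, ma = mean(p), mean(n), mean(p + n)
        num = (mp - ma) ** 2 + (mn - ma) ** 2
        denom = sample_var(p, mp) + sample_var(n, mn)
        if denom > 0:
            scores.append(num / denom)
        else:
            # No within-class spread: infinite if the class means differ.
            scores.append(float("inf") if num > 0 else 0.0)
    return scores

# Toy, made-up sequences (hypothetical; not from any benchmark dataset).
ori_like = ["ACGCGCGC", "GCGCGCGA", "CGCGGCGC"]
background = ["ATATATAT", "TATATATA", "AATTATAT"]

def encode(seq):
    # Feature set combination: 2-mer frequencies followed by one-hot bits.
    return kmer_frequencies(seq, 2) + binary_encode(seq)

scores = f_scores([encode(s) for s in ori_like],
                  [encode(s) for s in background])
# Feature indices sorted from most to least discriminative.
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In a workflow like the one described, the top-ranked features would then be passed to a classifier such as an SVM and evaluated by cross-validation; that step is omitted here to keep the sketch dependency-free.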
Keywords/Search Tags: Bioinformatics, DNA replication, Epigenetics, 3D chromatin interaction