Research on Identification of DNA Replication Origins Based on Sequence Information

Posted on: 2019-06-10
Degree: Master
Type: Thesis
Country: China
Candidate: F Weng
GTID: 2370330590973917
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of genomics and the continuous improvement of modern sequencing technology, biological data of many kinds have grown explosively. The growth of the massive genetic data produced by genome projects is the most pronounced. However, because the composition of gene sequences is complex, determining their structure and function through traditional biological experiments is complicated and costly. With the rapid development of bioinformatics, predicting the structure and function of gene sequences by computational methods has become one of the most important research topics in the field. Gene sequences are the carriers of genetic material, and a DNA sequence with a specific function usually exhibits a specific nucleotide order and structural composition. DNA replication is semi-conservative and is usually initiated at a specific region, called the DNA replication origin. Accurate identification of DNA replication origins is a necessary prerequisite for further research on and understanding of DNA replication mechanisms. In this thesis, DNA sequence information, physicochemical properties, and the long-range and short-range interactions of sequences are combined with discriminative machine learning classification to explore the prediction of DNA replication origins in depth. The specific research contents are as follows.

First, a standard data set based on complete DNA replication origin sequences was constructed. A DNA replication origin is structurally complex and is usually a DNA fragment of variable length. The data sets used by existing discriminative methods for predicting DNA replication origins all consist of fixed-length DNA sub-fragments taken from complete origins. However, sequencing experiments have shown that a complete DNA replication origin has a specific nucleotide order and a nucleotide composition bias, so cutting out a sub-fragment loses some of its typical characteristics. To address this, a standard data set of complete DNA replication origin sequences for four species was constructed from the nucleic acid sequence database GenBank and the eukaryotic DNA replication origin database DeOri 6.0, and the software CD-HIT was used to remove redundant DNA sequences and obtain the final standard data set.

Second, a three-window-based pseudo k-tuple nucleotide composition method (iRO-3wPseKNC) was proposed. A complete DNA replication origin has a specific nucleotide order and a non-uniform nucleotide distribution, and its leading and lagging strands show an asymmetric composition bias in guanine (G) and cytosine (C), that is, GC asymmetry. iRO-3wPseKNC therefore divides a complete DNA sequence into three sub-windows whose proportions are optimized, extracts features from each local window with the pseudo k-tuple nucleotide composition method (PseKNC) so that the typical features of different regions of the sequence are captured, and uses a random forest algorithm to construct the classifier.

Third, a pseudo k-tuple GC composition method (iRO-PseKGCC) that directly describes the non-uniformity of the base distribution was proposed. The window-based pseudo k-tuple nucleotide composition method uses only the composition and physicochemical property information of the sequence during feature extraction, and it distinguishes regional features only by dividing the sequence into windows, without directly including the uneven base distribution as a feature. Building on this, the thesis further proposes the k-tuple GC composition idea (k-GCC) and integrates the GC Skew value, which directly describes the degree of GC bias, into the PseKNC framework to obtain the calculation form of the improved method iRO-PseKGCC. The improved method performs significantly better than iRO-3wPseKNC.

Fourth, a calculation of the GC Skew value within the pseudo k-tuple GC composition method was proposed, in which the value is computed from local sequence information composed of consecutive k-tuple GC compositions separated by ?. To distinguish the differences in base composition bias between the sequences of different species, the thesis proposes a pseudo k-tuple GC composition method based on a fixed-length window and a pseudo k-tuple GC composition method based on accumulating k. The two methods compute the local GC Skew at the level of subsequence length and at the level of k-tuple GC compositions of different dimensions, respectively. Both were studied in the classification of data sets from different species; while capturing the characteristics the species have in common, they produce different results on different data sets.
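The window-based idea described in the abstract can be illustrated with a short Python sketch. It is not the thesis's implementation: the function names are hypothetical, the equal window proportions are an assumption (the abstract says the proportions are optimized but does not give them), and plain k-tuple frequencies stand in for the full PseKNC features, which also include physicochemical correlation terms. The GC Skew used here is the common definition (G - C) / (G + C).

```python
from collections import Counter
from itertools import product


def gc_skew(seq):
    """Return (G - C) / (G + C), or 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def split_windows(seq, proportions=(1, 1, 1)):
    """Split seq into consecutive windows sized by the given ratios.

    The default equal ratios are an assumption for illustration only.
    """
    total = sum(proportions)
    bounds, acc = [0], 0
    for p in proportions:
        acc += p
        bounds.append(round(len(seq) * acc / total))
    return [seq[bounds[i]:bounds[i + 1]] for i in range(len(proportions))]


def ktuple_frequencies(seq, k=2):
    """Normalized frequency of every k-tuple over ACGT, in lexicographic order.

    Tuples containing other symbols (such as N) are ignored in the output.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    n = max(len(seq) - k + 1, 1)  # avoid division by zero for short windows
    return [counts["".join(t)] / n for t in product("ACGT", repeat=k)]


def window_features(seq, proportions=(1, 1, 1), k=2):
    """Concatenate, per window, the k-tuple frequencies and the local GC Skew."""
    features = []
    for window in split_windows(seq, proportions):
        features.extend(ktuple_frequencies(window, k))
        features.append(gc_skew(window))
    return features
```

For k = 2 and three windows, `window_features` returns 3 x (16 + 1) = 51 values; a vector of this kind could then be passed to a classifier such as a random forest.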
Keywords/Search Tags: DNA replication origin, pseudo k-tuple nucleotide composition, nucleotide bias, GC Skew, pseudo k-tuple GC composition