
A Research On The DNA Words In The Primary Structure Of DNA Based On The Distribution Of Base Sequence

Posted on: 2017-03-01
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Z Li
GTID: 1360330503463230
Subject: Epidemiology and Health Statistics
Abstract/Summary:
The genome is the carrier of genetic information and has important implications for biological traits. A genome is a cell's complete set of DNA, and its functions are mainly determined by its primary structure, namely its base sequence. More and more genomes have been sequenced, but functional annotation has not progressed at the same pace, so decoding these cryptograms of life is highly meaningful. In bioinformatics, a genome is viewed as a cryptogram written in a language of four letters: A, G, C and T. Extracting the DNA words of this language is believed to be the key to decoding the cryptogram, and defining the features of these words is the main difficulty.

In this study, non-uniform distribution and integrity are taken to be two important features of a word, and an algorithm named Nu-Int was developed to extract words from one-dimensional long text. The results of the negative and positive controls support the validity of these two features. The positive-control results further suggest that the words extracted from different texts can be viewed as the “information signature” of the corresponding text. The algorithm was then applied to the genomes of Saccharomyces cerevisiae and 10 strains of Escherichia coli. Comparison of the resulting DNA vocabularies showed differences and similarities not only among genomes but also among DNA strands. These DNA vocabularies can therefore also be viewed as the “information signature” of different chromosomes, and the algorithm could be extended to taxonomy and evolution. The relations between DNA words and gene functions were then explored by applying logistic models to DNA words and GO terms, and some correlations were found.

Before the two key features can be evaluated, the computational region must be defined. In the above study, every DNA strand was
analyzed as a whole because these genomes are simple, but a complicated genome needs to be split into smaller regions. In this research, splitting the computational region is viewed as the identification of contexts, which boils down to the detection of change points. Because effective algorithms for detecting change points in symbol sequences are lacking, a new algorithm was developed. The uniform distribution is viewed as a straight line in space, and a symbol sequence is transformed into a broken line around this straight line. When a symbol sequence is a stationary series, its broken line can be viewed as a stochastic disturbance of the straight line. For an arbitrary symbol sequence, the degree to which the broken line deviates from the straight line can be used to assess whether the sequence follows the uniform distribution. Because the distribution function of this deviation degree is difficult to obtain, numerical simulation is used to estimate the probability that a symbol sequence deviates from the uniform distribution. According to the results on simulated data (negative and positive controls) and real data, the algorithm performs satisfactorily.

With the above work on word features and context identification, two feasible algorithms were developed and a new direction was shown for building a DNA vocabulary de novo. On this basis, the detection of synonyms was explored. Synonymy is a common linguistic phenomenon, and it is supposed that it should also exist in the genome, which is itself an information carrier. Classifying words by the information they carry is important for exploring their meanings. Just as in natural language, synonyms may have different symbolic constitutions, so classifying words with different appearances according to their meanings may be even more important for the genome. There are currently few studies on detecting synonyms without a dictionary, and the key difficulty is how to
define the features of synonyms. In this research, it is believed that genetic information comes from DNA sequence with non-uniform distribution; it is further inferred that information is distribution, and that a difference in information is a difference in distributions. Because words are information carriers, the distributions of synonyms should therefore be similar, and the detection of synonyms can be viewed as a consistency test of arbitrary distributions. Some methods for such tests already exist, but they are not suitable for detecting synonyms. Therefore, based on the centroid under the cumulative probability curve, an index called the transformed centroid was established to describe the features of a distribution, and a consistency test of arbitrary distributions was developed for single and multiple samples. According to the results on simulated data (negative and positive controls) and real data, this algorithm gives satisfactory results for the detection of synonyms.

In conclusion, this research explored the features of DNA words, the identification of contexts and the detection of synonyms without the assistance of external information, and developed the corresponding algorithms. Although the overall performance of the three algorithms is satisfactory, some defects remain. In the Nu-Int algorithm, bidirectional search was not included, and shorter words have larger errors. In the context-identification algorithm, the numerical simulation has two defects: it is slow, and its results cannot be perfectly reproduced. In the synonym-detection algorithm, the model of the error distribution has nonrandom error and still needs improvement. In addition, in the association analysis between DNA words and gene functions, most functions could not be correlated with DNA words. The reasons may
be that the understanding of gene functions is insufficient, that logistic models are not enough to reflect the complicated relations, or that the algorithm is imperfect and introduces too many errors. Finally, although this research has some defects, it presents a feasible research direction for identifying words in the primary structure of DNA, provides a reference for exploring the meanings of words, and lays a research basis for developing a DNA dictionary.
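
The broken-line idea used for context identification can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the deviation measure (the largest Euclidean distance between the cumulative symbol counts and the straight line expected under a uniform distribution), the function names `max_deviation` and `uniformity_p_value`, and the parameters `n_sim` and `seed` are all assumptions made for illustration. The sketch only tests whether one segment looks uniform, using numerical simulation in place of an analytic distribution of the deviation degree; it does not locate change points.

```python
import random

ALPHABET = "ACGT"  # the four-letter alphabet named in the abstract


def max_deviation(seq, alphabet=ALPHABET):
    """Largest Euclidean distance between the cumulative-count path of `seq`
    (the "broken line") and the straight line expected under uniformity.
    This particular distance is an illustrative assumption."""
    k = len(alphabet)
    counts = dict.fromkeys(alphabet, 0)
    worst = 0.0
    for i, symbol in enumerate(seq, start=1):
        counts[symbol] += 1
        expected = i / k  # each symbol's expected count after i positions
        dist = sum((c - expected) ** 2 for c in counts.values()) ** 0.5
        worst = max(worst, dist)
    return worst


def uniformity_p_value(seq, n_sim=500, alphabet=ALPHABET, seed=0):
    """Monte Carlo estimate of how often a uniform random sequence of the
    same length deviates at least as much as `seq` does."""
    rng = random.Random(seed)  # fixed seed so this sketch is repeatable
    observed = max_deviation(seq, alphabet)
    hits = 0
    for _ in range(n_sim):
        simulated = rng.choices(alphabet, k=len(seq))
        if max_deviation(simulated, alphabet) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    balanced = "".join(rng.choices(ALPHABET, k=300))
    skewed = "A" * 150 + "".join(rng.choices(ALPHABET, k=150))
    print("balanced:", uniformity_p_value(balanced))
    print("skewed:  ", uniformity_p_value(skewed))
```

A small estimated probability suggests that the segment departs from the uniform line. The cost grows with both the sequence length and `n_sim`, which is consistent with the slow computation speed the abstract reports for the simulation step.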
Keywords/Search Tags: DNA word, non-uniform distribution, integrity, detection of change point, consistency test of arbitrary distribution