Research On Signal Sequences Analysis And Related Characters Of Gene Splicing

Posted on: 2007-03-14
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C Yan
GTID: 1118360215970549
Subject: Control Science and Engineering
Abstract/Summary:
With the rapid development of biotechnology and computer technology, bioinformatics has become one of the most active interdisciplinary fields. Splicing is an important step in gene expression: the spliced sequence directly determines the mature transcript and, through it, the protein product. Alternative splicing allows a single gene to express different proteins, which contributes to the complexity of life. Research on splicing and alternative splicing helps explain gene expression and has become one of the main research focuses of bioinformatics. This thesis studies the signal sequences and related characteristics of gene splicing and alternative splicing. Its central work and original contributions are as follows:

(1) Splice site identification in coding regions. One important aim of splice sequence analysis is to identify the precise positions of splice sites, both donor sites and acceptor sites. In this thesis, a hidden Markov model serves as the main model for splice site identification. Sub-models of the signal sites are built according to the correlations between bases in the donor and acceptor site signals. Because the strength of the splice signal alone is not enough to locate the sites precisely, two second-order hidden Markov models are built for the flanking sequences of the splice sites. Combining the signal models with the sequence models yields an integrated model for splice site identification. Tested on real human gene sequences, the model achieves performance comparable to that of similar current software.

(2) Splice site identification in untranslated regions. The untranslated regions (UTRs) of genes are also spliced during gene expression, and their exons are preserved in the transcripts, but these exons are not translated into amino acids. Because both the exons and the introns of untranslated regions are non-coding, the transition from protein-coding to non-coding DNA is absent, which makes identifying splice sites embedded in untranslated regions a challenge in bioinformatics. To improve identification performance, this thesis uses a support vector machine as the identification framework. Considering the close relationship between splice site selection and the composition of the adjacent sequences, a new kernel, the position weight subsequence kernel, is constructed. Through a transformation of this kernel, both the content and the position information of subsequences are integrated, which better reflects the actual splicing mechanism. Tested on real 5' UTR splice sequences from human genes, the models perform comparably to existing software for UTR splice site identification, and some performance measures are better.

(3) Oligonucleotide motif finding. Signal sites are typically flanked by conserved short sequences, called oligonucleotide motifs, which play an important role in signal regulation. Finding these conserved motifs helps both to identify splice sites and to understand the biological mechanism of splicing. To this end, the thesis presents a motif finding algorithm based on the maximum entropy distribution. Because candidate oligonucleotides differ in information gain, a stepwise selection method chooses those with outstanding information gain as motifs. For long signal sequences, however, this algorithm costs so much time and space that it becomes impractical. To solve this problem, the thesis decomposes longer sequences into many shorter snippets. To preserve the correlation between snippets, the sequences adjoining them are also considered, rather than simply cutting the longer sequences apart. This method reduces the time and space requirements while preserving global information. Since only a few of the many candidate motifs are real motifs, a threshold is set: a candidate motif whose occurrence frequency falls below the threshold is excluded from the candidate set. With the selected motifs, signals can be separated from decoys, which indicates that these motifs capture the characteristics of the signals.

(4) Conservation of alternative splicing between species. Alternative splicing is a prevalent phenomenon in vertebrates that greatly enriches the protein products expressed by genes. Selecting different sites for splicing produces a variety of translation products, which may lead to variation between species, the occurrence of disease, and changes in basic biological function. Analyzing the conservation of alternative splicing between species reveals both the conserved alternative splicing patterns and the species-specific patterns that arose during evolution. To analyze the relationship between alternative splicing and species evolution in detail, different features are used to examine the evolutionary relationship of a basic alternative splicing pattern, exon skipping, between human and mouse. Most of the features show strong similarity between the two species. This indicates that human and mouse inherited a similar alternative splicing pattern from their common ancestor and are closely related, which is consistent with recent research. At the same time, some features are specific to a single species and may reflect that species' particular characteristics.
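
Two steps from the motif-finding part of the abstract, breaking long sequences into shorter snippets that keep their adjoining bases and discarding candidate motifs whose occurrence frequency is below a threshold, can be illustrated with a minimal Python sketch. This is not the thesis's algorithm: the maximum entropy model and the information-gain selection are not reproduced, and the function names, the fixed-overlap window scheme, and the per-sequence frequency definition are all assumptions made for illustration.

```python
from collections import Counter


def split_with_overlap(seq, size, overlap):
    """Break seq into windows of length `size` (hypothetical helper).

    Consecutive windows share `overlap` bases, so the sequence adjoining
    each cut point is kept instead of being severed.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    snippets = []
    start = 0
    while True:
        snippets.append(seq[start:start + size])
        if start + size >= len(seq):
            break
        start += step
    return snippets


def frequent_kmers(sequences, k, min_fraction):
    """Return the k-mers found in at least `min_fraction` of the sequences.

    Assumption: "occurrence frequency" is taken as the fraction of
    sequences containing the candidate; the thesis may define it differently.
    """
    counts = Counter()
    for seq in sequences:
        # Count each k-mer once per sequence so that a single repetitive
        # sequence cannot dominate the tally.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    cutoff = min_fraction * len(sequences)
    return {kmer: n for kmer, n in counts.items() if n >= cutoff}
```

In this sketch, choosing `overlap >= k - 1` guarantees that every k-mer of the original sequence appears whole in at least one snippet, which is one simple way to avoid losing a candidate motif that straddles a cut point.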
Keywords/Search Tags:bioinformatics, splice sites identification, motif finding, exon skipping splicing, evolution analysis, hidden Markov models, support vector machine