Study On The Identification Of Eukaryotic Gene Splice Site Algorithms

Posted on: 2011-09-17
Degree: Master
Type: Thesis
Country: China
Candidate: J J Lv
GTID: 2120330332460091
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Alternative splicing in higher eukaryotes is a key regulatory mechanism of gene expression: it generates numerous transcripts from a single protein-coding gene, which greatly increases the use of genetic information. A growing number of examples show that alternative splicing is frequently associated with human disease, and alterations in splicing patterns can contribute to cancer. A key question in alternative splicing regulation is how splice sites are recognized within the vast genomic sequence. Focusing on this question, this thesis builds splice site identification models for the coding regions and the untranslated regions of eukaryotic genes. The central work is as follows:

1. Splice site identification in eukaryotic gene coding regions. The method considers the site signal information, the flanking sequence information of splice sites, the secondary structure information of the flanking sequences, the different statistical characteristics of base composition at donor and acceptor sites, and the different mechanisms of action of splicing factors. On this basis, four models were built separately: a donor site signal model, an acceptor site signal model, a donor site sequence model, and an acceptor site sequence model. The Mfold package in the Vienna software was then used to predict the most stable secondary structure of the flanking sequences. Each predicted structure was converted to a string over a two-symbol alphabet (S and L). Combining the S and L symbols with the four-letter nucleotide alphabet converts each sequence to a sequence over an eight-letter alphabet. These combined sequence-structure strings were used to train the signal and sequence models above. The integrated models, which combine signal, sequence, and structure information, achieved good performance for splice site identification.

2. Splice site identification in eukaryotic gene untranslated regions (UTRs). Like the coding regions, the UTRs of eukaryotic genes are also spliced during gene transcription, but their exons are not translated into protein. Because the state transition from coding to non-coding is absent, the exons and introns of untranslated regions are all non-coding sequences, which makes identifying splice sites embedded in UTRs more challenging. To make effective use of the characteristics and dependencies of nucleotides in the region surrounding splice sites, and thereby improve identification performance, this thesis proposes a method for splice site detection in UTRs that combines statistical characteristics with machine learning. The method has two stages: statistical models are built in the first stage, and a support vector machine (SVM) with a polynomial kernel is used in the second. The first stage serves as a pre-processing step for the SVM and takes UTR sequences as its input; it models the compositional features and nucleotide dependencies around splice site regions as probabilistic parameters. These parameters are then fed into the SVM, which combines them nonlinearly to predict splice sites. Tested on actual human 5' UTR splice sequences, the models performed well.
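The eight-letter sequence-structure encoding described in part 1 can be illustrated with a minimal sketch. The abstract does not define S and L or name the eight letters, so several things here are assumptions: the predicted structure is taken to be in dot-bracket notation, S is taken to mean a paired (stem) position and L an unpaired (loop) position, and the combined alphabet is written as uppercase nucleotides for S and lowercase for L. The function names are hypothetical.

```python
# Minimal sketch of a sequence-structure encoding (assumptions noted below).
# Assumption: structures are given in dot-bracket notation.
# Assumption: S = paired (stem) position, L = unpaired (loop) position.
# Assumption: the eight letters are A, C, G, U/T in upper case (S) or lower case (L).

NUCLEOTIDES = set("ACGTU")


def structure_to_sl(dot_bracket):
    """Convert a dot-bracket structure to a string over the two symbols S and L."""
    symbols = []
    for ch in dot_bracket:
        if ch in "()":
            symbols.append("S")  # base-paired position
        elif ch == ".":
            symbols.append("L")  # unpaired position
        else:
            raise ValueError(f"unexpected structure character: {ch!r}")
    return "".join(symbols)


def encode_eight_letter(seq, dot_bracket):
    """Combine nucleotides with S/L symbols into one eight-letter string."""
    seq = seq.upper()
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have the same length")
    if not set(seq) <= NUCLEOTIDES:
        raise ValueError("sequence contains non-nucleotide characters")
    sl = structure_to_sl(dot_bracket)
    # Upper case marks a paired nucleotide, lower case an unpaired one.
    return "".join(nt if s == "S" else nt.lower() for nt, s in zip(seq, sl))


print(encode_eight_letter("GGGAAAUCCC", "(((....)))"))  # GGGaaauCCC
```

Any one-to-one mapping from (nucleotide, S/L) pairs to eight distinct symbols would serve the same purpose; the case-based mapping is chosen here only because it keeps the encoded string readable.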
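The first stage of the two-stage method in part 2 turns a sequence window into probabilistic parameters. The abstract does not specify the statistical model, so the sketch below is an assumption: it uses a position-specific first-order Markov model (each nucleotide conditioned on its predecessor) trained separately on true and false splice site windows, and emits one log-odds value per position as the feature vector. The function names, the pseudocount, and the toy sequences in the usage lines are all hypothetical.

```python
import math

ALPHABET = "ACGT"


def train_markov(seqs, pseudocount=1.0):
    """Estimate P(x_i | x_{i-1}) at every position i from equal-length windows.

    Position 0 has no predecessor, so it is keyed by the empty string.
    """
    length = len(seqs[0])
    counts = [{} for _ in range(length)]
    for s in seqs:
        if len(s) != length:
            raise ValueError("all training windows must have the same length")
        for i, cur in enumerate(s):
            prev = s[i - 1] if i > 0 else ""
            row = counts[i].setdefault(prev, {})
            row[cur] = row.get(cur, 0.0) + 1.0
    model = []
    for i in range(length):
        table = {}
        for prev in ([""] if i == 0 else ALPHABET):
            row = counts[i].get(prev, {})
            total = sum(row.get(c, 0.0) for c in ALPHABET) + pseudocount * len(ALPHABET)
            # Additive smoothing keeps every probability strictly positive.
            table[prev] = {c: (row.get(c, 0.0) + pseudocount) / total for c in ALPHABET}
        model.append(table)
    return model


def log_odds_features(seq, pos_model, neg_model):
    """One log-odds value per position: true-site model versus false-site model."""
    feats = []
    for i, cur in enumerate(seq):
        prev = seq[i - 1] if i > 0 else ""
        feats.append(math.log(pos_model[i][prev][cur] / neg_model[i][prev][cur]))
    return feats


# Toy windows, invented for illustration only.
true_sites = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA"]
false_sites = ["ACCTTGCA", "TTGACCAG", "GATCGATC"]
pos_model = train_markov(true_sites)
neg_model = train_markov(false_sites)
features = log_odds_features("AGGTAAGT", pos_model, neg_model)
```

In the second stage, feature vectors like `features` would be passed to an SVM with a polynomial kernel, for example scikit-learn's `SVC(kernel="poly")`; that step is left out here so the sketch depends only on the standard library.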
Keywords/Search Tags: bioinformatics, alternative splicing, splice site identification, UTR, RNA secondary structures