
Research Of RNA Splicing Recognition

Posted on: 2010-05-28 | Degree: Master | Type: Thesis
Country: China | Candidate: X Q Yan | Full Text: PDF
GTID: 2178360272496381 | Subject: Computer application technology
Abstract/Summary:
With the completion of the Human Genome Project, biology has entered the post-genomic era. The study of gene expression regulation and gene function has become a core topic in bioinformatics, and genome structure analysis, gene identification, and function prediction are receiving growing attention from researchers worldwide. Structural analysis and feature extraction of genomic information by experimental, statistical, and machine learning methods provide the basis needed to study and identify gene function. RNA splicing is an important step in eukaryotic gene expression. It is very common in higher eukaryotes, particularly in the human genome, takes complex forms, and is closely linked to gene expression regulation, signal transduction, and disease. For this reason, RNA splicing has become an active research topic in functional genomics. The work matters in theory for functional genomics and in practice for drug design and disease control.
A gene is a DNA sequence that carries genetic information and is the basic unit of heredity controlling traits. The nucleotide sequence of DNA stores both the information that encodes the amino acid sequences of proteins and the information that regulates gene expression, so it holds the most basic information of life, and identifying functional regions of DNA is therefore very important. Gene identification uses biological experiments or computational methods to find the fragments of a DNA sequence that have biological function.
Experiments show that splicing is common in eukaryotic genes. A eukaryotic gene is a mosaic of two kinds of segments: coding regions, which are expressed and are called "exons", and non-coding regions, which are removed from the mRNA and are called "introns". As DNA is transcribed and processed into mRNA, the introns are cut out of the transcript; this is RNA splicing. The junctions between exons and introns (splice sites) are therefore the key to gene identification. Sequence analysis shows that almost every intron begins with the two bases GT at its 5' end and ends with the two bases AG at its 3' end. Because these dinucleotides are highly conserved and widespread, this pattern is known as the GT-AG rule, and it makes it possible to define true and false samples for recognition. At present, the vast majority of automatic splice site recognition algorithms decide whether a GT or AG is a real splice site from the sequence segments adjacent to it. The key is therefore to extract the statistical and biological characteristics underlying these segments more effectively and to design better classification algorithms. This thesis computes statistics on sequence conservation near true and false splice sites, which provides a basis for further prediction.
We use the support vector machine (SVM) algorithm to identify splice sites. The SVM is a machine learning method grounded in statistical learning theory that follows the principle of structural risk minimization, so it is widely used in pattern recognition and, in recent years, in biological sequence analysis. An SVM extracts statistical characteristics from the training samples well, and the resulting non-linear classifier distinguishes true from false samples more effectively than earlier approaches.
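As an illustration of how the GT-AG rule yields candidate samples, here is a minimal Python sketch that scans a sequence for GT (candidate donor) and AG (candidate acceptor) dinucleotides and cuts out the flanking window around each. The function name `candidate_sites`, the `flank` parameter, and the toy sequence are assumptions made for this example; the thesis does not state its window sizes.

```python
def candidate_sites(seq, motif, flank=5):
    """Return (position, window) for every occurrence of `motif` in `seq`.

    `flank` bases are kept on each side of the motif, so each window has
    length 2 * flank + len(motif). Occurrences too close to either end of
    the sequence to fill a complete window are skipped.
    """
    seq = seq.upper()
    sites = []
    last_start = len(seq) - len(motif) - flank
    for i in range(flank, last_start + 1):
        if seq[i:i + len(motif)] == motif:
            window = seq[i - flank:i + len(motif) + flank]
            sites.append((i, window))
    return sites


# Toy sequence, invented for illustration (not taken from HS3D).
toy = "CCAAGGTAAGTCTTTTCTTTCAGGCTGA"

# Every GT is a candidate 5' (donor) site, every AG a candidate 3'
# (acceptor) site. Most candidates in real genomic DNA are false sites.
donors = candidate_sites(toy, "GT", flank=3)
acceptors = candidate_sites(toy, "AG", flank=3)

for position, window in donors:
    print("donor candidate", position, window)
for position, window in acceptors:
    print("acceptor candidate", position, window)
```

Each window would then be labeled true or false against an annotated data set and passed on to feature extraction.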
To build our SVM model for splice site prediction, we selected data from HS3D, a human splice site database whose sequences are drawn from GenBank. We split the data into a training set and a test set, extracted different features from them, and preprocessed and formatted the results. We then trained the SVM model on the training set and used it to classify the test set. Whether the SVM performs well depends on model selection, which covers the choice of kernel function and, once the kernel is chosen, the choice of its parameters. We chose the RBF kernel and used k-fold cross-validation to select the optimal parameters. Our statistical analysis of splice site sequences shows that, besides obeying the GT-AG rule, splice sites share conserved consensus sequences, which gives us additional information for recognition.
For recognition by machine learning, feature extraction from the training set is critical. We extracted three kinds of feature vectors: one based on base-combination probabilities, one based on a four-dimensional vector encoding of the bases, and one based on EIIP (electron-ion interaction pseudopotential) values. The experiments show that each feature vector contributes to site prediction, so we combined the features and achieved good results. Compared with other recognition methods, our prediction results improve because we add the base-combination probability and EIIP features. EIIP values describe the distribution of free-electron energy along a DNA sequence, so they reflect the structural and physicochemical properties of the sequence around a candidate splice site, which offers a new approach to splice site identification.
False sites that satisfy the GT-AG rule far outnumber true splice sites, so the false samples in the training set must be many times more numerous than the true samples. As a result, the recognition rate for true samples cannot be guaranteed. We propose a weighted SVM to address this problem, and experiments comparing it with other methods show good results. We also briefly introduce alternative splice sites and, based on the work in this thesis and our prediction results, suggest a new lead for studying them.
Our experimental results indicate that strengthening structural information alone does not improve splice site recognition much, whereas combining it with specific sequence statistics separates true splice sites from false ones better. Because organisms are complex systems, incorporating features with specific biological meaning, such as the EIIP values used in this thesis to predict splice sites, allows deeper recognition research and a better understanding of the splicing mechanism. From the perspective of pattern recognition and machine learning, extracting decisive, highly accurate features of splice sites is the key to recognition. A future direction for bioinformatics is to apply information science theory and technology to extract effective features and so improve the accuracy of eukaryotic gene prediction.
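Two of the feature encodings and the class-imbalance idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "four-dimensional vector" is read here as one-hot encoding of each base, the EIIP numbers are the values commonly cited in the literature (the abstract does not list the ones it used), and the inverse-frequency weights are one common heuristic rather than the weighting scheme of the thesis. All function names are hypothetical.

```python
# Commonly cited EIIP values per nucleotide. Assumption: the abstract
# does not state the exact values used in the thesis.
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335}

# Assumption: "four-dimensional vector" means one-hot encoding per base.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_four_dim(window):
    """Flatten a sequence window into 4 binary features per base."""
    return [bit for base in window.upper() for bit in ONE_HOT[base]]


def encode_eiip(window):
    """Map each base of a sequence window to its EIIP value."""
    return [EIIP[base] for base in window.upper()]


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    The rare class (true sites) receives a larger weight, so errors on it
    can be penalized more heavily during training.
    """
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Combined feature vector for one toy window (invented for illustration).
window = "AAGGTAAG"
features = encode_four_dim(window) + encode_eiip(window)

# One true site (label 1) against four false sites (label 0).
weights = class_weights([1, 0, 0, 0, 0])
print(len(features), weights)
```

Weights of this kind can be supplied to an SVM library as per-class penalty factors, for example through the `class_weight` parameter of scikit-learn's `SVC`, alongside an RBF kernel whose parameters are chosen by k-fold cross-validation.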
Keywords/Search Tags: RNA splicing, Splice site, Support Vector Machines, Recognition, Alternative Splicing