With the development of gene sequencing technologies, the volume of genome sequence data has grown explosively, and there is a pressing need for tools to analyze these DNA sequences. The first step of such analysis is "gene recognition" or "gene prediction", which identifies the coding regions whose information is used to synthesize proteins. The main difficulty lies in gene prediction for eukaryotes. Unlike in prokaryotes, the coding regions of eukaryotic genes are not contiguous: a gene contains both exons and introns, and only the exons are ultimately translated into protein. The boundary between an exon and an intron is called a splice site, so the prediction of splice sites is the key task in eukaryotic gene recognition. This task can be cast as a binary classification problem over DNA text.

Support vector machines (SVMs) and related kernel methods have received wide attention in research on splice site prediction. Two kinds of kernel functions are commonly used in bioinformatics: one is based on a feature space, and the other is based on computing the similarity of the sequences, also called a string kernel function. String kernels have achieved state-of-the-art performance in splice site prediction, and among the string kernels proposed for this task, the Weighted Degree (WD) kernel performs best.

This dissertation first analyzes the effectiveness of the WD kernel and hypothesizes that its performance is related to the locations of conserved bases. It then defines three variables: two describe the distribution of the four bases at a given position in the positive dataset and the negative dataset respectively, and the third measures the difference between these two distributions. Based on these variables, the concept of a "key factor" is defined to measure the importance of a position when the WD kernel is used to predict splice sites. An experiment applying this concept is conducted on a public dataset: the key factor is calculated at every position, and the positions with high key factors are selected as key positions. By removing and retaining the base information at these positions, this dissertation shows that positional information has a great effect on the performance of the WD kernel and that the key factor can be used to describe the importance of a position.

Because positional information strongly affects the performance of the WD kernel, this dissertation extends the notion of positional importance: the effect of a position can be positive or negative. Based on this extension, a "confusing factor" is defined to find confusing positions, which may have a misleading effect on the similarity calculation. Using the key positions and confusing positions found in this way, each position is assigned a weight according to its effect, and these weights are used when calculating the WD kernel function. In this dissertation, the kernel that uses such position weights is called the "Adaptive WD" kernel. The results show that the Adaptive WD kernel achieves better performance than the WD kernel on two public datasets. To improve performance further, this dissertation uses SVMs with the Adaptive WD kernel as base classifiers and builds a Bagging classifier and an AdaBoost classifier from them separately. The results show that the performance of both new methods improves by about 2%.
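
To illustrate how position weights could enter a WD-style kernel, the following is a minimal Python sketch. It uses the commonly cited form of the WD kernel, in which matching substrings of length d at the same position contribute a degree weight 2(D - d + 1) / (D(D + 1)) for maximum degree D. How the position weights are applied here (each matching substring is scaled by the weight of its starting position), together with the names `wd_kernel` and `pos_weights`, is an assumption made for illustration; the abstract does not specify the exact weighting scheme of the Adaptive WD kernel.

```python
def wd_kernel(x, y, degree=3, pos_weights=None):
    """Weighted Degree kernel between two equal-length DNA strings.

    pos_weights is a hypothetical per-position weight list; with
    pos_weights=None every position has weight 1.0 and the function
    reduces to the plain WD kernel.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    n = len(x)
    if pos_weights is None:
        pos_weights = [1.0] * n
    if len(pos_weights) != n:
        raise ValueError("pos_weights must have one entry per position")

    total = 0.0
    for d in range(1, degree + 1):
        # Degree weight of the standard WD kernel: shorter matches count more.
        beta = 2.0 * (degree - d + 1) / (degree * (degree + 1))
        for start in range(n - d + 1):
            # Count a match only if both sequences share the same
            # length-d substring at the same position.
            if x[start:start + d] == y[start:start + d]:
                # Assumption: scale by the weight of the starting position.
                total += beta * pos_weights[start]
    return total


if __name__ == "__main__":
    a, b = "ACGT", "ACGA"
    print(wd_kernel(a, b, degree=2))                            # plain WD
    print(wd_kernel(a, b, degree=2, pos_weights=[2, 1, 1, 0]))  # weighted
```

Under this assumption, setting the weight of a confusing position to a small value reduces its contribution to the similarity score, while a key position can be emphasized with a larger weight. A kernel matrix built with such a function could be passed to any SVM implementation that accepts precomputed kernels.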