
Identification Of Gene Functional Sites Based On Sequence Component And Position Features

Posted on: 2013-05-01 Degree: Master Type: Thesis
Country: China Candidate: J L Li
GTID: 2230330374971040 Subject: Bioinformatics
Abstract/Summary:
The identification of gene functional sites is essential to the analysis and annotation of gene sequences. Accurate identification of gene functional sites has extensive biological significance. Existing methods for predicting gene functional sites have achieved an acceptable level of accuracy. However, they have limitations.
1) It is of prime importance to further increase prediction accuracy, especially since the number of pseudo splice sites in a genomic sequence is so enormous that even a subtle improvement in prediction accuracy could substantially change the absolute number of pseudo sites in the predicted results.
2) Available algorithms are mainly based on Weblogo, which makes separate information content graphs for positives and negatives instead of an integrated graph for both. Moreover, the application of these graphs lacks quantitative criteria, so that even with the same datasets, the number and position of consensus bases determined by different researchers can differ.
In this paper, taking splice sites and promoters as examples, we developed a new method for identifying gene functional sites. First, we quantitatively determined the window length and the number and position of the consensus bases by a chi-square test; second, we extracted the sequence component and position features of the consensus sites; finally, we constructed an SVM classifier. The prediction results are described as follows.
Splice site prediction. Based on a support vector machine (SVM), we constructed a novel classification model and applied it to the HS3D and NN269 datasets. In this section, we first quantitatively determined the number and location of the consensus bases by a chi-square test, and then extracted the sequence component and position features of the consensus sites.
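The chi-square selection of consensus positions can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes aligned, equal-length sequences containing only A, C, G and T, builds a 2x4 table (class by base) at each position, and keeps positions whose statistic exceeds the 0.05 critical value for 3 degrees of freedom. The function names `position_chi2` and `consensus_positions` and the choice of threshold are hypothetical.

```python
from collections import Counter

BASES = "ACGT"
# Chi-square critical value for 3 degrees of freedom at alpha = 0.05
# (assumed significance level; the abstract does not state one).
CHI2_CRIT_DF3_P05 = 7.815


def position_chi2(pos_seqs, neg_seqs, i):
    """Chi-square statistic of the 2x4 table (class x base) at position i."""
    pos = Counter(s[i] for s in pos_seqs)
    neg = Counter(s[i] for s in neg_seqs)
    n_pos, n_neg = len(pos_seqs), len(neg_seqs)
    total = n_pos + n_neg
    chi2 = 0.0
    for base in BASES:
        column = pos[base] + neg[base]
        if column == 0:
            # Base never observed here; it contributes nothing.
            # (Strictly, the degrees of freedom shrink in this case.)
            continue
        for observed, row in ((pos[base], n_pos), (neg[base], n_neg)):
            expected = row * column / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def consensus_positions(pos_seqs, neg_seqs, threshold=CHI2_CRIT_DF3_P05):
    """Indices where base usage differs significantly between the classes."""
    length = len(pos_seqs[0])
    return [i for i in range(length)
            if position_chi2(pos_seqs, neg_seqs, i) > threshold]


# Toy data: positives share a GT dinucleotide at positions 2-3.
positives = ["AAGTAA", "CAGTGA", "TCGTAC", "GGGTCT"] * 5
negatives = ["AACAAA", "CATCGA", "TCAGAC", "GGTACT"] * 5
print(consensus_positions(positives, negatives))
```

On the toy data only the two positions carrying the shared GT differ between the classes, so only those indices are selected. With real data the window length could be chosen the same way, by extending the window until positions stop passing the test.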
The optimal donor and acceptor results on the HS3D dataset are 0.922 and 0.887 (Mcc); the optimal donor and acceptor results on the NN269 dataset are 98.93 and 98.81 (auROC). Compared with the existing literature, our method produces a great improvement. These results show that our method achieves high-accuracy prediction of splice sites.
Promoter recognition. In this section, we used the eukaryotic promoter dataset (EPD) as the positive samples and the intron/exon dataset (EID) as the negative samples, and took the transcription start site as the centre site. As with splice site prediction, we obtained the location and number of bases of the conserved region through the chi-square test. Based on different combinations of sequence component and position features, we created a variety of promoter classification models. The optimal model achieves 0.832 (Mcc), which is better than existing promoter prediction tools. These results show that our method improves the precision of promoter recognition.
Gene functional site identification based on sequence component and position features obtained excellent results in splice site and promoter recognition. We hope that our methods will be applied to other gene functional sites and improve the overall accuracy of eukaryotic gene annotation.
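For readers unfamiliar with the Mcc figures quoted above, the Matthews correlation coefficient can be computed from the four confusion-matrix counts. The sketch below uses the standard definition; the function name `mcc` and the example counts are illustrative and are not taken from the thesis.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Conventional fallback when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, for illustration only.
print(round(mcc(tp=90, tn=80, fp=20, fn=10), 4))
```

The coefficient ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), which makes it more informative than plain accuracy when pseudo sites greatly outnumber true sites.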
Keywords/Search Tags: Gene functional sites identification, Chi-square test, Support Vector Machine, Component features, Position features, Splice sites, Promoter