Yeast NcRNA Prediction Based On Machine Learning

Posted on: 2008-07-11
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2120360215960623
Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Genomes contain a large number of non-coding RNAs (ncRNAs), which play key roles in processes such as the regulation of gene expression. Identifying novel ncRNAs computationally, and thereby guiding their experimental identification, has become an active field in bioinformatics. In this study, we systematically investigate genome-scale ncRNA identification in yeast using machine learning methods with k-tuple compositions as feature variables.

The first step is to construct the training set and the test set. The ninety ncRNAs located in intergenic regions were taken as the positive set, and the 1000 nt upstream and downstream of each of these ncRNAs were extracted. The 4,058 well-annotated protein-coding genes were used to construct the negative set. Because the positive and negative sets differ greatly in size, cluster analysis was used to reduce the redundancy among the 4,058 protein-coding genes, with the 3-tuple compositions of the coding regions and the 4-tuple compositions of the 1000 nt upstream and downstream sequences as features. Ninety representative genes were selected as negative samples. Finally, we randomly selected eighty samples from each of the positive and negative sets. These one hundred and sixty samples formed the training set, and the remaining twenty samples formed the test set.

The second step is to build classifiers using the Naive Bayes method and support vector machines (SVM). The results are as follows.

1. With the 3-tuple compositions of the mature ncRNAs and of the coding regions of the protein-coding genes as features, the Naive Bayes classifier reaches an accuracy of 85% on the training set and 90% on the test set, whereas the SVM reaches 98.75% and 90%, respectively.
2. With the 4-tuple compositions of the upstream sequences of the ncRNAs and the protein-coding genes as features, the Naive Bayes classifier reaches 93.75% on the training set and 75% on the test set, whereas the SVM reaches 100% and 90%, respectively.
3. With the 4-tuple compositions of the downstream sequences of the ncRNAs and the protein-coding genes as features, the Naive Bayes classifier reaches 93.75% on the training set and 85% on the test set, whereas the SVM reaches 100% and 90%, respectively.

Finally, we used the three SVM classifiers to scan the intergenic sequences of the yeast genome and found 7,469 ncRNA candidates. The candidates include 76 known ncRNAs, covering 84.4% of the 90 known ncRNAs located in intergenic regions. Our study provides useful bioinformatics support for the experimental identification of ncRNAs in the yeast genome. The strategies used here can also be applied to identify ncRNA genes in other genomes.
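The k-tuple composition features named in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the tuples were counted or normalized, so the choices here are assumptions: overlapping windows, relative frequencies, a fixed lexicographic order over A, C, G, T, and skipping any window that contains another symbol. The function name `ktuple_composition` and the example sequence are hypothetical and not taken from the thesis.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; U is mapped to T below


def ktuple_composition(seq, k):
    """Return the relative frequency of each of the 4**k k-tuples.

    Frequencies are listed in lexicographic order (AAA, AAC, ..., TTT for k=3).
    Overlapping windows are counted; windows containing symbols outside
    ACGT (for example N) are skipped. Both choices are assumptions.
    """
    seq = seq.upper().replace("U", "T")
    tuples = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(tuples, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        # Sequence shorter than k, or no valid window at all
        return [0.0] * len(tuples)
    return [counts[t] / total for t in tuples]


# Toy sequence, for illustration only
example = "ATGCGATACGCTTGA"
vec3 = ktuple_composition(example, 3)  # 64 features
vec4 = ktuple_composition(example, 4)  # 256 features
print(len(vec3), len(vec4))
```

Under these assumptions, each mature ncRNA or coding region would map to a 64-dimensional 3-tuple vector, and each 1000 nt upstream or downstream sequence to a 256-dimensional 4-tuple vector. Such vectors could then be passed to any Naive Bayes or SVM implementation.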
Keywords/Search Tags: ncRNAs, k-tuple, prediction, machine learning