
The Prediction Of MiRNA Using SVM

Posted on: 2009-06-27    Degree: Master    Type: Thesis
Country: China    Candidate: Y Y Hou    Full Text: PDF
GTID: 2120360245958732    Subject: Biochemistry and Molecular Biology
Abstract/Summary:
Objective
MicroRNAs (miRNAs) are a class of single-stranded, endogenous non-coding RNAs, averaging 21 nt in length, that are involved in diverse pathways and play an important role in gene regulation. To date, over 4000 miRNAs have been discovered, distributed across 55 species. Although some reports estimated that the numbers of miRNAs in human, Drosophila and worm are no more than 255, 110 and 120, respectively, other reports suggested that the true numbers are far greater than these estimates, so a large number of miRNAs remain to be discovered. The main methods for identifying miRNAs are cDNA cloning and computational prediction. Although cDNA cloning is direct and reliable, it is difficult to capture miRNAs expressed in a time- or tissue-specific manner, as well as miRNAs expressed at low levels. Computational identification of miRNAs has therefore become an important alternative. Its main advantage is that miRNAs expressed under different conditions can be found, providing support for experimental identification; with the help of computational models, species-specific miRNAs and homologous miRNAs in different organisms can be identified. The basic assumption of computational identification is that pre-miRNAs can fold into stem-loop structures. However, many genomic sequences can also fold into stem-loop structures, so distinguishing real pre-miRNAs from a large number of pseudo pre-miRNAs is very challenging. Thus, constructing models with both high sensitivity and high specificity is the key objective of miRNA prediction.

Methods
To develop computational models for miRNA prediction, we first constructed training and test datasets. The miRBase 9.0 database contains 391 experimentally validated human pre-miRNAs; we randomly extracted 300 of them as the positive training dataset (PTRAIN1) and took the remaining 91 pre-miRNAs as the positive test dataset (PTEST1). To construct the negative dataset, human 3'UTR sequences downloaded from the UTRdb database (ver. 22) were considered. From these sequences, 83437 segments with stem-loop structures were found that met the following conditions: ①the segment length is more than 55 nucleotides; ②there are at least 18 base pairs in the stem-loop structure; and ③the loop length is more than 3 nucleotides. These segments were taken as pseudo pre-miRNAs. From the 83437 segments, we randomly extracted 300 sequences as the negative training dataset (NTRAIN1) and 91 sequences as the negative test dataset (NTEST1). In addition, three independent positive or negative test datasets were used: ①the 134 experimentally validated human pre-miRNAs newly added in miRBase 10.0, as a positive test dataset (PTEST2); ②1000 randomly selected pseudo pre-miRNAs derived from human chromosome 19, as a negative test dataset (NTEST2); and ③the 1353 miRNAs from 20 animal and virus species other than human in miRBase 9.0, as a positive test dataset (PTEST3).
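The following is a minimal sketch of the stem-loop filtering step above, not the thesis's original pipeline. It assumes the ViennaRNA Python bindings (the RNA module) are available, and the dot-bracket parse assumes a single-hairpin fold; all names and thresholds below other than the three stated conditions are illustrative.

# Sketch of the candidate-hairpin filter described above (illustrative only).
# Assumes the ViennaRNA Python bindings ("RNA") are installed; the thesis does
# not specify which folding tool was used.
import RNA

MIN_LENGTH = 56      # segment length must be more than 55 nt
MIN_BASE_PAIRS = 18  # at least 18 base pairs in the stem-loop
MIN_LOOP_LEN = 4     # loop length must be more than 3 nt

def hairpin_stats(seq):
    """Fold a sequence and return (number of base pairs, terminal loop length)."""
    structure, _mfe = RNA.fold(seq)
    n_pairs = structure.count("(")
    # Terminal loop = unpaired stretch between the last '(' and the first ')';
    # this simple parse is only meaningful for a single hairpin.
    last_open = structure.rfind("(")
    first_close = structure.find(")")
    loop_len = first_close - last_open - 1 if 0 <= last_open < first_close else 0
    return n_pairs, loop_len

def is_candidate_hairpin(seq):
    """Apply the three filtering conditions used to collect pseudo pre-miRNAs."""
    if len(seq) < MIN_LENGTH:
        return False
    n_pairs, loop_len = hairpin_stats(seq)
    return n_pairs >= MIN_BASE_PAIRS and loop_len >= MIN_LOOP_LEN

if __name__ == "__main__":
    example = "GUGCCUACUGAACUGAGCCAGUGUACAGUGGUGAAGCACUGUAGCUCAGUUCAGUAGGCAU"
    print(is_candidate_hairpin(example))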
For each sample in the training and test datasets, 85 sequence attributes and 43 structure attributes were used as descriptors. The details are as follows: ①the compositions of bases, 2-tuples and 3-tuples were calculated, giving 84 features in total; ②the GC content was calculated; ③based on the stem-loop structure of each pre-miRNA, the following features were calculated: the number of interior loops and bulge loops, the number of interior loops or bulge loops, the biggest size of interior loops or bulge loops, the smallest size of interior loops or bulge loops, the number of interior loops or bulge loops of 1 to 10 nt, the number of interior loops or bulge loops shorter than 5 nt, the number of interior loops or bulge loops of 6 to 10 nt, the number of interior loops or bulge loops longer than 11 nt, the total size of all interior loops or bulge loops, the total size of all interior loops and bulge loops, the number of all loops, the size of the biggest loop, the size of the smallest loop, the number of base pairs, the lowest free energy and the sequence length (42 features in total); and ④a p-value was calculated by comparing the minimum free energy of each pre-miRNA with those of 1000 randomized sequences preserving the 2-tuple (dinucleotide) content.
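As an illustration of the sequence-attribute part of this feature set (the 1-, 2- and 3-tuple compositions plus GC content), here is a minimal sketch; the structure attributes and the MFE p-value are omitted, and the function names and layout are illustrative rather than taken from the thesis.

# Sketch of the 85 sequence attributes described above: 4 base frequencies,
# 16 dinucleotide and 64 trinucleotide frequencies (84 features) plus GC content.
from itertools import product

BASES = "ACGU"

def kmer_composition(seq, k):
    """Return the frequencies of all 4**k k-tuples, in a fixed lexicographic order."""
    seq = seq.upper().replace("T", "U")
    counts = {"".join(p): 0 for p in product(BASES, repeat=k)}
    total = max(len(seq) - k + 1, 1)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:          # skip windows containing ambiguous bases
            counts[kmer] += 1
    return [counts[kmer] / total for kmer in sorted(counts)]

def sequence_features(seq):
    """85 sequence attributes: 1-, 2- and 3-tuple compositions plus GC content."""
    features = []
    for k in (1, 2, 3):
        features.extend(kmer_composition(seq, k))   # 4 + 16 + 64 = 84 features
    gc = (seq.upper().count("G") + seq.upper().count("C")) / max(len(seq), 1)
    features.append(gc)                             # GC content, the 85th feature
    return features

if __name__ == "__main__":
    print(len(sequence_features("GUGCCUACUGAACUGAGCCAGUGUACAGUG")))  # prints 85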
Based on the training datasets PTRAIN1 and NTRAIN1, a classifier named MiRscreen was constructed using support vector machines (SVM). To improve its performance, a genetic algorithm (GA) was employed to search for the optimal values of C and γ, two important parameters of SVM classifiers. To improve generalization, we also built a multi-classifier system by introducing attribute bagging (AB). After redundancy removal, 73853 pseudo pre-miRNAs were obtained from the 83437 segments and divided into a training dataset NTRAIN2 with 55900 pseudo pre-miRNAs and a test dataset NTEST3 with 16953 pseudo pre-miRNAs. From NTRAIN2, 300 negative samples were randomly selected and combined with PTRAIN1 to form a training set; this procedure was repeated 25 times, yielding 25 training sets. For each training set, a number of features varying from 1 to 128 was randomly extracted and the corresponding models were constructed. Finally, the robust classifier SVMensembler50, built with 50 randomly selected features, was chosen.

Results
①The sensitivity and specificity of MiRscreen were 99.33% and 100.00% on the training dataset (PTRAIN1 and NTRAIN1), and 91.21% (83/91) and 93.41% (85/91) on the test dataset (PTEST1 and NTEST1), respectively. The overall sensitivities of MiRscreen on PTEST2 and PTEST3 were 85.82% and 88.10%, respectively. Furthermore, the sensitivity was 100.00% in eight species: simian virus 40, Marek's disease virus, rhesus lymphocryptovirus, Epstein-Barr virus, Xenopus laevis, Canis familiaris, Ovis aries and Macaca mulatta. The specificity on NTEST2 was 85.50%. Compared with six previously published methods, our model performed better, and the AUC of MiRscreen (0.921) was higher than that of each of the six methods. ②The sensitivity and specificity of SVMensembler50 were 96.51% and 91.55% on the training dataset (PTRAIN1 and NTRAIN2), and 88.13% and 91.36% on the test dataset (PTEST1 and NTEST3), respectively. The overall sensitivities of SVMensembler50 on PTEST2 and PTEST3 were 87.31% and 91.50%, respectively. Furthermore, the sensitivity was 100.00% in nine species: human cytomegalovirus, simian virus 40, Marek's disease virus, rhesus lymphocryptovirus, Epstein-Barr virus, Xenopus laevis, Canis familiaris, Ovis aries and Macaca mulatta. Compared with MiRscreen and the six previously published classifiers, SVMensembler50 performed better in both sensitivity and specificity; its AUC was 0.935, far higher than that of MiRscreen or any of the other six methods.

Conclusions
We presented two SVM-based models, MiRscreen and SVMensembler50, for the prediction of pre-miRNAs. For the first model, MiRscreen, the two performance-related parameters C and γ were optimized with a GA. The prediction accuracy on the test dataset (PTEST1 and NTEST1) was 92.31%, which was 4.00% and 5.00% higher than that of models whose C and γ were optimized by grid search with step sizes of 1 and 2, respectively. Thus, combining GA with SVM improved model performance, a conclusion that may also apply to other SVM-based classification problems. To further improve performance, we developed the second model, SVMensembler50, by incorporating the AB method into SVM. This second model not only showed good generalization and robustness, but also achieved better sensitivity and specificity.
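As a rough illustration of the attribute-bagging ensemble construction described in the Methods above (repeated negative resampling, random feature subsets and majority voting), here is a minimal sketch assuming scikit-learn and NumPy. Dataset loading, the GA search for C and γ, and the thesis's evaluation protocol are omitted; all names and parameter values below are illustrative, not taken from the thesis.

# Minimal sketch of an attribute-bagging SVM ensemble in the spirit of SVMensembler50.
# C and gamma are fixed placeholders here, whereas the thesis tuned them with a GA.
import numpy as np
from sklearn.svm import SVC

N_ROUNDS = 25          # 25 resampled training sets
N_FEATURES_SUB = 50    # 50 randomly selected features per base classifier
N_NEGATIVES = 300      # negatives drawn per round, matching the 300 positives

def train_ensemble(X_pos, X_neg_pool, rng):
    """Train one SVM per round on all positives, resampled negatives and a feature subset."""
    models = []
    for _ in range(N_ROUNDS):
        neg_idx = rng.choice(len(X_neg_pool), size=N_NEGATIVES, replace=False)
        feat_idx = rng.choice(X_pos.shape[1], size=N_FEATURES_SUB, replace=False)
        X = np.vstack([X_pos, X_neg_pool[neg_idx]])[:, feat_idx]
        y = np.array([1] * len(X_pos) + [0] * N_NEGATIVES)
        clf = SVC(kernel="rbf", C=8.0, gamma=0.125)  # placeholder values, not from the thesis
        clf.fit(X, y)
        models.append((clf, feat_idx))
    return models

def predict_ensemble(models, X):
    """Majority vote over the base classifiers, each using its own feature subset."""
    votes = np.array([clf.predict(X[:, feat_idx]) for clf, feat_idx in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_pos = rng.random((300, 128))        # stand-ins for the 300 positive feature vectors
    X_neg_pool = rng.random((2000, 128))  # stand-in for the NTRAIN2 negative pool
    models = train_ensemble(X_pos, X_neg_pool, rng)
    print(predict_ensemble(models, rng.random((5, 128))))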
Keywords/Search Tags: microRNA, genetic algorithm, support vector machines, Attribute Bagging, machine learning