Research On Predicting Method Of MiRNA Based On MiRNA Biogenesis Using One-class SVM

Posted on: 2011-08-02
Degree: Master
Type: Thesis
Country: China
Candidate: W Yan
GTID: 2120360305454962
Subject: Bioinformatics
Abstract/Summary:
MicroRNAs (miRNAs) are a class of small, single-stranded RNAs, 21~29 nt in length, produced by non-protein-coding RNA genes. By base pairing with mRNA, they regulate the expression of protein-coding genes at the post-transcriptional level and promote mRNA degradation. Mature miRNAs are processed from 60~90 nt RNA hairpin structures called pre-miRNAs. miRNAs play an important role in the regulation of gene expression, cell differentiation, and related processes. Studies have found that mutation, deletion, or over-expression of miRNAs can cause physiological abnormalities or disease, and that miRNAs are particularly relevant to the occurrence and treatment of various human cancers. Discovering new miRNAs is therefore important for the study of gene function and for the prevention and treatment of disease. Although more than 10,000 genes have been predicted to be miRNAs so far, many miRNAs remain unknown.
The main biochemical method for identifying miRNAs is cDNA cloning, which is direct and reliable. However, it has difficulty detecting miRNAs that are expressed only at specific stages or in specific tissues, as well as miRNAs expressed at low levels. Computational identification of miRNAs has therefore become an effective alternative that compensates for the limitations of cDNA cloning: it is not restricted by the temporal or tissue specificity of miRNA expression, and it offers higher throughput. Existing computational methods include MiRscan, miRseeker, and MirPred, which are mainly based on structural sequence analysis, comparative genomics, and machine learning. So far, machine learning approaches have generally given the better prediction results.
The main idea is to predict miRNAs from the hairpin in the secondary structure of the pre-miRNA. Because a very large number of sequences can fold into hairpins, however, the main goal of machine learning research in this area is to construct a classifier with both high sensitivity and high specificity.
The advantage of computational methods is that they can find miRNAs expressed under different conditions, which supports the experimental identification of miRNAs. Computational models can also help find species-specific miRNAs and homologous miRNAs across organisms.
At present, most machine learning methods for pre-miRNA prediction are based on two-class SVMs and use structural information about pre-miRNA hairpins. These methods share a common requirement: a negative dataset in the training data, and feature selection on both the training and test data. To avoid including false negative miRNA hairpin examples in the training data, which may mislead the classifier, we present a miRNA prediction algorithm called MirBio. It is based on miRNA biogenesis, is trained only on information from the positive miRNA class, and uses a one-class SVM.
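For orientation, a one-class SVM learns a boundary around the positive examples alone. The abstract does not say which variant or kernel MirBio uses, so the block below is only the standard ν-formulation, given as an assumption. Here x_i are the feature vectors of the n positive training hairpins, Φ is the kernel feature map, ξ_i are slack variables, and ν in (0, 1] bounds the fraction of training examples allowed to fall outside the boundary.

```latex
\[
\begin{aligned}
\min_{w,\;\xi,\;\rho}\quad & \tfrac{1}{2}\lVert w\rVert^{2}
    + \frac{1}{\nu n}\sum_{i=1}^{n}\xi_i - \rho \\
\text{subject to}\quad & w\cdot\Phi(x_i) \ge \rho - \xi_i,
    \qquad \xi_i \ge 0, \quad i = 1,\dots,n \\[4pt]
\text{decision rule}\quad & f(x) = \operatorname{sgn}\bigl(w\cdot\Phi(x) - \rho\bigr)
\end{aligned}
\]
```

A candidate hairpin with f(x) = +1 is treated as resembling the known miRNAs. No negative examples enter the optimization, which is the property the text relies on.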
MirBio can predict both pre-miRNAs and mature miRNAs, and it achieves relatively satisfactory results in this study.
The new algorithm in the paper divides the secondary-structure sequence of the pre-miRNA into three parts according to the maturation process of miRNA biogenesis. It extracts structural and sequence features from each part and from the cutting sites, giving 1044 features in total:
a) Free energy: the minimum free energy and the free energy of the thermodynamic ensemble resulting from the primary hairpin structure prediction.
b) Nucleotide counts: the number of occurrences of each nucleotide (A, U, G, C) in the mature miRNA segment.
c) Cutting site conservation: the 5-nt contiguous short sequences at both ends of the mature miRNA, together with structure information at the cutting sites of each mature miRNA.
d) Loop and stem size: the number of unpaired bases in the tail, mature, and hairpin parts of the predicted secondary structure of the pre-miRNA.
e) GC content: the fraction of G and C nucleotides in the structure prediction.
f) Overhangs: the number of 2-nt overhangs from the 5′ site and 3′ site to the loop start.
Before extracting features from the miRNA sequences, we predict their secondary structures with RNAfold. In addition, an SVM requires each data example to be represented as a vector of real numbers, so any categorical attributes must be converted into numeric data. The secondary structures of pre-miRNAs are depicted as sequences of brackets and dots; we introduce a new representation for RNA secondary structures.
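As a rough illustration of how a sequence and its dot-bracket string become real numbers, the sketch below computes simplified versions of features b), d), and e). The function names and the toy input are hypothetical, not taken from the thesis, and the loop-size helper assumes a structure with exactly one hairpin loop.

```python
from collections import Counter


def nucleotide_counts(seq):
    """Feature b): occurrences of A, U, G, C (DNA-style T is read as U)."""
    counts = Counter(seq.upper().replace("T", "U"))
    return [counts[base] for base in "AUGC"]


def gc_content(seq):
    """Feature e): fraction of G and C nucleotides in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def unpaired_count(dot_bracket):
    """Simplified feature d): number of unpaired bases ('.') in a structure."""
    return dot_bracket.count(".")


def hairpin_loop_size(dot_bracket):
    """Unpaired bases between the innermost base pair.

    Assumes exactly one hairpin loop; raises ValueError otherwise.
    """
    last_open = dot_bracket.rfind("(")
    first_close = dot_bracket.find(")")
    if last_open == -1 or first_close == -1 or first_close < last_open:
        raise ValueError("expected a structure with a single hairpin loop")
    return first_close - last_open - 1


# Toy input for illustration only; not a real miRNA.
toy_seq = "GGGAAAACCC"
toy_structure = "(((....)))"

features = (
    nucleotide_counts(toy_seq)
    + [gc_content(toy_seq), unpaired_count(toy_structure),
       hairpin_loop_size(toy_structure)]
)
print(features)
```

In the described pipeline the dot-bracket string would come from RNAfold. The sketch takes it as given so that it stays independent of any particular RNAfold interface.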
First, we store separately the lengths of the overhangs in the flanking regions from the 5′ site and the 3′ site to the loop start. Then, because the selected pre-miRNA secondary structures contain no pseudoknots, we can divide each secondary structure into four kinds of element (stem, inner loop, upper loop, and lower loop) and encode them as "0", "1", "2", and "3" respectively. The resulting sequence depicts the secondary structure intuitively and is much shorter than the dot-bracket sequence.
In addition, for the 1024-dimensional vector that holds the number of occurrences of every possible 5-gram in each pre-miRNA sequence, we introduce a new algorithm, Qua-Dec, to obtain the numerical features. The algorithm uses a 5-digit quaternary code to denote the index into the 1024-dimensional vector. Since there are 4^5 = 1024 possible 5-grams in total, the index ranges from 0 to 1023.
We downloaded all 706 human miRNA sequences from miRBase Release 13.0 as positive examples. To ensure that all candidates fold into hairpins, we removed those whose RNAfold-predicted secondary structures contain no hairpin loop or more than one, those whose hairpin loop is smaller than 4, and those whose sequence is longer than 100 nt. We also removed newer entries that had been verified by computational methods. The final dataset contains 495 positive human miRNA sequences.
In our experiments, we use the F1 measure, recall, and precision to evaluate the prediction results.
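The quaternary-to-decimal indexing described for Qua-Dec can be sketched as follows. The abstract does not state which nucleotide maps to which base-4 digit, so the ordering A=0, C=1, G=2, U=3 is an assumption, and the function names are hypothetical.

```python
# Assumed digit assignment; the thesis abstract does not specify the ordering.
DIGIT = {"A": 0, "C": 1, "G": 2, "U": 3}


def five_gram_index(kmer):
    """Read a 5-gram as a 5-digit base-4 number and return it in decimal."""
    index = 0
    for base in kmer:
        index = index * 4 + DIGIT[base]
    return index


def five_gram_counts(seq, k=5):
    """Count every k-gram of seq into a vector of length 4**k (1024 for k=5)."""
    seq = seq.upper().replace("T", "U")
    counts = [0] * (4 ** k)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Skip windows containing symbols outside A, C, G, U (for example N).
        if all(base in DIGIT for base in kmer):
            counts[five_gram_index(kmer)] += 1
    return counts


# Toy input for illustration only; not a real pre-miRNA.
vector = five_gram_counts("AAAAAC")
print(len(vector), five_gram_index("AAAAA"), five_gram_index("UUUUU"))
```

Under this assumed ordering, the all-A 5-gram lands at index 0 and the all-U 5-gram at index 1023, matching the 0 to 1023 range given in the text.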
To demonstrate the importance of structure and sequence information derived from the miRNA maturation process, we compare prediction results obtained with different parameters and feature sets, and with two-class and one-class SVM algorithms. Because the training set contains only positive examples, the one-class SVM is more accurate and more sensitive.
From the above, we conclude that effective features and a well-constructed one-class classifier help improve the classifier's generalization capability and sensitivity. In some respects, the method in this paper is also more sensitive and specific than other computational methods.
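For reference, the precision, recall, and F1 measure used to score such comparisons follow the standard definitions sketched below. The counts in the example are made up for illustration and are not results from the thesis.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(precision, recall, f1)
```

Recall here is the same quantity the text calls sensitivity: the fraction of real miRNAs that the classifier recovers.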
Keywords/Search Tags: miRNA, one-class SVM, miRNA biogenesis, hairpin, RNA secondary structure