
Research On MicroRNA Identification Algorithm And Disease Related MicroRNA Prediction Algorithm

Posted on: 2013-11-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: P Xuan
Full Text: PDF
GTID: 1260330422952095
Subject: Artificial Intelligence and information processing
Abstract/Summary:
MicroRNAs (miRNAs) are a set of short (about 22 nucleotides) non-coding RNAs that play significant regulatory roles in various biological processes of animals and plants. Furthermore, accumulating evidence indicates that miRNAs are associated with various human diseases. The application of bioinformatics to miRNA research has greatly promoted the development of this cutting-edge area of biology. In this thesis, we studied pre-miRNA classification, mature miRNA position prediction, and disease-related miRNA identification. The original work consists mainly of the following four parts.

(1) A novel classification method based on the support vector machine (SVM) is proposed specifically for predicting plant pre-miRNAs.

Identification of miRNAs is the first step in miRNA functional studies. Detecting miRNAs by experimental techniques is expensive and time-consuming, and it is difficult to identify miRNAs that are lowly expressed or that are expressed only in specific tissues or at specific developmental stages. Computational prediction can therefore provide potential pre-miRNA candidates for biologists. Considering the characteristics of pre-miRNAs, a classification method based on SVM is proposed. It is well established that good features and good positive/negative (real/pseudo pre-miRNA) datasets are the basis for constructing an efficient classification model. Therefore, sequence-related, structure-related, and energy-related features are extracted from real and pseudo plant pre-miRNAs, and a set of informative features is selected by our feature selection method based on a genetic algorithm. Because pseudo plant pre-miRNAs are lacking, we extract pseudo hairpin sequences from the protein-coding sequences of Arabidopsis thaliana, Oryza sativa, and Glycine max. These pseudo hairpin sequences are used as negative samples.
Considering the class imbalance between real and pseudo pre-miRNAs, the classification model (PlantMiRNAPred) is constructed by combining ensemble learning and the AdaBoost method. PlantMiRNAPred achieves more than 90% accuracy on datasets from 8 plant species: Arabidopsis thaliana, Oryza sativa, Populus trichocarpa, Physcomitrella patens, Medicago truncatula, Sorghum bicolor, Zea mays, and Glycine max. PlantMiRNAPred is therefore valuable for identifying plant pre-miRNAs. In addition, we construct a classification model, HumanMiRNAPred, with the data of human pre-miRNAs. HumanMiRNAPred achieves higher prediction performance, which helps facilitate the identification of human pre-miRNAs.

(2) A machine learning method based on the support vector machine is proposed to predict the positions of miRNAs within new pre-miRNA candidates.

Most pre-miRNA classification methods based on machine learning can distinguish real pre-miRNAs from pseudo pre-miRNAs, but few can predict the positions of miRNAs. However, to efficiently identify the actual miRNAs, their positions usually need to be given for the subsequent biological experiments. Therefore, a position prediction method is proposed. First, a miRNA:miRNA* duplex is regarded as a whole to capture the binding characteristics between a miRNA and its corresponding miRNA*. Second, we extract features from real and pseudo miRNA:miRNA*s and select the informative features to improve the prediction accuracy. Third, a two-stage sample selection algorithm is proposed to combat the serious imbalance between real and pseudo miRNA:miRNA*s. Representative negative training samples (pseudo miRNA:miRNA*s) are selected according to their distribution density in the high-dimensional sample space and their prediction deviations. The prediction method, MaturePred, achieves higher prediction accuracy than the existing methods.
MaturePred can provide more reliable animal and plant miRNA candidates for subsequent experiments.

(3) Based on an accurate measure of the functional similarity of two miRNAs, a method based on the k most similar neighboring miRNAs is proposed for predicting disease-related miRNAs.

The abnormal expression of miRNAs is one of the important causes of various diseases. Therefore, identifying human disease-related miRNAs is important for investigating their involvement in the pathogenesis of diseases. It is known that miRNAs with similar functions are often associated with similar diseases, and vice versa. Therefore, the functional similarity of two miRNAs has been successfully inferred by measuring the semantic similarity of their associated diseases. We achieve a more accurate measurement of miRNA functional similarity by considering the information content of disease terms. A new prediction algorithm, HDMP, based on the k most similar neighboring miRNAs is presented for predicting disease-related miRNAs. In addition, miRNAs that belong to the same miRNA family or are located in the same cluster are more similar to each other. We further propose a prediction algorithm that uses miRNA family and cluster information, referred to as HDMPW. HDMP and HDMPW proved successful in predicting potential disease-related miRNA candidates for 18 human diseases. HDMP can easily be extended to other diseases as miRNA-disease association data for specific diseases rapidly increase.

(4) Based on a constructed miRNA functional similarity graph, a method based on random walk is proposed for predicting disease-related miRNAs.

The miRNA functional similarity graph is constructed by calculating the functional similarity of each pair of miRNAs. A prediction algorithm based on random walk with restart, HDMPR, is proposed for predicting disease-related miRNAs.
Unlike HDMP and HDMPW, HDMPR does not consider the k most similar neighboring miRNAs; rather, it considers the global structure of the miRNA functional similarity graph. The effectiveness of HDMPR is validated on the association data of 18 human diseases. The experimental results indicate that HDMPR achieves higher prediction performance than HDMP and HDMPW for most of the 18 diseases. Overall, HDMP, HDMPW, and HDMPR are useful in providing reliable disease-related miRNA candidates for subsequent biological testing.
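
To illustrate the general idea of random walk with restart on a similarity graph, the following is a minimal sketch and not the HDMPR implementation from the thesis. The function name, the restart probability of 0.5, the choice of seeding the walk on miRNAs already known to be associated with a disease, and the four-node toy similarity matrix are all assumptions made for illustration; the abstract does not specify any of them.

```python
def random_walk_with_restart(similarity, seeds, restart=0.5, tol=1e-9, max_iter=1000):
    """Score every node of a weighted graph by its proximity to the seed nodes.

    similarity: square list of lists of non-negative edge weights (hypothetical data).
    seeds: indices of the nodes the walker starts from and restarts at.
    """
    n = len(similarity)
    # Column sums turn raw similarities into transition probabilities.
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    # Initial (and restart) distribution: uniform over the seed nodes.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)
    p = p0[:]
    for _ in range(max_iter):
        p_next = []
        for i in range(n):
            # Probability mass arriving at node i by following edges.
            walk = sum(similarity[i][j] / col_sums[j] * p[j]
                       for j in range(n) if col_sums[j] > 0)
            p_next.append((1.0 - restart) * walk + restart * p0[i])
        # Stop once the distribution changes by less than tol (L1 norm).
        converged = sum(abs(a - b) for a, b in zip(p_next, p)) < tol
        p = p_next
        if converged:
            break
    return p


# Made-up functional similarities between four hypothetical miRNAs.
toy_similarity = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.0],
    [0.1, 0.2, 0.0, 0.3],
    [0.0, 0.0, 0.3, 0.0],
]
seeds = [0]  # assumed: miRNA 0 is already known to be disease-related
scores = random_walk_with_restart(toy_similarity, seeds)
# Rank the remaining miRNAs as candidates, highest score first.
candidates = sorted((i for i in range(len(scores)) if i not in seeds),
                    key=lambda i: -scores[i])
```

In this toy graph, a candidate's score depends on every path connecting it to the seed, not only on its directly adjacent nodes, which is the sense in which a random walk uses the global structure of the graph.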
Keywords/Search Tags: pre-miRNA, mature miRNA, genetic algorithm, feature selection, class imbalance, sample selection, disease-related miRNA, random walk