
Research on Non-coding RNA Mining Based on Secondary Structure

Posted on: 2010-12-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Q Zou
Full Text: PDF
GTID: 1118360332957818
Subject: Artificial Intelligence and information processing
Abstract/Summary:
Non-coding RNA is one of the most important topics in bioinformatics. Research on non-coding RNA has been named among the top ten scientific advances of the year several times in recent years, and related work was awarded the Nobel Prize in 2006. More and more bioinformatics researchers devote themselves to mining non-coding RNA and analyzing its function. However, current mining methods are inefficient and have a high false positive rate. In this thesis, I therefore develop secondary structure prediction algorithms, improve machine learning methods for imbalanced data, and carry out in-depth research on mining non-coding RNA. The contributions of the dissertation are as follows:

(1) Three strategies are proposed for class imbalance learning problems in bioinformatics. Many learning problems in bioinformatics are class imbalanced, both because of the natural distribution of the data and because positive samples are always much more costly to obtain than negative ones. A novel classification method is proposed for training on class-imbalanced data, for tasks such as identifying snoRNA, distinguishing real microRNA precursors from pseudo ones, and mining SNPs from EST sequences. The method is based on the main idea of ensemble learning. First, the negative set (the large class) is randomly divided into several subsets of equal size. Each subset, together with the positive set, forms a class-balanced training set. Then several different classifiers are selected and trained on these balanced training sets. Once the classifiers are built, they vote on the final prediction for each new sample. In the training phase, a strategy similar to AdaBoost is used: samples misclassified by one classifier are added to the training sets of the next two classifiers. This strategy improves on the performance of the weak classifiers through voting. Experiments on five UCI data sets and three bioinformatics problems demonstrate the performance of the method.
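The ensemble procedure described in contribution (1) can be sketched roughly as follows. This is a minimal illustration and not the libID implementation: the two toy base learners, the function names `train_balanced_ensemble` and `vote`, the choice of one subset per multiple of the positive set size, and the decision to check each classifier against its own training set when forwarding misclassified samples are all assumptions made for the example.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


class NearestCentroid:
    """Toy base learner (assumption): predict the class with the closest mean."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        return min(self.centroids, key=lambda c: sq_dist(x, self.centroids[c]))


class OneNearestNeighbour:
    """Toy base learner (assumption): copy the label of the closest sample."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict_one(self, x):
        best = min(range(len(self.X)), key=lambda i: sq_dist(x, self.X[i]))
        return self.y[best]


def train_balanced_ensemble(positives, negatives, learner_factories, seed=0):
    """Train one classifier per balanced subset (positives labelled 1, negatives 0)."""
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    # Assumption: use as many subsets as the negative set is larger than the positive set.
    k = max(1, len(shuffled) // max(1, len(positives)))
    subsets = [shuffled[i::k] for i in range(k)]
    train_sets = [
        [(x, 1) for x in positives] + [(x, 0) for x in subset]
        for subset in subsets
    ]
    models = []
    for i, data in enumerate(train_sets):
        X = [x for x, _ in data]
        y = [t for _, t in data]
        model = learner_factories[i % len(learner_factories)]().fit(X, y)
        models.append(model)
        # AdaBoost-like step: forward misclassified samples to the next two training sets.
        missed = [(x, t) for x, t in data if model.predict_one(x) != t]
        for j in (i + 1, i + 2):
            if j < len(train_sets):
                train_sets[j].extend(missed)
    return models


def vote(models, x):
    """Majority vote of all trained classifiers on a new sample."""
    return Counter(m.predict_one(x) for m in models).most_common(1)[0][0]
```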
Furthermore, a software program named libID was developed.

(2) The "centroid of helix" is proposed for the first time as a novel concept in this thesis, and two novel algorithms are developed based on it. Current representations do not allow RNA secondary structures to be compared quickly. In this thesis, the novel concept "centroid of stem" is proposed to describe the position of a stem, and further concepts, such as the "distance between centroids" and the "D function", are derived from it to measure the difference between secondary structures. Both the comparative sequence analysis method and the minimum free energy method are improved on the basis of these concepts. For comparative sequence analysis, a novel prediction algorithm is proposed that does not depend on multiple sequence alignment; for the minimum free energy method, prediction performance is improved by incorporating class information.

(3) Key problems in mining microRNA are studied in depth. Homologous searching and ab initio prediction are the two methods for mining microRNA. Homologous searching is currently the main method. In this thesis, a novel searching method based on a keyword tree is proposed, which reduces the time cost while maintaining sensitivity. Its application to soybean and silkworm demonstrates the performance of the method. Ab initio prediction is based on machine learning and will be the main mining method in the future. It can find new microRNA families; however, locating the mature part is the bottleneck. In this thesis, I discuss this problem from two points of view. Although I have not solved this problem completely, my work should help further research.

(4) An algorithm for mining snoRNA is developed based on the secondary structure prediction and class imbalance learning methods mentioned above. SnoRNAs are currently mined based on target information.
With advances in functional studies, and especially with the discovery of "orphan snoRNAs", ab initio mining methods have attracted attention because they do not depend on target information. In this thesis, I propose a novel ab initio snoRNA gene mining algorithm, which is based on ensemble learning and a special secondary structure prediction algorithm. Three contributions are made to improve current mining methods: enriching the negative training set, using ensemble classifiers for the class-imbalanced data, and developing a special secondary structure prediction algorithm for extracting high-quality features, which to my knowledge has not been done before. The performance of the learning method is verified by cross validation, and the mining method is verified by experiments on genome data.
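The keyword-tree search mentioned in contribution (3) is not specified in this abstract, so the following is only a sketch of the underlying data structure: a keyword tree (trie) built from a set of known sequences and used to report every exact occurrence in a longer sequence. The function names, the toy keywords, and the restriction to exact matching are assumptions made for illustration; the method in the thesis may differ, for example in how it tolerates mismatches.

```python
END = None  # sentinel key marking the end of a keyword; never equal to a character


def build_keyword_tree(keywords):
    """Build a nested-dict keyword tree; each complete keyword is stored under END."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def search(text, root):
    """Return (start_index, keyword) for every exact keyword occurrence in text."""
    hits = []
    for start in range(len(text)):
        node = root
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break  # no keyword continues with this character
            if END in node:
                hits.append((start, node[END]))
    return hits


# Hypothetical toy keywords and query sequence, for illustration only.
tree = build_keyword_tree(["UGAGGUAG", "GGUAG", "CUAC"])
for start, keyword in search("AAUGAGGUAGUAGCUACC", tree):
    print(start, keyword)
```

All keywords are matched in a single pass over the query sequence, so the cost per position is bounded by the longest keyword rather than by the number of keywords, which is the usual reason for choosing this structure.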
Keywords/Search Tags: non-coding RNA, data mining, RNA secondary structure prediction, class imbalance classification problem, microRNA precursor, bioinformatics