
Research On Some Key Problems In The Prediction Of RNA Secondary Structure Based On Comparative Sequence Analysis

Posted on: 2019-02-04
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T H Liu
Full Text: PDF
GTID: 1360330611993001
Subject: Computer Science and Technology
Abstract/Summary:
Bioinformatics is a discipline that emerged in the 1980s at the intersection of the life sciences, mathematics, computer science, and other fields. The study of RNA has always been an important research direction in bioinformatics, and interest in it continues to grow. A growing number of studies show that RNA not only carries genetic information but also performs a variety of other important functions. Because the function of RNA is closely related to its structure, exploring that function requires studying the structure. RNA molecules degrade quickly and are difficult to crystallize, so determining RNA structure with conventional experimental methods such as nuclear magnetic resonance or X-ray crystallography is expensive and time-consuming and cannot meet the needs of large-scale data analysis. Secondary structure prediction is an important intermediate step in RNA tertiary structure prediction, and computational and mathematical prediction of secondary structure is the main way of studying RNA structure. The principal prediction methods include dynamic programming, comparative sequence analysis, combinatorial optimization, heuristic methods, and machine learning.

This dissertation studies several key problems in RNA secondary structure prediction based on comparative sequence analysis. It proposes a fast RNA secondary structure prediction model and optimizes it in several respects. The research comprises the following four parts.

First, to address the high consumption of computing resources, a fast RNA secondary structure prediction model based on the extreme learning machine is proposed. Comparative sequence analysis is the most accurate approach to RNA secondary structure prediction. The extreme learning machine is a recent machine learning method with a simple model, little manual intervention, and fast training. The proposed model combines the two. It treats RNA secondary structure prediction as a binary classification problem and consists of three parts: sample set construction, model training, and structure prediction. The experimental results show that the model achieves high prediction accuracy with fast training and prediction.

Second, to address the problem of imbalanced data, a hierarchical processing scheme based on clustering under-sampling and ensemble learning is proposed. The scheme combines clustering, sampling, and ensemble learning, and is divided into two layers. The first layer is an under-sampling method for selecting training samples based on K-means clustering, which optimizes the sample selection part of the model. The second layer is a model training method based on AdaBoost with asymmetric weight allocation, which optimizes the algorithm design part of the model. The first layer is fast, scales well, and can eliminate noise interference; the second layer has high prediction accuracy. The two layers can be used together or individually, depending on the situation. The result is a flexible, fast, and effective scheme for processing imbalanced data in RNA secondary structure prediction, with a certain degree of generality. The experimental results show that the scheme effectively solves the problem of imbalanced data and improves prediction accuracy.

Third, to address insufficiently optimized feature extraction, a feature extraction method combining neighboring column pairing information with principal component analysis is proposed. Based on an analysis of the continuity of stems, features are extracted from the pairing information of neighboring columns. Increasing the number of features can cause overfitting when samples are sparse, so feature selection and feature reduction methods are compared, and principal component analysis is adopted. The component contribution rate is used to select the distance used for the neighboring column pairing information. The experimental results show that this feature extraction method further improves prediction accuracy.

Fourth, to remove the limitation on sequence length, a method for dividing RNA sequence alignments based on a heuristic search for stems is proposed. The method sets out the principles of sequence division and designs an evaluation function based on the covariation score and the fraction of complementary bases. It uses a heuristic strategy to search for "significant" stems and then divides the sequence according to their locations, so that no stem is split across different subsequences. A new stem updating strategy based on a "stem table" is also proposed. The experimental results show that this method speeds up prediction and improves prediction accuracy to a certain extent, and that it places no limit on the length of the alignment.
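
The evaluation function in the fourth part combines a covariation score with the fraction of complementary bases, but the abstract does not give either formula. The following is only a minimal sketch under stated assumptions: mutual information between two alignment columns stands in for the covariation score, G-U wobble pairs are counted as complementary alongside the Watson-Crick pairs, and gap characters are not handled. The function names and the toy alignment are hypothetical and are not taken from the dissertation.

```python
import math
from collections import Counter

# Watson-Crick pairs plus the G-U wobble pair. Which pairs count as
# complementary is an assumption; the abstract does not list them.
COMPLEMENTARY = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def complementary_fraction(alignment, i, j):
    """Fraction of sequences whose bases at columns i and j can pair."""
    pairs = [(seq[i], seq[j]) for seq in alignment]
    return sum(p in COMPLEMENTARY for p in pairs) / len(pairs)


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    Used here as a stand-in covariation score (an assumption).
    """
    n = len(alignment)
    col_i = Counter(seq[i] for seq in alignment)
    col_j = Counter(seq[j] for seq in alignment)
    joint = Counter((seq[i], seq[j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts.
        mi += p_ab * math.log2(p_ab * n * n / (col_i[a] * col_j[b]))
    return mi


# Toy alignment (hypothetical data): columns 0 and 5 covary and always
# pair, while columns 1 and 2 are fully conserved and never pair.
toy_alignment = [
    "GAAAAC",
    "CAAAAG",
    "AAAAAU",
    "UAAAAA",
]

frac = complementary_fraction(toy_alignment, 0, 5)
mi = mutual_information(toy_alignment, 0, 5)
```

In this toy case the covarying column pair (0, 5) scores high on both measures, while the conserved pair (1, 2) scores zero on both, which is the kind of contrast a stem search could exploit when ranking candidate column pairs.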
Keywords/Search Tags: RNA secondary structure prediction, comparative sequence analysis method, extreme learning machine, imbalanced data, K-means clustering, ensemble learning, neighboring column pairing information, principal component analysis, heuristic search