Research On Biological Sequence Analysis Method Based On Machine Learning

Posted on: 2018-04-01
Degree: Master
Type: Thesis
Country: China
Candidate: H Wu
Full Text: PDF
GTID: 2370330566998428
Subject: Computer Science and Technology
Abstract/Summary:
In bioinformatics, solving a biological sequence analysis problem with machine learning involves three main steps: feature extraction from the biological sequences, predictor construction with a machine learning method, and performance evaluation. However, proposing a sequence analysis method is costly for a researcher without a computing background. Although many web servers and stand-alone tools have been developed to implement all or part of the model construction process, the functions they provide are limited. To address these gaps, this thesis investigated biological sequence analysis problems and proposed three methods, one for each of three such problems. Finally, it proposed a general analysis platform for handling bioinformatics problems.

In this study, the steps for solving biological sequence analysis problems with machine learning were investigated systematically. Typical feature extraction methods for biological sequences were explored and classified. Machine learning methods that are popular in bioinformatics for predictor construction, including support vector machines and random forests, were also studied. Furthermore, methods for evaluating predictor performance, such as cross-validation and bootstrapping, were investigated, and performance measures were also included within the scope of this thesis. This part of the study clarified the key steps for solving sequence analysis problems, providing a systematic theoretical basis for investigating sequence analysis methods and developing biological sequence analysis tools.

Based on this investigation, the thesis proposed three methods for three biological sequence analysis problems. For DNase I hypersensitive site prediction, a multi-feature integration method was proposed: three feature extraction methods were integrated to generate the feature vectors, a feature selection method removed redundant features, and a predictor was then constructed with a support vector machine and its performance evaluated. For microRNA precursor prediction, the thesis proposed a multi-feature ensemble method that applies ensemble learning to three predictors, each constructed with a different feature extraction method containing different kinds of features. For DNA-binding protein prediction, a method based on ensemble learning was proposed. In this method, the Distance Pair method was first improved by incorporating more evolutionary information based on the protein frequency profile; to further enhance performance, ensemble learning was then used to combine this method with another method based on sequence information. Experiments and analysis of the results showed that these three methods achieve better results than the state-of-the-art methods in their respective areas, which suggests good application prospects. These methods also indicated the significant role that the investigation of biological sequence analysis plays in solving sequence analysis problems.

Converting theoretical investigation into practical tools is key to solving biological sequence analysis problems. A general biological sequence analysis platform based on machine learning was therefore developed. The platform contains many typical feature extraction methods for biological sequences, several machine learning algorithms, and several methods for evaluating predictor performance. Feature selection methods and strategies for handling imbalanced datasets were also implemented. The result is a general platform with comprehensive functions that can be used to handle different kinds of biological sequence analysis problems.
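
To make the three steps named in the abstract concrete (feature extraction, predictor construction, performance evaluation), here is a minimal Python sketch. It is not the thesis's method or the platform's API. All function names are hypothetical. The k-mer frequency feature is only one common choice of sequence feature, and the nearest-centroid classifier is a deliberately simple stand-in for the support vector machine and random forest predictors the abstract mentions, chosen so the example needs only the standard library. The toy sequences in the usage note are invented for illustration.

```python
from itertools import product
from math import sqrt


def kmer_features(seq, k=2, alphabet="ACGT"):
    """Step 1 (feature extraction): normalized k-mer frequency vector."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:  # skip k-mers containing unknown symbols
            counts[sub] += 1
    total = sum(counts.values()) or 1
    return [counts[m] / total for m in kmers]


def nearest_centroid_fit(X, y):
    """Step 2 (predictor construction): one mean vector per class.
    Assumption: a simple stand-in for SVM or random forest."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def k_fold_indices(n, folds):
    """Yield (train, test) index lists; sample i goes to fold i % folds."""
    for f in range(folds):
        test = [i for i in range(n) if i % folds == f]
        train = [i for i in range(n) if i % folds != f]
        yield train, test


def confusion_metrics(y_true, y_pred):
    """Step 3 (performance evaluation): common binary measures (labels 0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


def cross_validate(seqs, labels, folds=2, k=2):
    """Run the three steps under k-fold cross-validation."""
    X = [kmer_features(s, k) for s in seqs]
    y_true, y_pred = [], []
    for train, test in k_fold_indices(len(X), folds):
        model = nearest_centroid_fit([X[i] for i in train],
                                     [labels[i] for i in train])
        for i in test:
            y_true.append(labels[i])
            y_pred.append(nearest_centroid_predict(model, X[i]))
    return confusion_metrics(y_true, y_pred)
```

A call such as `cross_validate(["AAAATTTT", "ATATATAT", "GGGGCCCC", "GCGCGCGC"], [1, 1, 0, 0], folds=2)` returns a dictionary of accuracy, sensitivity, specificity, and Matthews correlation coefficient pooled over the held-out folds. A real study would replace the toy data with a benchmark dataset and would likely swap in a stronger classifier.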
Keywords/Search Tags:biological sequence analysis method, feature extraction, machine learning, biological sequence analysis platform