Machine Learning Methods With Applications To Bioinformatics

Posted on: 2015-03-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yang
Full Text: PDF
GTID: 2250330428473755
Subject: Applied Mathematics
Abstract/Summary:
With the completion of the Human Genome Project (HGP), we are entering the post-genomic era. As biological data grows exponentially, extracting the information hidden in genomic text has become an urgent task. This thesis studies machine learning methods and their applications to bioinformatics. The main contents are as follows:
In Chapter 2, we propose a new 3-D graphical representation of a DNA sequence and prove that it has two properties: (1) there is no circuit in the graph; (2) there exists a one-to-one correspondence between a DNA sequence and the graph. Based on the 3-D graphical representation, we characterize a DNA sequence by a 12-dimensional vector whose components are normalized ALE-indexes of the corresponding L/L matrices. The proposed approach is tested by phylogenetic analysis on three datasets, and the experimental assessment demonstrates its efficiency.
In Chapter 3, by means of the characteristic sequences of a DNA sequence, we construct a 32k-dimensional complete word-based vector. We then present a feature selection scheme based on rough set theory (RST) to extract the most informative k-words, and use only these selected features to represent the DNA sequence. To evaluate the effectiveness of our method, we test it by phylogenetic analysis on five datasets. The first is used as a training set, from which the 869 top-ranked k-words are selected. The other four are each used as a testing set. The results demonstrate that the proposed method captures more important information and is more efficient for molecular phylogenetic analysis.
In Chapter 4, building on the idea of Chapter 3 and combining position information with the frequency itself, a 24-dimensional feature vector is constructed for the DNA template. A support vector machine (SVM) based method is then introduced to help evaluate PCR results. In the jackknife cross-validation test, our method achieves an accuracy of 92.59%.
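As a rough illustration of the k-word features that Chapters 3 and 4 build on, the sketch below counts overlapping k-words in a sequence and normalizes them into a frequency vector. It is an assumption-laden simplification: it works directly on the nucleotide alphabet rather than on the thesis's characteristic sequences, it does not include the rough-set feature selection or the position information, and the function name `kword_frequencies` is hypothetical.

```python
from itertools import product


def kword_frequencies(seq, k, alphabet="ACGT"):
    """Return normalized frequencies of all k-words over `alphabet`.

    The vector is ordered lexicographically with respect to `alphabet`.
    Hypothetical helper for illustration; not the thesis's exact feature.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    # Slide a window of length k over the sequence (overlapping words).
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing symbols outside the alphabet
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]
```

For an alphabet of size 4, the vector has 4^k components; a binary characteristic sequence could be handled by passing `alphabet="01"`.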
In Chapter 5, building on the ideas of Chapters 3 and 4, and taking into account the classifications of the amino acids, their physicochemical properties and the amino acid substitution matrix, a feature vector is constructed for a protein sequence. The nearest neighbor classifier is used as the prediction engine. We select two widely used datasets (ZW225 and CL317) to provide a comprehensive and unbiased comparison with previous studies of protein subcellular location. The results show that our method is effective.
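A minimal sketch of a nearest neighbor prediction engine, evaluated with a jackknife (leave-one-out) test of the kind mentioned for Chapter 4, might look as follows. The Euclidean distance, the function names, and the use of the jackknife test for this classifier are assumptions made for illustration; the abstract does not specify them.

```python
import math


def nearest_neighbor_label(query, samples):
    """Return the label of the sample whose vector is closest to `query`.

    `samples` is a list of (feature_vector, label) pairs.
    Euclidean distance is an assumption; the abstract names no metric.
    """
    best = min(samples, key=lambda s: math.dist(query, s[0]))
    return best[1]


def jackknife_accuracy(samples):
    """Leave each sample out in turn and predict it from the remaining ones."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(vector, rest) == label:
            correct += 1
    return correct / len(samples)
```

Each sample is predicted from all the others, so every sequence serves once as a test case and no separate test split is needed.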
Keywords/Search Tags: Bioinformatics, Graphical representation, k-word, Machine learning, Numerical characterization, Phylogenetic analysis, Polymerase chain reaction, Protein subcellular location