Research On Support Vector Machine Technology In Biologic Data Analyses

Posted on:2008-10-07

Degree:Doctor

Type:Dissertation

Country:China

Candidate:J L Liu

Full Text:PDF

GTID:1118360215994799

Subject:Computer application technology

Abstract/Summary:

Support vector machine (SVM) is a relatively new data mining technology based on statistical learning theory. SVM solves complicated machine learning problems by means of optimization methods. It is well suited to problems involving small samples, nonlinearity, high dimensionality, and local minima, and it generalizes well. It helps avoid the overfitting problem and the curse of dimensionality.

In the 21st century, life science is developing rapidly, and the volume of biological data has grown exponentially. Life scientists therefore focus on analyzing and mining these data to find biological patterns. The study of functional regions in the human genome is an important research field. For a given DNA sequence segment, determining whether it is an intergenic region or a gene region is an essential precondition for further analysis. Developing effective analytical algorithms has become an important means of speeding up the analysis and understanding of biological information. Many gene prediction programs exist, but most of them cannot predict whole gene structures.

This dissertation applies SVM and several other machine learning methods to biological data classification and systematically studies classification techniques for biological data based on statistical learning theory. It also presents performance comparisons and evaluations of the different learning methods.

The major contributions of statistical learning theory are the structural risk minimization (SRM) principle and the SVM learning method built on it. SRM has been shown to be superior to the traditional empirical risk minimization (ERM) principle. SRM minimizes the sum of the empirical risk and the confidence interval, which together bound the expected risk, whereas ERM minimizes only the error on the training data.
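
For reference, the contrast between SRM and ERM is commonly written as the following bound from statistical learning theory. The symbols are not used in the abstract and are introduced here only for illustration: R is the expected risk, R_emp the empirical (training) risk, h the VC dimension of the function class, n the number of training samples, and the bound holds with probability at least 1 - \eta.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

ERM minimizes only the first term on the right-hand side, while SRM chooses the function class so that the sum of both terms is small.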
This difference indicates that SVM has greater generalization ability, which is the goal of statistical learning.

The first step of machine learning is to extract training attributes with discriminative features from very long DNA sequences. Given the complexity of DNA data, a linguistics-based feature extraction method is introduced. Two-class problems are considered here. All short sequences of length 2 to 6 are treated as candidate training attributes. For each candidate, three quantities are calculated: its frequency in every functional sequence of the DNA sequence set, its frequency in the sequence set as a whole, and the relative difference between the two class sets. These quantities determine whether the candidate is chosen as a training attribute. Functional sequences are thereby mapped into Euclidean space, with each functional sequence corresponding to a vector.

Based on SVM, this dissertation develops software for whole gene recognition whose prediction accuracy exceeds 85% without any knowledge from the field of biology. In addition, practical experience in selecting training parameters is provided. When the training data are not well understood, varying the penalty factor C of C-SVC from large to small reaches the optimal result more quickly.

This dissertation verifies the advantages of the SVM training method by comparing it with other learning methods. For DNA sequence classification problems, comparisons of SVM with binary logistic regression (BLR) and with artificial neural networks (ANN) show that SVM achieves better classification accuracy and needs less training time.

This dissertation briefly discusses parallel SVM. A genetic algorithm (GA) is introduced into parallel SVM to exploit the inherently parallelizable features of SVM and GA.

This dissertation applies SVM to biological data classification and obtains good experimental results.
It provides a basis for using the method to solve other biological data classification problems, since biological data are similar on the whole but diverse and complex in individual cases. It also provides a basis for using the SVM learning method in other applications to solve complex classification problems.
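
The feature extraction step described in the abstract (short subsequences of length 2 to 6 as candidate attributes, selected by how differently they occur in the two classes) could be sketched roughly as follows. This is a minimal illustration and not the dissertation's implementation: the function names, the pooled per-class frequency, the relative-difference score |a - b| / max(a, b), and the top-N cutoff are all assumptions made for the example.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count overlapping length-k subsequences (k-mers) in one sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def class_frequencies(seqs, k_min=2, k_max=6):
    """Relative frequency of each k-mer, pooled over all sequences of a class.

    Frequencies are normalised separately for each length k.
    """
    freqs = {}
    for k in range(k_min, k_max + 1):
        counts = Counter()
        for s in seqs:
            counts.update(count_kmers(s, k))
        total = sum(counts.values())
        for kmer, c in counts.items():
            freqs[kmer] = c / total
    return freqs


def select_features(class_a, class_b, n_features, k_min=2, k_max=6):
    """Keep the candidates whose frequencies differ most between the classes.

    Assumed score: |fa - fb| / max(fa, fb), which lies in [0, 1].
    Ties are broken alphabetically so the result is deterministic.
    """
    fa = class_frequencies(class_a, k_min, k_max)
    fb = class_frequencies(class_b, k_min, k_max)
    scores = {}
    for kmer in set(fa) | set(fb):
        a, b = fa.get(kmer, 0.0), fb.get(kmer, 0.0)
        scores[kmer] = abs(a - b) / max(a, b)
    return sorted(scores, key=lambda m: (-scores[m], m))[:n_features]


def vectorize(seq, features):
    """Map a sequence to a vector of per-feature k-mer frequencies."""
    vec = []
    for kmer in features:
        windows = len(seq) - len(kmer) + 1
        hits = sum(1 for i in range(windows) if seq.startswith(kmer, i))
        vec.append(hits / windows if windows > 0 else 0.0)
    return vec


if __name__ == "__main__":
    # Toy sequences, invented for the example only.
    genes = ["ACGCGCGT", "CGCGCGAA"]
    intergenic = ["ATATATAT", "TTATATAA"]
    feats = select_features(genes, intergenic, n_features=4)
    print(feats, vectorize("ACGCGATA", feats))
```

The resulting vectors are ordinary points in Euclidean space and could then be passed to any SVM trainer, for example a C-SVC whose penalty factor C is swept from large to small values as the abstract suggests.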

Keywords/Search Tags:

machine learning, pattern recognition, support vector machine, DNA sequence, parallel computing

Related items

1	Research On Some Problems Of Support Vector Machine Learning Algorithm
2	Study On Application Of Machine Learning Based On Support Vector Machine
3	Design Of Support Vector Machine Accelerator Based On Reconfigurable Computing Platform
4	ThunderSVM:A Fast Parallel Support Vector Machine Library
5	Research Of Parallel Computing Apply In Support Vector Machine Under MATLAB Cluster
6	Research On Non-parallel Support Vector Machines For Noise Classification
7	Research On Network Traffic Classification Technology Based On Support Vector Machine
8	Research For Non-linear Support Vector Machine Classification Algorithm Based On MapReduce
9	Some Empirical Research On Statistical Machine Learning
10	Study Of Support Vector Machine And Its Application In Cancer Diagnoses