Research On Support Vector Machine Technology In Biologic Data Analyses

Posted on: 2008-10-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J L Liu
GTID: 1118360215994799
Subject: Computer application technology
Abstract/Summary:
Support vector machine (SVM) is a relatively new data mining technology based on statistical learning theory. SVM solves complicated machine learning problems by using optimization methods. It is powerful for problems involving small samples, nonlinearity, high dimensionality and local minima, and it has good generalization ability. It can largely avoid the overfitting problem and the curse of dimensionality.

In the 21st century, life science is developing rapidly and the volume of biologic data has increased exponentially. Life scientists therefore focus on analyzing and mining these data to find biologic patterns. The study of functional regions in the human genome is an important research field. For a given DNA sequence segment, determining whether it is an intergenic region or a gene region is an essential precondition for further analysis. Developing effective analytical algorithms has become one of the important means of speeding up the analysis and understanding of biologic information. Many gene prediction programs exist, but most of them cannot predict whole gene structures.

This dissertation mainly aims at applying SVM and some other machine learning methods to biologic data classification, and it systematically studies classification technology for biologic data based on statistical learning theory. Performance comparisons and evaluations of different learning methods are also presented.

The major contributions of statistical learning theory are the structural risk minimization (SRM) principle and the support vector machine (SVM) learning method based on it. SRM has been shown to be superior to the traditional empirical risk minimization (ERM) principle. SRM minimizes the sum of the empirical risk and the confidence interval, whereas ERM minimizes only the error on the training data.
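
As a hedged illustration of the SRM principle, one commonly quoted form of Vapnik's bound is given below. The notation is an assumption and does not come from the abstract: $R$ is the expected risk, $R_{\mathrm{emp}}$ the empirical (training) risk, $h$ the VC dimension of the function class, $n$ the number of training samples, and the bound holds with probability at least $1-\eta$.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
```

The square-root term plays the role of the confidence interval. ERM minimizes only the first term on the right-hand side, while SRM also controls $h$ so that the sum of both terms stays small.
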
This difference explains why SVM has greater generalization ability, which is the goal of statistical learning.

The first step of machine learning is to extract training attributes with classification features from very long DNA sequences. Considering the complexity of DNA data, a feature extraction method based on linguistics is introduced. Two-class problems are considered here. All short sequences of length 2 to 6 are regarded as candidate training attributes. For each candidate, three quantities are calculated: the frequency with which it appears in each functional sequence, its frequency in the whole DNA sequence set, and the relative difference of its frequencies between the two class sets. These quantities determine whether the candidate is chosen as a training attribute. Consequently, functional sequences are mapped to Euclidean space, and each functional sequence corresponds to a vector in that space.

Based on SVM, this dissertation has developed software for whole gene recognition whose prediction accuracy exceeds 85% without any knowledge from the biologic field. In addition, practical experience in selecting training parameters is provided. When the training data are not well understood, varying the penalty factor C of C-SVC from large to small can reach the optimal result more rapidly.

This dissertation verifies the advantages of the SVM training method by comparing it with other learning methods. For DNA sequence classification problems, comparisons of SVM with binary logistic regression (BLR) and with artificial neural networks (ANN) show that SVM has better classification accuracy and needs less training time.

This dissertation discusses parallel SVM briefly. A genetic algorithm (GA) is introduced into parallel SVM to exploit the inherently parallelizable features of SVM and GA.

This dissertation applies SVM to biologic data classification and obtains good experimental results.
It provides a basis for applying the method to other biologic data classification problems, since biologic data are similar on the whole but diverse and complex in individual cases. It also provides a basis for using the SVM learning method to solve complex classification problems in other applications.
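
The linguistics-based feature extraction described in the abstract (candidate subsequences of length 2 to 6, selected by how differently they occur in the two classes) might be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the function names, the pooled frequency definition, the relative-difference formula |pa - pb| / (pa + pb) and the threshold value are all hypothetical choices made for illustration.

```python
from collections import Counter


def kmer_frequencies(seqs, k_min=2, k_max=6):
    """Pooled relative frequency of every k-mer (k_min..k_max) over a list of sequences.

    Assumption: a k-mer's frequency is its count divided by the number of
    length-k windows in the pooled sequences.
    """
    counts = Counter()
    windows = Counter()  # number of sliding windows seen for each k
    for seq in seqs:
        for k in range(k_min, k_max + 1):
            n = len(seq) - k + 1
            if n <= 0:
                continue
            windows[k] += n
            for i in range(n):
                counts[seq[i:i + k]] += 1
    return {kmer: c / windows[len(kmer)] for kmer, c in counts.items()}


def select_attributes(class_a, class_b, threshold=0.5, k_min=2, k_max=6):
    """Keep candidate k-mers whose frequencies differ enough between the two classes.

    Assumption: relative difference = |pa - pb| / (pa + pb), which lies in [0, 1].
    """
    fa = kmer_frequencies(class_a, k_min, k_max)
    fb = kmer_frequencies(class_b, k_min, k_max)
    selected = []
    for kmer in sorted(set(fa) | set(fb)):
        pa, pb = fa.get(kmer, 0.0), fb.get(kmer, 0.0)
        # pa + pb > 0 because the k-mer occurs in at least one class
        if abs(pa - pb) / (pa + pb) >= threshold:
            selected.append(kmer)
    return selected


def to_vector(seq, attributes, k_min=2, k_max=6):
    """Map one sequence to a point in Euclidean space, one coordinate per attribute."""
    freqs = kmer_frequencies([seq], k_min, k_max)
    return [freqs.get(a, 0.0) for a in attributes]


if __name__ == "__main__":
    # Toy sequences, invented for illustration only
    genes = ["ATGGCGATGGCC", "ATGGCCATGAAA"]
    intergenic = ["TTTTATATATTT", "TATATTTTTATA"]
    attrs = select_attributes(genes, intergenic)
    print(len(attrs), "attributes selected")
    print(to_vector(genes[0], attrs)[:5])
```

The resulting vectors could then be passed to any SVM implementation; the choice of kernel and of the penalty factor C is left open here.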
Keywords/Search Tags: machine learning, pattern recognition, support vector machine, DNA sequence, parallel computing