The Application Of Support Vector Machine To Operon Prediction

Posted on: 2009-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: X M Wang
Full Text: PDF
GTID: 2120360242480622
Subject: Computer application technology
Abstract/Summary:
With the development of the Human Genome Project and proteomics, bioinformatics has emerged and developed rapidly. Bioinformatics is an interdisciplinary field built on mathematics, computer science and life science. It uses methods and tools from these disciplines to obtain, store, analyze and transmit biological information, which helps in understanding biomedical experimental data.

In recent years, the data available to life science have exploded in both quantity and quality, driven by advances in sequencing, gene recombination and gene chips. At the same time, innovations in computer science and the internet have made it possible to store, process and transfer these huge data sets. Furthermore, the emphasis and breakthrough point of life science has shifted from accumulating data to verifying them through experiments. As a tool for handling data on a large scale, bioinformatics is therefore of great importance to research and development in life science.

The Human Genome Project, the greatest achievement of bioinformatics methods, is nearly complete. It is often said, somewhat portentously, that we are living in the "post-genomic" era. In this era, the emphasis (of sequence analysis especially) will shift from genes themselves to gene function and regulation, and core research topics will include genome diversity, genome regulation and the function of the corresponding protein products, and model organism genomes.

Operon prediction is crucial to constructing gene regulatory networks and studying whole genomes. Research on operon prediction is well developed, with approaches such as Bayesian methods, neural networks and genetic algorithms.
In addition, operon prediction can provide valuable information for biopharmaceutical research, protein function research and the study of regulation mechanisms.

In this paper, the operon prediction problem is first described, and then the current state of research on the problem is reported. The methods applied to operon prediction are described and their pros and cons are analyzed. The feature data used in this paper are extracted from genome data and then pre-processed. To compute the phylogenetic profiles, Symphony, an implementation of grid computing, is used to construct a cluster on which blastp runs in parallel. Because a Support Vector Machine (SVM) is used to build the classifier, its theory is outlined. Guided by this theory, we use the Matlab toolbox LS-SVM to construct and train a classifier. The experimental results show that the classifier has excellent learning ability and generalization ability. Its classification results are stable, with satisfactory sensitivity, specificity and accuracy.

The paper mainly works on the following subjects:

1. Feature data extraction and pre-processing

Feature data extraction and pre-processing are a key part of operon prediction. The feature data used in this paper are intergenic distance, phylogenetic profile and gene expression data obtained from microarray experiments. If two genes lie on the same strand and the start of the second gene follows the end of the first, their distance is computed from the start and end positions in the genome data; the intergenic distance is then pre-processed on the basis of entropy. The phylogenetic profile of a protein is a string of binary code (0 or 1), in which each bit represents the presence or absence of its homolog in one of the reference organisms. It can be computed by sequence comparison using BLAST software.
Both the Hamming distance and an entropy-based distance between phylogenetic profiles are evaluated. With a single input, the Hamming distance gives better prediction results. Gene expression data are the result of microarray experiments; the Pearson correlation coefficient is applied to extract co-regulation information, and a wavelet transform is used for pre-processing.

The pre-processing of the feature data is evaluated with both single and multiple inputs, and it turns out that pre-processing makes the classifier work better.

2. Grid computing support for computation

Computing the phylogenetic profiles requires sequence comparison between E. coli and hundreds of other genomes. Because the computing power of a single host is limited, a cluster is constructed with the help of Symphony, a grid computing product. Within this cluster, BLAST is integrated and runs in parallel on multiple hosts, which speeds up data processing.

3. SVM classifier

SVM-related problems are discussed in this paper, and Statistical Learning Theory (SLT) is outlined. The linearly separable and nonlinearly separable cases are treated separately, and both lead to a quadratic programming (QP) problem with an equality constraint and inequality constraints.

When SVM is applied to operon prediction with multiple inputs, the classifier must be optimized to perform better: the aim is to balance model complexity against generalization ability and to reduce both the empirical risk and the prediction error.

4. Result analysis

In this paper, leave-one-out cross-validation is used to evaluate the prediction results. With a single input, the effectiveness of pre-processing is validated; with multiple inputs, the contribution of each kind of feature data to operon prediction is worked out.

With the explosion of biological information, computing power needs to be strengthened as well.
However, scientific research groups often cannot afford expensive large-scale servers, and grid computing arrives at the right time to offer a complete solution to this problem. In addition, the solution manages all the resources within the cluster and allocates them to what the application really needs, which takes full advantage of otherwise unused resources to do more work.

The classifier, constructed under the guidance of SVM theory, has excellent learning ability and generalization ability.

Further research still has problems to resolve, such as the choice of feature data and pre-processing methods, the optimization of the classifier, and the use of the trained model for prediction on unknown genomes.
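
As a rough illustration of the three raw features named in the abstract (intergenic distance, Hamming distance between phylogenetic profiles, and Pearson correlation of expression data), the following Python sketch computes them for one pair of adjacent genes. It is not taken from the thesis: the function names, the (start, end, strand) tuple layout and the toy values are all assumptions made for this example, and the entropy-based and wavelet pre-processing steps are left out because the abstract does not specify them.

```python
from math import sqrt


def intergenic_distance(first, second):
    """Gap between two adjacent genes, from their coordinates.

    Each gene is a hypothetical (start, end, strand) tuple. Returns None
    when the genes lie on different strands; the value can be negative
    when the two genes overlap.
    """
    if first[2] != second[2]:
        return None
    return second[0] - first[1]


def hamming_distance(profile_a, profile_b):
    """Count the reference organisms where the two profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same reference organisms")
    return sum(a != b for a, b in zip(profile_a, profile_b))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy values, invented for illustration only.
gene_a = (100, 400, "+")
gene_b = (420, 900, "+")
profile_a = "1101001"  # one bit per reference organism
profile_b = "1100011"
expr_a = [1.0, 2.0, 3.0, 4.0]  # expression across four experiments
expr_b = [2.0, 4.0, 6.0, 8.0]

# One feature vector for the gene pair, as a classifier input might look.
features = (
    intergenic_distance(gene_a, gene_b),
    hamming_distance(profile_a, profile_b),
    pearson(expr_a, expr_b),
)
print(features)
```

A short gap, similar profiles and strongly correlated expression would all point toward the two genes sharing an operon; how these values are scaled and combined before training is a design decision of the classifier and is not shown here.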
Keywords/Search Tags:Application