
The Study On Classification And Prediction For Biological Protein Sequential Data

Posted on: 2008-10-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Liu
GTID: 1100360215476849
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
Sequential data are a special kind of data in data mining and occur widely in diverse fields. How to extract or mine knowledge from large amounts of sequential data is a new research topic of both theoretical and practical importance. This dissertation studies classification and prediction for sequential data, with a focus on biological sequences. The many analysis algorithms for sequential data can be summarized as two kinds of methods: those based on feature extraction and those based on similarity. On the one hand, using feature extraction, we extract different features for different kinds of sequences, namely membrane proteins and signal peptides. On the other hand, we propose a similarity based on global alignment for prediction and then embed the similarity into a kernel space to improve stability. With these methods, a feature vector can be obtained, and the similarity-based method is unified with the feature extraction method. Feature reduction is also studied, so that sequential data can be visualized. The innovative ideas in this dissertation are as follows:
(1) Building on traditional pattern recognition algorithms, we extract various features according to the type of sequence and then train a classifier to predict new samples. Membrane proteins are first encoded as discrete-time series sampled at different sampling intervals, and the series are then analyzed using digital signal processing theory. This method avoids the loss of sequence-order information that occurs in other algorithms. In the frequency domain, we extract low-frequency features, both magnitude and phase, to represent the main information in the series and to reduce noise.
The experiments illustrate the performance of low-frequency spectrum feature extraction for predicting membrane protein types.
(2) For short sequences such as signal peptides, a sliding window is adopted to transform variable-length sequences into fixed-length segments, and complex coupling effects are found using mutual information, whereas many earlier algorithms blindly simplify that information. A multi-decision tree is then proposed to extract statistical rules for predicting signal peptides and their cleavage sites. The experiments give promising results.
(3) Taking similarity as the foundation for classification, we define a similarity model based on global alignment, which avoids the shortcomings of sliding-window methods, such as the imbalance problem. By analyzing its mathematical characteristics, the similarity is proved to be a kind of measure. When applied to predicting signal peptides, the similarity achieves a stable, high prediction rate. The results demonstrate that the defined similarity represents the relationship between sequences well and provides a suitable form for them. An online bioinformatics web server is also available to promote the development of biological science.
(4) A study of the indefinite kernel. First, we analyze the difference between the similarity based on global alignment and the traditional Euclidean distance. We then propose an indefinite kernel algorithm and apply it to predicting signal peptides. In addition, a feature vector can be obtained with a high prediction rate, and the similarity-based method is unified with the feature extraction method. Experiments confirm the performance.
(5) A study of reducing the data dimension and extracting useful features for classification. Making full use of the null space of the within-class scatter matrix, we propose Separated Space based Linear Discriminant Analysis (SSLDA), which avoids the instability of traditional LDA.
For signal peptides, we apply SSLDA to the high-dimensional representation obtained from the indefinite kernel based on global alignment similarity and obtain a reduced dimension. Sequential data can also be visualized...
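
The encoding and low-frequency feature extraction described in item (1) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the abstract does not say how residues are mapped to numbers, so the Kyte-Doolittle hydropathy scale is assumed here, and the names `encode`, `low_freq_features`, `step` and `n_coeffs` are hypothetical. The sketch only shows how a variable-length sequence can yield a fixed-length vector of low-frequency magnitudes and phases.

```python
import cmath

# ASSUMPTION: Kyte-Doolittle hydropathy values as the numeric encoding.
# The abstract does not specify which encoding the dissertation uses.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(seq, step=1):
    """Turn a protein sequence into a discrete-time series.

    `step` is a hypothetical sampling interval: every `step`-th residue
    is kept, starting from the first.
    """
    return [HYDROPATHY[residue] for residue in seq[::step]]


def low_freq_features(signal, n_coeffs=4):
    """Return magnitudes and phases of the first `n_coeffs` DFT coefficients.

    Uses a direct O(n * n_coeffs) discrete Fourier transform so that the
    sketch depends only on the standard library.
    """
    n = len(signal)
    magnitudes, phases = [], []
    for k in range(n_coeffs):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        magnitudes.append(abs(coeff))
        phases.append(cmath.phase(coeff))
    return magnitudes, phases


# Usage: sequences of different lengths give vectors of the same length.
mags, phs = low_freq_features(encode("MKLLVVAGSTALK"), n_coeffs=4)
feature_vector = mags + phs  # 8 numbers, independent of sequence length
```

Keeping only the first few coefficients is what makes the feature vector length independent of the sequence length; how many coefficients to keep is a tuning choice that the abstract does not state.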
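
The similarity based on global alignment in items (3) and (4) can likewise be sketched under stated assumptions. The abstract does not define the similarity model, so the code below uses the standard Needleman-Wunsch global alignment score with a hypothetical scoring scheme (match +1, mismatch -1, linear gap -1) as a stand-in. The functions `global_alignment_score`, `similarity_matrix` and `nearest_neighbor_label` are illustrative names, not taken from the dissertation.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    ASSUMPTION: the scoring values are placeholders; the dissertation's
    actual similarity model is not given in the abstract.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            ))
        prev = curr
    return prev[-1]


def similarity_matrix(seqs):
    """Pairwise similarity matrix, usable as a (possibly indefinite) kernel."""
    return [[global_alignment_score(s, t) for t in seqs] for s in seqs]


def nearest_neighbor_label(query, train_seqs, train_labels):
    """Predict the label of the most similar training sequence."""
    scores = [global_alignment_score(query, s) for s in train_seqs]
    return train_labels[scores.index(max(scores))]


# Usage with made-up toy sequences and labels.
train = ["MKLLA", "GGSGG"]
labels = ["signal", "other"]
predicted = nearest_neighbor_label("MKLLV", train, labels)
```

A matrix of alignment scores is symmetric but is not guaranteed to be positive semidefinite, which is one reason such a similarity leads to an indefinite kernel instead of a standard Mercer kernel. Each row of the matrix can also be read as a feature vector for the corresponding sequence, which is one plausible way for a similarity-based method to feed a feature-based classifier or a dimension reduction step.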
Keywords/Search Tags:pattern recognition, bioinformatics, sequential data, classification and prediction, feature extraction, similarity, linear dimension reduction, digital signal processing, mutual information, decision tree, membrane protein, signal peptide