
Research On Relevant Problems Of Protein Subcellular Localization Prediction

Posted on: 2007-07-31 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: Q B Gao
GTID: 1100360215970507 | Subject: Control Science and Engineering
Abstract/Summary:
One important task in proteomics is to explore how proteins perform and regulate the activities of an organism at the cellular level, and to study the relationship between a protein's function and its subcellular localization and environment. Such studies give researchers a clearer understanding of the intrinsic characteristics of cell activity. Because predicting protein subcellular localization can supply important information about protein function, it has become a hot topic in bioinformatics. This dissertation addresses that topic through studies on protein sequence encoding, feature selection, classification algorithms, and the recognition of signal peptide cleavage sites. The main contents and contributions of the dissertation are summarized as follows:

(1) The research on protein sequence encoding schemes

Sequence encoding is the basis for subsequent analysis by computational algorithms and has a critical influence on the prediction performance of a system. It also helps in mining biologically useful knowledge from protein datasets. Many encoding methods have been proposed, most of them based on combining multiple feature sources, but so far none has been found to be very effective for characterizing proteins. This research therefore constructs a hybrid feature set that encodes protein sequences using auto-correlation functions and 10 physicochemical properties of amino acids. Amino acid composition and dipeptide composition are also incorporated in this set. Based on these features, an AAindex-based method for protein subcellular localization prediction is proposed. The auto-correlation function is a protein feature measure based on amino acid indices; it not only captures the coupling effect between amino acids but also takes the sequence length of the protein into account.
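The hybrid encoding described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the auto-correlation is taken in one commonly used form, r_n = (1/(L-n)) * sum of h_i * h_(i+n), where h_i is the index value of residue i and L is the sequence length; the Kyte-Doolittle hydropathy scale stands in for the ten physicochemical properties, which the abstract does not list; and the function names and the `max_lag` parameter are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values, used only as an example amino acid index.
# Assumption: the ten properties used in the dissertation are not given in the abstract.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 amino acids (20 features)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Fraction of each of the 400 ordered residue pairs (400 features)."""
    total = len(seq) - 1
    counts = {}
    for i in range(total):
        pair = seq[i:i + 2]
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(a + b, 0) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


def autocorrelation(seq, index, max_lag):
    """r_n = 1/(L-n) * sum_i h_i * h_{i+n} for n = 1..max_lag (assumed form)."""
    h = [index[a] for a in seq]
    length = len(h)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [
        sum(h[i] * h[i + n] for i in range(length - n)) / (length - n)
        for n in range(1, max_lag + 1)
    ]


def encode(seq, indices, max_lag=3):
    """Concatenate composition, dipeptide composition and auto-correlation terms."""
    features = composition(seq) + dipeptide_composition(seq)
    for index in indices:
        features += autocorrelation(seq, index, max_lag)
    return features


# Arbitrary example sequence, not taken from any dataset in the dissertation.
vector = encode("MKKLLPTAAAGLLLLAAQPAMA", [HYDROPATHY], max_lag=3)
print(len(vector))
```

Dividing by L - n is what makes each term depend on the sequence length, and the product of index values at residues n positions apart is one way to express the coupling between amino acids.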
The auto-correlation features can therefore capture additional information that amino acid composition and dipeptide composition miss. Using the proposed encoding, we predict protein subcellular localization with the nearest neighbor algorithm. The experimental results show that the method achieves satisfactory performance and is competitive with other methods, which indicates that it is efficient and effective.

(2) The research on protein feature selection

Whether training a classifier or recognizing an unknown sample, researchers must extract suitable features to represent the samples. In some cases, however, the dimensionality of the feature set is very high. If all features are used without selection, recognition slows down, the recognition rate may degrade, and the system may even suffer from the curse of dimensionality. It is therefore essential to perform feature selection before classification. This research investigates the prediction of protein subcellular localization and the classification of G protein-coupled receptors (GPCRs) using feature selection techniques, and proposes a support vector machine (SVM)-based filter approach and a wrapper approach, respectively, for protein feature selection. The selected features are then used to classify protein sequences. The purpose of feature selection is to eliminate irrelevant or redundant features and find a more compact feature set, and thereby to make the prediction results easier to interpret. The experimental results show that the proposed methods select features that improve both the prediction speed and the performance of the system, which demonstrates their effectiveness.

(3) The research on algorithms for protein classification

Computational algorithms play an important role in every branch of bioinformatics. For the same dataset and feature set, the choice of algorithm can influence the prediction results dramatically.
Instance-based learning methods, such as the nearest neighbor algorithm, are widely used in machine learning research. In some practical bioinformatics applications, however, training samples are very limited, which restricts the performance of nearest neighbor. Building on nearest neighbor, this research introduces two novel pattern classification algorithms, the nearest feature line method and the tunable nearest neighbor method, and uses them to predict protein subcellular localization. The prediction results show that both achieve a higher recognition rate than the nearest neighbor method, which demonstrates their effectiveness, especially for small-sample problems. They improve the prediction performance of the system by expanding the representational capacity of the available samples.

The main drawback of the two methods is their high computational complexity, so they are not suitable for problems with large sample sizes. To shorten the computation time, we propose a center-based nearest neighbor method. Compared with the nearest feature line method, it decreases the computational complexity dramatically without noticeably degrading the recognition rate. When applied to the prediction of protein subcellular localization, it also achieves a higher recognition rate than the nearest neighbor method, which demonstrates its effectiveness.

(4) The research on methods for recognizing the cleavage site of signal peptides

Signal peptides control the entry of virtually all proteins to the secretory pathway, in both eukaryotes and prokaryotes. They comprise the N-terminal part of the amino acid chain and are cleaved off while the protein is translocated through the membrane. Because the public databanks contain numerous proteins that have not yet been characterized, the automatic recognition of signal peptides and their cleavage sites is of interest to many researchers.
This research studies the recognition of signal peptide cleavage sites in Escherichia coli using a hidden Markov model (HMM)-based method. The prediction procedure uses the statistical properties of signal peptides and the coupling rules of the amino acids adjacent to the cleavage site. Combining this biological knowledge with the constructed HMM, we add a filtering step to improve the recognition rate. The experimental results show that the method reaches an overall accuracy of 85.6% in the leave-one-out cross-validation (LOOCV) test, which demonstrates its effectiveness.
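To make the classifiers compared in part (3) more concrete, the sketch below contrasts plain nearest neighbor with the nearest feature line rule, in which a query is assigned to the class owning the closest straight line through two of its training samples. The center-based variant shown here assumes that each line joins a training sample to its class mean; the abstract does not spell out the construction, so treat that as an assumption. All function names and the toy data are hypothetical and unrelated to the dissertation's datasets.

```python
import math
from collections import defaultdict
from itertools import combinations


def line_distance(x, a, b):
    """Distance from x to the infinite line through a and b."""
    ab = [bj - aj for aj, bj in zip(a, b)]
    ax = [xj - aj for aj, xj in zip(a, x)]
    denom = sum(v * v for v in ab)
    if denom == 0.0:
        # a and b coincide: the "line" degenerates to a single point.
        return math.dist(x, a)
    mu = sum(u * v for u, v in zip(ax, ab)) / denom
    projection = [aj + mu * v for aj, v in zip(a, ab)]
    return math.dist(x, projection)


def group_by_class(train):
    groups = defaultdict(list)
    for point, label in train:
        groups[label].append(point)
    return groups


def nearest_neighbor(x, train):
    """Label of the single closest training point."""
    return min(train, key=lambda item: math.dist(x, item[0]))[1]


def nearest_feature_line(x, train):
    """Label of the class owning the closest line through two of its samples."""
    best_label, best_dist = None, math.inf
    for label, points in group_by_class(train).items():
        # A class with one sample has no pairs; fall back to the point itself.
        pairs = list(combinations(points, 2)) or [(points[0], points[0])]
        for a, b in pairs:
            d = line_distance(x, a, b)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


def center_based_nearest_neighbor(x, train):
    """Assumed variant: lines join each training sample to its class mean."""
    best_label, best_dist = None, math.inf
    for label, points in group_by_class(train).items():
        center = [sum(coords) / len(points) for coords in zip(*points)]
        for p in points:
            d = line_distance(x, p, center)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label


# Toy two-class data with only two samples per class.
train = [((0.0, 0.0), "A"), ((4.0, 0.0), "A"),
         ((1.0, 2.0), "B"), ((3.0, 2.0), "B")]
query = (2.0, 0.5)
print(nearest_neighbor(query, train),
      nearest_feature_line(query, train),
      center_based_nearest_neighbor(query, train))
```

With n samples in a class, the feature line rule examines n(n-1)/2 lines while the center-based variant, as sketched, examines only n, which is consistent with the reduction in computation time described in part (3).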
Keywords/Search Tags: Proteomics, Bioinformatics, Protein subcellular localization, AAindex encoding, Feature selection, Support vector machine, Instance-based learning