
The Characteristic Expression Of Protein Subcellular Localization And Classification Algorithm

Posted on: 2007-11-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J Y Shi
Full Text: PDF
GTID: 1118360218457114
Subject: Pattern Recognition and Intelligent Systems
Abstract/Summary:
As one of the most important areas of the post-genome era, proteomics aims to understand proteins' potential roles, elucidate their interactions in a cellular context, and provide the corresponding functional annotation. Determining the subcellular location of a protein is essential to its functional annotation. However, biological experiments for determining protein subcellular localization can hardly keep up with demand, so more effective methods are needed. Based on modern theories and methods of statistical pattern recognition, this dissertation studies feature representation, classification algorithms, multi-class classification, and the handling of imbalanced datasets for the prediction of protein subcellular localization. The main contributions are as follows:

1. A feature representation, the moment descriptor (MD), is proposed, and the performance of three multi-class approaches for support vector machines (SVM) is analyzed in terms of recognition rate, number of support vectors, and training and testing time. From the viewpoint of statistical theory, the method starts from amino acid composition (AAC) and adds information about each amino acid's position in the protein sequence: the amino acid coordinate mean (AAM) and coordinate variance (AAV) represent, respectively, the expectation and variance of its position in the sequence. Experiments on two classical databases show that MD represents the positional information of amino acid residues in a protein sequence more effectively.

2. A feature representation, amino acid composition distribution (AACD), is proposed, and both an imbalance index and a training algorithm that weights the penalty coefficients are presented to analyze the prediction performance of SVM on imbalanced datasets. The method divides a protein sequence equally into multiple segments and then calculates the AAC of each segment in series. In this way, it not only shows the AAC of each segment but also reflects their interaction. The experiments show that the information from all segments is more useful than that of the whole sequence, and that AACD effectively represents the interaction of several segments of a protein sequence. In addition, the presented training algorithm mitigates the negative effect of the imbalance.

3. A feature representation, multi-scale energy (MSE), is proposed to handle the non-stationarity of protein physicochemical signals. The method encodes a protein sequence as a digital signal by mapping each residue to its numerical value in one amino acid index. Using a wavelet transform based on multi-resolution analysis, the mapped signal is decomposed with the Mallat decomposition algorithm. The square-root energy factors are then calculated and concatenated into a feature vector that represents the approximation and detail information of the signal. Experiments show that MSE represents the physicochemical properties of proteins more effectively and has lower computational complexity than other methods.

4. Based on the multiple classifier system (MCS), a novel method for predicting protein subcellular localization is introduced to deal with high dimensionality and disagreement among multiple features. The method aggregates multiple groups of features, fuses the complementary information of patterns, and decreases the uncertainty of individual classifiers. Furthermore, it avoids both the difficulty of designing a single classifier and the high computational burden that come with high-dimensional vectors. The experimental results show that the method outperforms every individual classifier and is more effective and robust than other methods.
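To make the moment descriptor of contribution 1 concrete, the following is a minimal sketch rather than the dissertation's implementation. It assumes 1-based residue positions scaled by the sequence length and the population variance; the abstract does not specify either choice, and the function name `moment_descriptor` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def moment_descriptor(seq):
    """Return AAC + AAM + AAV (3 x 20 = 60 values) for one sequence."""
    n = len(seq)
    aac, aam, aav = [], [], []
    for aa in AMINO_ACIDS:
        # Assumption: positions are 1-based and scaled into (0, 1].
        pos = [(i + 1) / n for i, r in enumerate(seq) if r == aa]
        if not pos:
            # Residue absent: all three statistics are set to zero.
            aac.append(0.0)
            aam.append(0.0)
            aav.append(0.0)
            continue
        mean = sum(pos) / len(pos)
        var = sum((p - mean) ** 2 for p in pos) / len(pos)
        aac.append(len(pos) / n)  # composition
        aam.append(mean)          # coordinate mean
        aav.append(var)           # coordinate variance
    return aac + aam + aav
```

Two sequences with the same composition but different residue order get identical AAC values and different AAM/AAV values, which is the positional information the abstract refers to.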
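The segment-wise composition of contribution 2 (AACD) can be sketched as follows. This is an illustrative reading of "divides a protein sequence equally into multiple segments"; the boundary rule (integer division) and the default of four segments are assumptions, and the names `aac` and `aacd` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(segment):
    """Amino acid composition: fraction of each residue in the segment."""
    n = len(segment)
    return [segment.count(aa) / n if n else 0.0 for aa in AMINO_ACIDS]


def aacd(seq, n_segments=4):
    """Concatenate the AAC of n_segments near-equal consecutive segments."""
    n = len(seq)
    # Assumption: segment boundaries are placed by integer division.
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    features = []
    for start, end in zip(bounds, bounds[1:]):
        features.extend(aac(seq[start:end]))
    return features
```

With `n_segments=1` the result reduces to the ordinary whole-sequence AAC, so the segment count controls how much positional detail is retained.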
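The abstract does not give the formula for the weighted penalty coefficients of contribution 2. As one common possibility, the sketch below scales a base SVM penalty `C` inversely to class frequency, so that errors on rare classes cost more. This heuristic and the name `class_penalties` are assumptions, not the dissertation's algorithm.

```python
from collections import Counter


def class_penalties(labels, base_c=1.0):
    """Per-class penalty: base_c * N / (K * n_k).

    N is the number of samples, K the number of classes and n_k the
    size of class k, so smaller classes receive larger penalties.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: base_c * n / (k * cnt) for cls, cnt in counts.items()}
```

A dictionary of this shape can be passed to SVM implementations that accept per-class weights, for example the `class_weight` parameter of scikit-learn's `SVC`.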
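The multi-scale energy pipeline of contribution 3 can be sketched under stated assumptions. The abstract names neither the amino acid index nor the wavelet, so this example uses the Kyte-Doolittle hydropathy scale and the Haar wavelet, pads odd-length signals by repeating the last sample, and takes the "square-root energy factor" to be the root of the mean squared coefficient in each band. All of these are illustrative choices.

```python
import math

# Assumption: Kyte-Doolittle hydropathy is used as the amino acid index.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def haar_step(x):
    """One Mallat analysis step with the orthonormal Haar filters."""
    if len(x) % 2:
        x = x + [x[-1]]  # assumption: pad odd lengths with the last sample
    s = math.sqrt(2.0)
    pairs = list(zip(x[0::2], x[1::2]))
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return approx, detail


def root_energy(coeffs):
    """Assumed 'square-root energy factor': root mean squared coefficient."""
    return math.sqrt(sum(c * c for c in coeffs) / len(coeffs))


def multi_scale_energy(seq, levels=3):
    """Detail energies for each level, then the final approximation energy."""
    if not seq:
        raise ValueError("empty sequence")
    approx = [KD[r] for r in seq]  # non-standard residues raise KeyError
    features = []
    for _ in range(levels):
        if len(approx) < 2:
            break  # signal too short to decompose further
        approx, detail = haar_step(approx)
        features.append(root_energy(detail))
    features.append(root_energy(approx))
    return features
```

The feature vector has `levels + 1` entries for sufficiently long sequences, regardless of sequence length, which is what lets proteins of different lengths share one classifier input size.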
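For contribution 4, the abstract does not state which fusion rule the multiple classifier system uses. The sketch below shows plain majority voting over base classifiers, each assumed to be trained on a different feature group (for example MD, AACD, and MSE). It is a stand-in for illustration, not the dissertation's method, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label among the base classifiers' votes."""
    return Counter(votes).most_common(1)[0][0]


def fuse(per_classifier_predictions):
    """Fuse predictions given as one list of labels per base classifier.

    All lists must cover the same proteins in the same order.
    """
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]
```

For example, three base classifiers that disagree on individual proteins can be combined like this:

```python
predictions = [
    ["cyto", "mito", "nucl"],  # classifier trained on one feature group
    ["cyto", "nucl", "nucl"],  # classifier trained on a second group
    ["mito", "mito", "nucl"],  # classifier trained on a third group
]
fused = fuse(predictions)
```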
Keywords/Search Tags: Protein Subcellular Localization, Support Vector Machines, Multiple Classifier System, Moment Descriptor, Amino Acid Composition Distribution, Multi-Scale Energy