Research On Feature Representation And Dimension Reduction Algorithm In Protein Subcellular Localization

Posted on: 2019-03-24
Degree: Master
Type: Thesis
Country: China
Candidate: Y T Yue
Full Text: PDF
GTID: 2370330548474404
Subject: Computer application technology
Abstract/Summary:
With the advent of the post-genome era, research on protein subcellular localization, an important branch of proteomics, is growing rapidly. In this field, the feature representation derived from a protein's amino acid sequence plays a crucial part; to a large extent it determines prediction quality. After extracting features, researchers usually face the "small sample vs. high dimensionality" problem. Applying a dimensionality reduction algorithm to the high-dimensional feature representation is therefore necessary to reduce computational complexity, suppress noise in the data, and improve robustness on small-sample datasets. To this end, this thesis studies both the feature representation and the dimensionality reduction algorithm in protein subcellular localization. The main work and innovations of the thesis are summarized below.

1. There are four fundamental protein feature representations: the amino acid composition (AAC), the dipeptide composition (DipC), the pseudo-amino acid composition (PseAAC) and the position-specific scoring matrix (PSSM); their classification performance increases in that order. Constructing a more informative feature representation is one effective way to improve the prediction accuracy of protein subcellular location. With this in mind, the thesis proposes a new integration model that first weights several single feature representations and then adds them together to form a composite feature representation. The experimental results demonstrate that the composite representation contains more information than the single representations fused into it. The thesis also proposes a new feature representation based on the PSSM matrix, named the correlation position-specific scoring matrix (CoPSSM). In experiments, the classification performance of CoPSSM is superior to that of the commonly used PseAAC and PSSM.

2. Kernel principal component analysis (KPCA) and kernel linear discriminant analysis (KLDA) are two commonly used nonlinear dimensionality reduction algorithms. In practice, the choice of kernel function and of kernel parameters has a marked impact on the dimension reduction effect. Motivated by this, the thesis first studies how the dimension reduction effect differs between a single kernel function and a composite kernel function. Then, to select the optimal kernel window width parameter, it proposes a new distance discriminant criterion, which turns the unsupervised KPCA into a semi-supervised dimensionality reduction method. Next, it puts forward a new optimization algorithm built on the genetic algorithm (GA), named the dichotomous greedy genetic algorithm (DGGA), which searches for the bandwidth parameter of KLDA in combination with the proposed distance discriminant criterion, and then predicts protein subcellular location. The experimental results show that the proposed distance discriminant criterion and the DGGA optimization algorithm are effective.
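The weighted integration of single feature representations described in item 1 can be illustrated with a minimal sketch. It computes AAC and DipC from a sequence and fuses them. The abstract does not say how the weighted representations are combined or how the weights are chosen, so the weighted concatenation in the hypothetical `fuse` helper, the weights 0.4 and 0.6, and the example sequence are all assumptions for illustration, not the thesis's actual model.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(seq):
    """Amino acid composition: relative frequency of each of the 20 residues."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    n = len(residues)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / n for a in AMINO_ACIDS]


def dipc(seq):
    """Dipeptide composition: relative frequency of each adjacent residue pair."""
    s = seq.upper()
    counts = {d: 0 for d in DIPEPTIDES}
    total = 0
    for a, b in zip(s, s[1:]):
        # Skip pairs that contain a non-standard residue.
        if a in AMINO_ACIDS and b in AMINO_ACIDS:
            counts[a + b] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def fuse(features, weights):
    """ASSUMPTION: fuse by scaling each feature vector and concatenating."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    fused = []
    for vec, w in zip(features, weights):
        fused.extend(w * v for v in vec)
    return fused


# Hypothetical sequence and weights, for illustration only.
sequence = "MKVLAAGIVGL"
composite = fuse([aac(sequence), dipc(sequence)], [0.4, 0.6])
print(len(composite))  # 20 AAC entries followed by 400 DipC entries
```

In practice the weights would be tuned on held-out data, and PseAAC or PSSM-derived vectors could be passed to `fuse` in the same way.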
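Item 2's idea of using class labels to pick a kernel bandwidth can be sketched as follows. The abstract does not define the distance discriminant criterion, so the score below (mean between-class minus mean within-class squared distance in the RBF feature space) is a stand-in assumption, and a plain grid search replaces DGGA. The function names, toy data, and candidate bandwidths are hypothetical.

```python
import math


def rbf(x, y, sigma):
    """Gaussian (RBF) kernel with bandwidth sigma."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def separation_score(X, labels, sigma):
    """ASSUMED criterion: between-class minus within-class mean distance.

    For an RBF kernel k(x, x) = 1, so the squared feature-space distance
    is ||phi(x) - phi(y)||^2 = 2 - 2 * k(x, y).
    """
    within, between = [], []
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d2 = 2.0 - 2.0 * rbf(X[i], X[j], sigma)
            if labels[i] == labels[j]:
                within.append(d2)
            else:
                between.append(d2)
    if not within or not between:
        raise ValueError("need both same-class and different-class pairs")
    return sum(between) / len(between) - sum(within) / len(within)


def select_bandwidth(X, labels, candidates):
    """Grid search (not DGGA): return the candidate with the highest score."""
    return max(candidates, key=lambda s: separation_score(X, labels, s))


# Toy labeled samples: two well-separated classes in 2-D.
X = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels = [0, 0, 1, 1]
best_sigma = select_bandwidth(X, labels, [0.01, 1.0, 100.0])
print(best_sigma)
```

A very small bandwidth makes every pair look equally distant and a very large one makes every pair look identical, so a label-aware score of this kind favors an intermediate value. The selected bandwidth would then be passed to KPCA or KLDA.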
Keywords/Search Tags: Protein subcellular localization, Feature representation, Nonlinear dimensionality reduction, Distance discriminant criterion, DGGA